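This post lists the latest papers retrieved from Arxiv.org on 2026-06-26. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: The daily paper data is retrieved from Arxiv.org and refreshed automatically at around 12:30 each day.
Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.
Contents
Overview (2026-06-26)
690 papers were updated today, including:
- Natural Language Processing: 97 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 208 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 125 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 163 papers (Machine Learning (cs.LG))
- Multiagent Systems: 11 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 22 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 33 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems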
[MA-0] Resilient Output Containment under Undisclosed Leader Dynamics and Actuator Attacks
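[Quick read]: This paper addresses resilient output containment control for heterogeneous linear multi-agent systems subject to actuator cyber-attacks over directed network topologies. The leaders generate bounded, locally absolutely continuous trajectories, but their dynamics, velocity bounds, and motion envelopes are undisclosed to the followers; the attack model covers state- and input-correlated false-data injection as well as bounded exogenous terms. The key to the solution is a continuous two-layer adaptive control architecture. The first layer is a virtual-actuator reconfiguration layer that uses partial state measurements to compensate for the effect of actuator attacks on the local tracking-error dynamics. The second layer is a network interface that generates task-space commands through an adaptive interaction protocol; it exchanges only neighbor states whose dimensions match the plant output, and it tunes its parameters without global graph knowledge. For directed graphs, under a leader-rooted united spanning-tree condition, a nonsmooth Lyapunov analysis proves asymptotic containment at the command level; the physical outputs then converge to the leaders' convex hull up to a residual set by the command-tracking local controllers. Simulations on a network of quadrotors with damped suspended loads illustrate the method's attack recovery and containment tracking performance.
Link: https://arxiv.org/abs/2606.27257
Authors: Mohammadreza Nematollahi, Khashayar Khorasani, Nader Meskin
Affiliations: Concordia University; Qatar University
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: 21 pages, 12 Figures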
Abstract:This work studies resilient output containment for heterogeneous linear multi-agent systems with actuator cyber-attacks over directed network topologies. The leaders generate bounded locally absolutely continuous trajectories; however, their dynamics, velocity bounds, and motion envelopes are undisclosed to the followers. The cyber-attack model includes state- and input-correlated, as well as bounded exogenous actuator false-data terms. A continuous two-layer adaptive control architecture is proposed. The first layer is a virtual-actuator reconfiguration layer that uses partial state measurements to compensate for actuator attacks in the local tracking-error dynamics. The second layer is a network interface that generates task-space commands via an adaptive interaction protocol. This protocol uses only neighbor-exchanged network-interface states whose dimensions match those of the plant output, and it does not require global graph knowledge for parameter tuning. For directed graphs, under a leader-rooted united spanning-tree condition, a nonsmooth Lyapunov analysis yields asymptotic containment at the command level. The physical outputs then converge to the leader convex hull up to a residual determined by the command-tracking local controllers. Simulation results using a network of quadrotors with damped suspended loads illustrate the performance of attack recovery and containment tracking.
[MA-1] Mostly Automatic Translation of Language Interpreters from C to Safe Rust
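[Quick read]: This paper tackles the challenge of automatically translating real-world C interpreter programs into safe Rust. The core difficulty is the significant gap between C and Rust in typing constraints, ownership, and borrowing rules, which matters most for interpreters because they handle untrusted input and are prone to memory-related vulnerabilities such as heap buffer overflows and use-after-free. The solution is Reboot, a mostly-automatic translation technique built on two mechanisms. First, feature reduction decomposes the translation by program features into a sequence of milestones, each a complete, testable program; translation starts from the simplest version and restores features incrementally, validating each milestone before moving on. Second, a multi-agent architecture coordinates multiple unreliable coding agents through automated validation and feedback, keeping long-running translation workflows on track with minimal human intervention. Reboot translated six real interpreters of 6k to 23k lines of C into safe Rust, each requiring only 1 to 11 user interventions; every translation passes 100% of the provided test suites and reaches 62%–92% pass rates on separately created validation tests. A security case study on mujs shows that memory vulnerabilities present in the C original are eliminated in the Rust translation. An ablation study shows that adding feature reduction improves validation-test pass rates by 6%–20% over multi-agent translation alone.
Link: https://arxiv.org/abs/2606.27122
Authors: Bo Wang, Brandon Paulsen, Joey Dodds, Daniel Kroening, Umang Mathur, Prateek Saxena
Affiliations: National University of Singapore; Amazon
Categories: Programming Languages (cs.PL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: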
Abstract:Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. Interpreter programs are particularly important targets for such translation, as they often handle untrusted inputs and suffer from memory-related vulnerabilities. We present Reboot, a mostly-automatic technique that translates real-world interpreter programs from C to safe Rust. Using Reboot, we have translated six interpreters ranging from 6k to 23k lines of C code to safe Rust, with each translation requiring only 1 to 11 brief user interventions. All translations pass 100% of the provided test suites, and achieve 62%–92% pass rates on separately created validation tests that were never exposed to the system. A security case study on mujs shows that memory vulnerabilities such as heap buffer overflows and use-after-free present in C are eliminated in the safe Rust translation. Two ideas underpin Reboot. First, feature reduction decomposes the translation by program features, creating a sequence of milestones where each is a complete, testable program; the translation starts from the simplest version and incrementally restores features, with each milestone validated before proceeding. Second, a multi-agent architecture orchestrates inherently unreliable coding agents through automated validation and feedback, keeping long-running translation workflows on track with minimal human involvement. An ablation study confirms that feature reduction improves translation correctness compared to using multi-agent translation alone, with 6%–20% improvements in pass rates on validation test suites.
[MA-2] Scalability of Morality: A Particle-Based Numerical Study on the Decoupling of Law and Ethics in Large-Scale Populations
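[Quick read]: This paper studies the systemic decoupling of moral norms from formal law in large-scale social systems, asking how decentralized social ethics can be sustained when population growth reduces local interactions between individuals and strains cognitive memory limits. It builds a particle-based computational framework that models individuals as discrete particles with finite memory capacity (L) and dynamically evolving, stochastic choice profiles (μ) regulated by non-linear social pressure switches. When the population size (N) far exceeds individual memory capacity (N ≫ L), the re-encounter probability decays as 𝒪(L/N), which structurally dilutes decentralized peer-to-peer accountability and causes global behavioral norms to decouple from moral baselines and drift toward a minimalist legal floor. Cyclic scale experiments further reveal a pronounced, path-dependent hysteresis loop, mathematically characterizing the non-Markovian and irreversible nature of moral decay in self-organizing social systems and exposing a key constraint on the sustainability of morality.
Link: https://arxiv.org/abs/2606.27039
Authors: Amir Arslan Haghrah, Amir Aslan Haghrah
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: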
Abstract:This study introduces a particle-based computational framework to investigate the scalability of morality and the systemic decoupling of formal law from decentralized social ethics in expanding populations. While micro-societies reinforce ethical conduct through local reciprocity, macroscale systems introduce anonymity that strains cognitive memory limitations. We model individual agents as discrete particles with finite memory capacities ($L$) and dynamically evolving, stochastic choice profiles ($\mu$) regulated by non-linear social pressure switches. Monte Carlo ensemble simulations demonstrate a distinct, non-linear phase transition as the population scales ($N \to \infty$). When the population metric outpaces memory capacity ($N \gg L$), the local re-encounter probability drops as $\mathcal{O}(L/N)$. This structural dilution neutralizes decentralized peer-to-peer accountability, causing global behavioral norms to decouple from moral baselines and drift toward a minimalist legal floor. Furthermore, cyclic scale experiments expose a prominent, path-dependent hysteresis loop, mathematically formalizing the non-Markovian inertia and irreversible nature of moral decay in self-organizing social systems.
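The O(L/N) scaling of the re-encounter probability can be checked with a small Monte Carlo experiment. The sketch below is not the authors' particle model: it assumes uniformly random pairing and a memory that simply stores the last L partners of one focal agent, and the function name `reencounter_rate` is hypothetical. Under those assumptions the hit rate is close to L/(N-1) when N is much larger than L.

```python
import random
from collections import deque


def reencounter_rate(n_agents, memory_len, n_rounds, seed=0):
    """Fraction of encounters in which agent 0 meets a partner it still remembers.

    Assumptions (not from the paper): partners are drawn uniformly at random
    from the other agents, and memory holds the last `memory_len` partners.
    """
    rng = random.Random(seed)
    memory = deque(maxlen=memory_len)  # oldest partner is dropped automatically
    hits = 0
    for _ in range(n_rounds):
        partner = rng.randrange(1, n_agents)  # any agent except agent 0
        if partner in memory:
            hits += 1
        memory.append(partner)
    return hits / n_rounds


if __name__ == "__main__":
    for n in (101, 1001, 10001):
        rate = reencounter_rate(n, memory_len=10, n_rounds=200_000)
        print(f"N={n:6d}  simulated={rate:.4f}  L/(N-1)={10 / (n - 1):.4f}")
```

Growing N by a factor of ten shrinks the simulated rate by roughly the same factor, which is the dilution of peer-to-peer accountability that the abstract describes.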
[MA-3] Semantic Early-Stopping for Iterative LLM Agent Loops ALT
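[Quick read]: This paper addresses the efficiency-versus-quality imbalance caused by terminating multi-agent LLM loops with a fixed iteration cap (max_iterations). Such a hard cap is blind to whether the answer is still improving, so it over-spends compute on easy inputs and may cut hard tasks short. The proposed remedy is semantic early-stopping: the loop halts when the embeddings of consecutive drafts stop changing in meaning (cosine distance with a patience window) and the answer's measured quality stops improving. The work makes three contributions. First, a theoretical footing: deterministic termination and well-definedness are proved and machine-checked, while convergence of the distance sequence is treated as an empirically tested conjecture. Second, an efficient evaluation protocol: each question's full trajectory is generated once and every stopping policy is replayed over the identical drafts, giving a strictly paired comparison at low cost, with operational tokens (charged to a policy) separated from evaluation tokens (a measurement instrument). Third, an empirical study on multi-hop retrieval-augmented question answering (HotpotQA): a judge-free semantic stopper cuts operational tokens by 38% at parity quality, whereas the full quality-gated variant is counter-productive because its per-round judging dominates cost. An oracle that selects the best round gains +0.115 Information Score, reframing the problem from "when to stop" (easy) to "which round is best" (still open).
Link: https://arxiv.org/abs/2606.27009
Authors: Sahil Shrivastava
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 7 pages, 5 figures, 4 tables. Open implementation, machine-checked theory, and reproducible harness: this http URL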
Abstract:Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer’s measured quality stops improving. Our work makes three contributions. First, an honest theoretical footing: we prove deterministic termination and well-definedness and machine-check these claims, while treating the convergence of the distance sequence as an empirically tested conjecture rather than a (previously over-claimed) Banach contraction. Second, a judge-efficient evaluation protocol: we generate each question’s full trajectory once, replay every stopping policy over the identical drafts, and cache every LLM-judge call, yielding a strictly paired efficiency-versus-quality comparison at low cost; we further separate operational tokens (charged to a policy) from evaluation tokens (a measurement instrument). Third, an empirical study on multi-hop retrieval-augmented question answering (HotpotQA). On the 60-question test split, a judge-free semantic stopper reduces operational tokens by 38% relative to max_iterations at parity quality (Delta-IS = -0.004, p = 0.81), whereas the full quality-gated variant is counter-productive because its per-round judging dominates cost. An oracle that selects the best round attains +0.115 Information Score over every practical policy (p ~ 4e-11), reframing the problem from “when to stop” (easy) to “which round is best” (open).
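The judge-free stopping rule (cosine distance with a patience window) can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the threshold of 0.05, the patience of 2, and the names `cosine_distance` and `stopping_round` are hypothetical, and embeddings are assumed to be non-zero vectors supplied by some external embedding model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def stopping_round(embeddings, threshold=0.05, patience=2, max_iterations=10):
    """Return the 1-based round at which the loop would halt.

    The loop stops once `patience` consecutive draft-to-draft distances fall
    below `threshold`; otherwise it falls back to the iteration cap.
    """
    limit = min(len(embeddings), max_iterations)
    calm = 0  # consecutive rounds with a small semantic change
    for i in range(1, limit):
        if cosine_distance(embeddings[i - 1], embeddings[i]) < threshold:
            calm += 1
        else:
            calm = 0
        if calm >= patience:
            return i + 1
    return limit


if __name__ == "__main__":
    # Toy 2-D "embeddings": the draft changes once, then stays the same.
    drafts = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    print(stopping_round(drafts))  # halts at round 4, before the fifth draft
```

Because the iteration cap is kept as a fallback, the loop always terminates; the semantic rule only decides whether it can stop sooner.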
[MA-4] Scientific discovery as meta-optimization: a combinatorial optimization case study
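[Quick read]: This paper argues that in scientific discovery, a fixed evaluation criterion can limit the search for novel, high-quality solutions across the vast "state space" of theories and experiments, so the evaluation criterion should be modified alongside the automated search. It proposes formalizing research as meta-optimization, in which the optimization objective itself is also optimized. The key contribution is "consensus objective aggregation": objective functions generated by large language models (LLMs) are combined through correlation-weighted voting, yielding a stable, self-correcting evaluation criterion that evolves as understanding deepens. Applied to algorithm discovery for 3-SAT based on digital MemComputing machines, the approach reduces the scaling with problem size N from about N^2.51 to about N^1.33 and delivers a roughly 67× speedup on the largest instances tested, supporting its use as a problem-agnostic framework for scientific discovery.
Link: https://arxiv.org/abs/2606.26728
Authors: Yuan-Hang Zhang, Chesson Sipling, Massimiliano Di Ventra
Affiliations: University of California San Diego
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 35 pages, 6 figures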
Abstract:Scientific discovery is fundamentally an optimization problem, defined by a vast “state space” of theories and experiments, and an evaluation criterion based on quality, novelty, and validity. Large language models (LLMs) have enabled automated exploration of this space, but we argue that simultaneous modification of the evaluation criteria is equally important. Here, we propose formalizing research as meta-optimization, where the optimization objective itself is also being optimized. Our key contribution is “consensus objective aggregation,” where LLM-generated objective functions are combined via correlation-weighted voting, yielding a stable, self-correcting evaluation criterion that evolves as understanding deepens. We apply this framework to algorithm discovery for 3-SAT problems based on digital MemComputing machines, reducing the baseline scaling with problem size $N$ from $\sim N^{2.51}$ to $\sim N^{1.33}$ and delivering a $\sim 67\times$ speedup on the largest instances tested. As a problem-agnostic framework, we hope this approach will considerably aid scientific discovery.
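One plausible reading of correlation-weighted voting is sketched below. The abstract does not give the weighting formula, so everything here is an assumption: each objective is weighted by its mean Pearson correlation with the other objectives (clipped at zero), and the names `pearson` and `consensus_scores` are hypothetical. Objectives are assumed to give non-constant scores.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def consensus_scores(score_table):
    """score_table[k][i] is the score objective k assigns to candidate i.

    Assumed rule: an objective's weight is its mean correlation with the
    other objectives, clipped at zero and normalized to sum to one.
    """
    n_obj = len(score_table)
    raw = []
    for a in range(n_obj):
        corr = [pearson(score_table[a], score_table[b])
                for b in range(n_obj) if b != a]
        raw.append(max(0.0, sum(corr) / len(corr)))
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no objective agrees with the others")
    weights = [w / total for w in raw]
    n_cand = len(score_table[0])
    combined = [sum(weights[a] * score_table[a][i] for a in range(n_obj))
                for i in range(n_cand)]
    return weights, combined


if __name__ == "__main__":
    table = [
        [1, 2, 3, 4],  # three objectives that broadly agree ...
        [1, 2, 4, 3],
        [2, 1, 3, 4],
        [4, 3, 2, 1],  # ... and one outlier that ranks candidates in reverse
    ]
    weights, combined = consensus_scores(table)
    print(weights)   # the outlier receives weight 0
    print(combined)
```

The design choice illustrated here is that an objective disagreeing with the consensus loses influence automatically, which is one way a combined criterion could be self-correcting.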
[MA-5] SOLAR: AI-Powered Speed-of-Light Performance Analysis
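[Quick read]: This paper asks how fast a deep-learning model could run on target hardware, that is, its theoretical minimum execution time on a given architecture, and how far today's implementation is from that limit. Speed-of-Light (SOL) analysis provides this lower bound, but deriving SOL bounds has been manual, error-prone, and disconnected from rapid model development. To close this gap, the paper proposes SOLAR, a framework that automatically derives validated SOL bounds through a flow combining generative and deterministic components: an LLM frontend translates PyTorch and JAX source code into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and supports multi-fidelity analysis that tightens bounds and surfaces optimization insights. Evaluations on KernelBench, JAX/Flax models, and robotics workloads demonstrate four use cases: multi-fidelity headroom analysis, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.
Link: https://arxiv.org/abs/2606.26383
Authors: Qijing Huang, Sana Damani, Zhifan Ye, Athinagoras Skiadopoulos, Siva Kumar Sastry Hari, Jason Clemons, Sahil Modi, Jingquan Wang, Aditya Kane, Edward C Lin, Humphrey Shi, Christos Kozyrakis
Affiliations: NVIDIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Multiagent Systems (cs.MA); Performance (cs.PF)
Comments: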
Abstract:How fast could a deep-learning model run on target hardware, and how far is today’s implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload’s theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically derives validated SOL bounds from PyTorch and JAX source code. SOLAR leverages both generative and deterministic components in its flow: an LLM frontend translates any source programs into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and offers multi-fidelity analysis that tightens bounds and surfaces optimization insights. We evaluate SOLAR across KernelBench, JAX/Flax models, and robotics workloads. These experiments demonstrate four use cases: headroom analysis at multiple fidelity levels, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.
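The simplest form of a speed-of-light bound is the classic roofline estimate: an operator can finish no faster than the larger of its compute time and its memory-traffic time. The sketch below shows that estimate for a single unfused matrix multiplication. It is not SOLAR's analysis (the fused and cache-aware bounds are not reproduced), and the hardware peaks and the function names `sol_time_seconds` and `matmul_sol` are hypothetical.

```python
def sol_time_seconds(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Roofline lower bound: limited by compute or by memory traffic."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


def matmul_sol(m, n, k, bytes_per_elem, peak_flops_per_s, peak_bytes_per_s):
    """Unfused bound for C[m,n] = A[m,k] @ B[k,n].

    Assumes each operand is read once and the result written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return sol_time_seconds(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
    peak_flops, peak_bw = 100e12, 1e12
    square = matmul_sol(4096, 4096, 4096, 2, peak_flops, peak_bw)
    skinny = matmul_sol(1, 4096, 4096, 2, peak_flops, peak_bw)
    print(f"square matmul bound: {square * 1e3:.3f} ms (compute-bound)")
    print(f"skinny matmul bound: {skinny * 1e6:.1f} us (memory-bound)")
```

Comparing a measured kernel time against this bound gives the headroom; a tighter bound would also account for fusion and cache reuse, which this sketch ignores.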
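[MA-6] Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems ICML2026
[Quick read]: This paper studies a hidden failure mode in prompt-composed agentic systems, compositional behavioral leakage (CBL): editing one prompt module silently shifts the behavior of other modules even though they share no variable or executable dependency. The root cause is architectural non-isolation: transformer self-attention provides no formal boundary between concatenated modules, so modules interfere implicitly through the shared context window. The key to the approach is a reusable three-channel probing protocol that perturbs non-focal modules along volume, content, and form and measures the effect on behavior. Only the content channel produces a detectable effect (Cohen's d = 0.63, bootstrap 95% CI excluding zero); no individual decision flipped (a sub-threshold regime), but the effect compounds across the thousands of decisions a deployed agent makes. CBL is orthogonal to known agent-failure axes (adversarial injection, cognitive degradation, multi-agent fault propagation, privacy leakage). The paper establishes cross-module interference measurement as a requirement for evaluating prompt-composed agents and contributes an operational definition, a reusable protocol, a falsifiable prediction set, and a system-class characterization.
Link: https://arxiv.org/abs/2606.26356
Authors: Ching-Yu Lin, Yifan Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 8 pages, 2 tables. Accepted to the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), Seoul, South Korea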
Abstract:Practitioners of prompt-composed agentic systems report a recurring failure mode: editing one prompt module silently shifts the behavior of others despite no shared variable or executable dependency. We formalize this as compositional behavioral leakage (CBL): interference between modules sharing a context window. CBL is enabled by architectural non-isolation: transformer self-attention provides no formal boundary between concatenated modules. We probe CBL on a deployed job-evaluation agent (Claude Sonnet 4.6, 144 trials) through a reusable three-channel protocol that perturbs non-focal modules along volume, content, and form. Only the content channel produces a detectable paired effect (Cohen’s d = 0.63, bootstrap 95% CI excluding zero); no recommendation flipped – a sub-threshold regime invisible to standard QA but compounding across the thousands of decisions a deployed agent makes. CBL is orthogonal to known agent-failure axes (adversarial injection, cognitive degradation, multi-agent fault propagation, privacy leakage). We contribute an operational definition, a reusable protocol, a falsifiable prediction set, and a system-class characterization, establishing cross-module interference measurement as a requirement for prompt-composed agent evaluation.
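The two statistics quoted above, a paired Cohen's d and a bootstrap confidence interval, can be computed as follows. This is a generic sketch rather than the authors' analysis code: the scores are made-up illustration data, the paired d is taken as the mean difference divided by the standard deviation of the differences, and the percentile bootstrap is applied to the mean difference. The function names are hypothetical.

```python
import random
import statistics


def paired_cohens_d(baseline, perturbed):
    """Mean paired difference divided by the standard deviation of the differences."""
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(baseline, perturbed, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    means = sorted(
        statistics.mean(rng.choice(diffs) for _ in diffs)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up scores for the same eight inputs, before and after perturbing
    # a non-focal prompt module.
    baseline = [3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 2.8, 3.1]
    perturbed = [3.3, 3.3, 3.3, 3.3, 3.3, 3.4, 3.3, 3.3]
    print("paired d:", round(paired_cohens_d(baseline, perturbed), 2))
    print("95% CI for mean shift:", bootstrap_ci(baseline, perturbed))
```

An interval that excludes zero indicates a consistent shift even when no single decision crosses a threshold, which is the sub-threshold regime the abstract describes.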
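[MA-7] The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
[Quick read]: This paper addresses the reliance of current self-improving agents on a static evaluation criterion during recursive self-improvement. Existing methods generally assume that the verifier, benchmark, or labeled dataset stays fixed throughout optimization, ignoring a central feature of evolution: species adapt as their environments change with them. To overcome this, the paper brings evaluation into the self-improvement loop, opening the search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. The key is the Red Queen Godel Machine (RQGM), a recursive self-improvement framework based on controlled utility evolution: search is organized into epochs, each with a fixed evaluation criterion so that self-improvement guarantees hold within the epoch, while the utility may be updated at epoch boundaries, enabling continued improvement under non-stationary utilities. On verifiable coding tasks, the RQGM adds an agent-as-a-judge code-review signal and improves test pass rate over the prior state of the art (SOTA) while using 1.35x-1.72x fewer tokens. On scientific paper writing and reviewing and on Olympiad-level proof writing and grading, co-evolved writers reach 1.78x-1.86x higher acceptance rates and co-evolved graders reach 9% higher ground-truth accuracy. The strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate; the RQGM corrects this with an adversarial objective that discovers reviewers equally stringent on AI and human work.
Link: https://arxiv.org/abs/2606.26294
Authors: Alex Iacob, Andrej Jovanović, William F. Shen, Daniel Burkhardt, Meghdad Kurmanji, Nurbek Tastan, Lorenzo Sani, Niccolò Alberto Elia Venanzi, Ambroise Odonnat, Zeyu Cao, Bill Marino, Xinchi Qiu, Nicholas D. Lane
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 12 pages main text + 21 pages appendix (37 pages total, incl. references); 10 figures (6 main text + 4 appendix); 10 tables (2 main text + 8 appendix). Preliminary preprint; work in progress. Keywords: self-improving agents, learned evaluation, multi-agent systems, automated scientific discovery, controlled utility evolution, co-evolutionary search, autoresearch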
Abstract:Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled dataset that remains valid as the agent improves. This ignores a central feature of evolution: species adapt as their environments change with them. We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. We introduce the Red Queen Godel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities. The RQGM makes this possible through controlled utility evolution: search is organized into epochs with a fixed within-epoch evaluation criterion, while the utility can be updated at epoch boundaries, so self-improvement guarantees hold per epoch as the objective evolves across them. We begin by showing that even on verifiable coding tasks, the RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens. We then turn to scientific paper writing and reviewing, and Olympiad-level proof writing and grading, where the RQGM improves performance over prior self-improving agents: co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy. In paper reviewing, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.
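The epoch structure behind controlled utility evolution can be illustrated with a toy loop. This is a heavily simplified sketch, not the RQGM: simple hill climbing stands in for the evolutionary search, a single number stands in for an agent, and hand-written utility functions stand in for evolved evaluators. The function name `controlled_utility_evolution` and its parameters are hypothetical.

```python
import random


def controlled_utility_evolution(initial_agent, utilities, steps_per_epoch,
                                 propose, seed=0):
    """Improve an agent across epochs, each with its own fixed utility.

    Within an epoch a candidate is accepted only if it scores higher under
    that epoch's utility, so the score never decreases inside an epoch.
    The utility is allowed to change only at epoch boundaries.
    """
    rng = random.Random(seed)
    agent = initial_agent
    history = []  # (epoch index, best score so far) after every step
    for epoch, utility in enumerate(utilities):
        best = utility(agent)  # re-score the incumbent under the new utility
        for _ in range(steps_per_epoch):
            candidate = propose(agent, rng)
            score = utility(candidate)
            if score > best:
                agent, best = candidate, score
            history.append((epoch, best))
    return agent, history


if __name__ == "__main__":
    # Toy setting: the target of the utility moves from 1.0 to 3.0 between epochs.
    utilities = [lambda x: -(x - 1.0) ** 2, lambda x: -(x - 3.0) ** 2]
    step = lambda x, rng: x + rng.uniform(-0.5, 0.5)
    agent, history = controlled_utility_evolution(0.0, utilities, 50, step)
    print("final agent:", round(agent, 3))
```

The score is monotone within each epoch but may drop at a boundary when the incumbent is re-scored, which mirrors the per-epoch nature of the guarantee described in the abstract.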
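[MA-8] Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI Protocols
[Quick read]: As generative AI agent protocols proliferate, the governance structures behind their interoperability standards remain empirically underexamined; the core challenge is identifying and analyzing how socio-technical power structures form within large-scale, complex governance discourse. The paper proposes an LLM-powered comparative analysis pipeline that integrates automated annotation, neural topic modeling, and multi-layer network analysis to quantify and visualize governance discourse systematically. Validated on two contrasting agent interoperability standards, ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led), the method shows that institutional design influences thematic focus, yet both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster thematic convergence despite decentralized participation. The study illustrates how LLM-assisted methods can advance empirical research on technology governance, with implications for designing more equitable agentic AI standards.
Link: https://arxiv.org/abs/2606.26203
Authors: Yutian Wang, Luyao Zhang
Affiliations: Duke Kunshan University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: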
Abstract:As AI agent protocols proliferate, the governance structures shaping their interoperability standards remain empirically underexamined. We introduce an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures at scale. We validate it on two contrasting standards for agent interoperability: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, we combine LLM-assisted coding, topic modeling, and multi-layer network analysis to examine how institutional design shapes thematic priorities and community structure. We find that while governance form influences substantive focus, both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster greater thematic convergence despite decentralized participation. These findings illustrate how LLM-assisted methods can advance the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.
[MA-9] Kiko: Programming Agents to Enact Interaction Protocols
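[Quick read]: This paper addresses the lack of effective abstraction for connecting an agent's internal decision logic with its public decisions in multiagent systems, where agents decide in a decentralized manner. Current agent programming models offer poor abstractions for decision making and cannot clearly express how agents interact according to a protocol and make consistent communication decisions. The paper presents Kiko, a protocol-based programming model for agents built around decision makers: a programmer writes one or more decision makers, each of which chooses from a set of valid decisions and makes mutually compatible decisions on which messages to send. By completely abstracting away the underlying communication service and supporting practical decision-making patterns, Kiko lets developers focus on business logic. The paper also gives Kiko an operational semantics and establishes that Kiko agents are protocol compliant and able to realize any protocol enactment.
Link: https://arxiv.org/abs/2606.26156
Authors: Samuel H. Christie V, Munindar P. Singh, Amit K. Chopra
Affiliations: North Carolina State University; Lancaster University
Categories: Multiagent Systems (cs.MA)
Comments: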
Abstract:Realizing a multiagent system involves implementing member agents who interact based on a protocol while making decisions in a decentralized manner. Current programming models for agents offer poor abstractions for decision making and fail to adequately bridge an agent’s internal decision logic with its public decisions. We present Kiko, a protocol-based programming model for agents. To implement an agent, a programmer writes one or more decision makers, each of which chooses from among a set of valid decisions and makes mutually compatible decisions on what messages to send. By completely abstracting away the underlying communication service and by supporting practical decision-making patterns, Kiko enables agent developers to focus on business logic. We provide an operational semantics for Kiko and establish that Kiko agents are protocol compliant and able to realize any protocol enactment.
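[MA-10] Simulating Eating Disorder Patients with LLMs: Evaluating Psychological Persona Stability in Multi-Turn Conversations
[Quick read]: This paper examines persona stability in LLM-based simulation of clinical patients: whether a large language model (LLM) coherently and accurately maintains an assigned psychological profile across and within conversations. The study uses eating disorder personas grounded in five published case vignettes, a dual-assessment framework (self-report plus independent observer ratings), and a validated psychometric instrument (EDE-Q) with known ground-truth scores to evaluate six LLMs. The models turn out to be overly stable (negligible variability across and within conversations) yet systematically biased: all models overshoot ground-truth severity by 12-30% of the scale range (0.7-1.8 points on a 0-6 scale). The mechanism is selective stereotyping: models differentiate cases on behavioural items (dietary restraint) but push cognitive-affective items (body dissatisfaction, weight preoccupation) to ceiling regardless of actual severity. Additional conversational context does not improve accuracy; it compounds the overshoot. LLMs can therefore portray severe eating pathology but lack a representation of moderate clinical presentations, a "missing middle".
Link: https://arxiv.org/abs/2606.26109
Authors: Jennifer Haase, Jana Gonnermann-Müller, See Heng Yim, Nicolas Leins, Jan Mendling, Sebastian Pokutta
Affiliations: Weizenbaum Institute, Berlin, Germany; HU Berlin, Berlin, Germany; Zuse Institute Berlin, Berlin, Germany; Department of Psychology, University of Hong Kong, Hong Kong
Categories: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: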
Abstract:Large language model (LLM)-based simulations of clinical patients are increasingly used for research and training, yet their validity requires persona stability: coherent maintenance of an assigned psychological profile across and within conversations. We evaluate this prerequisite using eating disorder personas grounded in five published case vignettes, a dual-assessment framework (self-report + independent observer ratings), and validated psychometric instruments (EDE-Q) with known ground-truth scores. Across six LLMs and two experiments (between-conversation stability (Exp. I) and within-conversation stability (Exp. II)), we find that LLMs are paradoxically too stable and too inaccurate: variability is negligible, yet all models systematically overshoot ground-truth severity by 12-30% of the scale range (0.7-1.8 points on a 0-6 scale). The mechanism is selective stereotyping: models differentiate cases on behavioural items (dietary restraint) but maximise cognitive-affective items (body dissatisfaction, weight preoccupation) at ceiling regardless of case severity. Additional conversational context does not improve accuracy; it compounds the overshoot. LLMs can portray severe eating pathology but lack a representation of moderate clinical presentations, a “missing middle”.
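The two ways of stating the overshoot (points on the scale and percentage of the scale range) are related by a one-line conversion, shown here as a worked check. The helper name `overshoot_fraction` is hypothetical; the 0-6 range and the 0.7 and 1.8 point values are taken from the abstract.

```python
def overshoot_fraction(points, scale_min=0.0, scale_max=6.0):
    """Express an overshoot in scale points as a fraction of the scale range."""
    return points / (scale_max - scale_min)


if __name__ == "__main__":
    for points in (0.7, 1.8):
        share = 100 * overshoot_fraction(points)
        print(f"{points} points on a 0-6 scale = {share:.0f}% of the range")
```

Running it gives about 12% and 30%, matching the range quoted in the abstract.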
Natural Language Processing
[NLP-0] DanceOPD: On-Policy Generative Field Distillation
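[Quick read]: This paper addresses a central challenge in training modern image generation models: unifying diverse capabilities, namely text-to-image (T2I) generation, local editing, and global editing, which often conflict. Editing tends to degrade T2I performance, and local and global editing interfere with each other. The paper proposes DanceOPD, an on-policy generative field distillation framework for flow-matching models. Each capability source is defined as a velocity field over the shared flow state space; the framework routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity mean squared error (velocity MSE) objective. The student thus learns from fields queried on its own rollout states to compose expert capabilities, and the formulation also absorbs operator-defined fields such as classifier-free guidance (CFG). Experiments show that DanceOPD improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality, and offers a practical route for generative field distillation in flow-matching models.
Link: https://arxiv.org/abs/2606.27377
Authors: Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Technical Report; 39 pages, 13 figures, 9 tables; Project Page at this https URL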
Abstract:Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
[NLP-1] Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline
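[Quick read]: This paper addresses a central question in comparative politics: how to observe at scale, and automatically, whether political elites organise into rent-seeking coalitions or into civic networks that sustain governance. Traditional approaches rely on manual coding and do not scale; existing automated text-as-data methods are largely limited to simple co-occurrence, and recent LLM approaches often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. The solution is a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It first applies span-based named-entity recognition (NER) with a three-stage linking cascade that maps mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows textual correctness of 68.2% (strict) to 93.7% (lenient). Two large-scale case studies validate the pipeline. In Austria, it reconstructs a political party's complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In Poland, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of Civic Platform (PO) and Law and Justice (PiS). The framework bridges raw multilingual text and structured relational data, providing a replicable, robust foundation for cross-national empirical computational social science.
Link: https://arxiv.org/abs/2606.27347
Authors: Kirill Solovev, Jana Lasser
Affiliations: IDea_Lab, University of Graz
Categories: Computation and Language (cs.CL)
Comments: 34 pages, 17 figures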
Abstract:Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and adversarial ties at scale has historically required intensive manual coding, while automated text-as-data methods have largely been limited to simple co-occurrence. Recent large language model (LLM) approaches offer a path forward but often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. We present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It combines span-based named-entity recognition (NER) with a three-stage linking cascade mapping mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows high textual correctness (68.2% strict to 93.7% lenient). Two large-scale case studies validate the pipeline against the public record. In Austria, it reconstructs a political party’s complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In a Polish corpus, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of the polarized Civic Platform (Platforma Obywatelska, PO)–Law and Justice (Prawo i Sprawiedliwość, PiS) duopoly. By bridging raw multilingual text and structured relational data, our framework provides a robust, replicable foundation for cross-national empirical computational social science.
[NLP-2] Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning ACL2026
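[Quick read]: This paper addresses the weak task planning and limited cross-website generalization of small open-source multimodal large language models (MLLMs) on repetitive graphical user interface (GUI) tasks. It proposes the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and uses hindsight experience to synthesize strictly aligned, high-level training data, strengthening planning in out-of-distribution (OOD) settings. To analyze the generalization behavior behind this gain, the paper also proposes the task decomposition hierarchical analysis framework (TDHAF), which studies compositional generalization at three task granularities: low, middle, and high. The analysis shows that mastering low-level atomic skills does not guarantee high-level planning competence, while training on high-level tasks yields stronger OOD generalization. On real-world benchmarks, the 7B PEEU model reaches 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. Constructing hindsight high-level tasks and leveraging experience is therefore crucial for the OOD planning ability of small MLLMs.
Link: https://arxiv.org/abs/2606.27330
Authors: Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 Main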
Abstract:Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and privacy preserving compared with commercial large models, they suffer from weak planning and limited cross website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle and high levels. Our analysis reveals that mastering low level atomic skills does not guarantee high level planning competence, while high level task training yields stronger OOD generalization. Experiments on real world benchmarks demonstrate PEEU’s superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These demonstrate constructing hindsight high level tasks and leveraging experiences is crucial for OOD planning abilities of small MLLMs.
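[NLP-3] LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
[Quick read]: This paper addresses the cost and low efficiency of the manual, criterion-by-criterion checks that the German Central Bank performs when examining whether securities are eligible as collateral, given prospectuses that are long, semi-structured, and often mix German and English. Traditional information extraction based on Named Entity Recognition (NER) is limited by OCR noise, linguistic variation, and rigid span-based constraints, and it needs manually annotated data for every annotation type. The paper is the first to apply Large Language Models (LLMs) to this eligibility examination. It proposes a generative information extraction paradigm that decomposes the task into extraction, normalization, and interpretation, which improves robustness to noisy text and interleaved German-English content. A key contribution is a value-based evaluation using LLM-as-a-judge, which gives a more semantic assessment than location-based metrics. The experiments show that LLM-based systems reach up to 91% precision on document-level eligibility and behave conservatively, which lowers the risk of false acceptance.
Link: https://arxiv.org/abs/2606.27316
Authors: Serhii Hamotskyi, Akash Kumar Gautam, Christian Hänig
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: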
Abstract:Verifying the eligibility of securities as collateral is a key responsibility of the German Central Bank. However, manually verifying these assets against legal and financial criteria within lengthy, semi-structured, and often bilingual prospectuses is a resource-intensive task. While previous efforts utilized traditional Named Entity Recognition (NER) for information extraction, these methods can struggle with OCR noise, linguistic variance, and rigid span-based constraints, and the need for manually annotated training data for each relevant annotation type. In this paper, we present the first case study applying Large Language Models (LLMs) to the eligibility examination process, shifting the paradigm toward a generative Information Extraction pipeline. Our approach decomposes the task into extraction, normalization, and interpretation, allowing for greater flexibility in handling noisy text and interleaved German-English content. We further introduce a value-based evaluation methodology using LLM-as-a-judge, which offers a more semantic assessment than location-based metrics. Our results demonstrate that LLM-based systems achieve high precision (up to 91%) in document-level eligibility, exhibiting a conservative operating profile that minimizes false acceptance.
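[NLP-4] Beyond Surface Forms: A Comprehensive Mechanism-Oriented Taxonomy of Indirect Linguistic Encoding for LLM-Based Coded Language Detection EMNLP2026
[Quick read]: This paper addresses the detection of indirect linguistic expressions (ILE) that social media users adopt to evade moderation and surveillance. These expressions take many forms, including algospeak, euphemism, and adversarial obfuscation, and they share the trait of hiding sensitive meaning through recurring encoding mechanisms. Existing approaches mostly rely on taxonomies organized by semantic goal, which adapt poorly to fast-changing coded expressions. The paper proposes a comprehensive, mechanism-oriented taxonomy that sets aside communicative intent and categorizes the underlying operations by which meaning is encoded and recovered, giving a stable scaffold for recognizing emerging coded language. The taxonomy is embedded in LLM prompts and evaluated on 2,000 manually annotated TikTok and Bluesky posts against four existing taxonomies and a no-taxonomy baseline. It achieves the strongest document- and span-level performance, improving accuracy by 4.7% and F1 by 5.4% over the best-performing benchmark.
Link: https://arxiv.org/abs/2606.27314
Authors: Hamid Reza Firoozfar, Mohammadsadegh Abolhasani, Reza Mousavi, Paul Jen-Hwa Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted for review in ARR for EMNLP 2026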
Abstract:To avoid moderation and surveillance on social media, some users routinely invent indirect linguistic expressions (ILE) that camouflage sensitive meanings. Such expressions surface as algospeak, euphemisms, and adversarial obfuscation, depending on intent and context, and they involve recurring encoding mechanisms. We propose a comprehensive, mechanism-oriented taxonomy of ILE that abstracts away from communicative goals and instead categorizes the underlying operations through which meaning is encoded and recovered. We evaluate the taxonomy by incorporating it into LLM prompts and comparing it with four existing taxonomies and a no-taxonomy baseline, using 2,000 manually annotated TikTok and Bluesky posts. The proposed taxonomy attains the strongest document- and span-level performance across the three LLMs, achieving an improvement of 4.7% in accuracy and 5.4% in F1 over the best-performing benchmark. The empirical results reveal the importance of a comprehensive, mechanism-oriented taxonomy as a stable scaffold for detecting emerging coded language and a useful input to content moderation. Disclaimer: This paper contains content that may be profane, vulgar, or offensive.
[NLP-5] Multilingual Reasoning Cascades Need More Context
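[Quick read]: This paper addresses the information loss caused by translation cascades in multilingual reasoning. A cascade translates the user query from the source language into English, reasons in English, and translates the answer back. Each stage discards context that later stages may need, such as cues for cultural grounding, register, and disambiguation, which lets errors propagate. The proposed fix is a simple, training-free intervention, the context-aware translation cascade: the final back-translation module also receives the original question, the English translation of the question, and the reasoning trace. The method is evaluated on nine multilingual benchmarks covering various task types, three backbone models, and 285 high-, mid-, and low-resource languages. It yields strong gains for open-ended generation across models and resource regimes. The study further finds that the original-language question carries most of the beneficial context. The work argues for better design of information flow in machine translation cascades and offers a simple default: preserve the user's original question until the end of the pipeline.
Link: https://arxiv.org/abs/2606.27306
Authors: Arnav Mazumder, Dengjia Zhang, Shuyue Stella Li, Yulia Tsvetkov, Niyati Bafna
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: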
Abstract:Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but structurally lossy, since each stage discards information later stages may need, including cues for cultural grounding, register, and disambiguation. We examine the benefits of a simple and training-free intervention: a context-aware translation cascade, which additionally provides the original question, the English translated question, and the reasoning trace to the context of the final translation module. We evaluate gains across nine multilingual benchmarks including various task types, three backbone models, and 285 high-, mid-, and low-resource languages, and demonstrate strong gains for open-ended generation across models and resource regimes. We show that the original language question carries most of the beneficial context. Our study emphasizes the need to better design information flow in machine translation cascades for mitigating error propagation, and provides a simple and actionable default strategy: preserve the original user question until the end of the pipeline.
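The context-aware cascade described above amounts to passing extra fields to the last translation step. A minimal sketch of that prompt assembly follows. The function name, the section labels, and the prompt wording are all hypothetical, since the abstract does not give the actual prompt format.

```python
def build_final_translation_prompt(english_answer, target_language,
                                   original_question=None,
                                   english_question=None,
                                   reasoning_trace=None):
    """Assemble the prompt for the final English-to-target translation step.

    Hypothetical layout: the optional arguments are the three pieces of
    context that the context-aware cascade adds on top of a plain cascade.
    """
    sections = [f"Translate the English answer below into {target_language}."]
    if original_question is not None:
        sections.append("Original question:\n" + original_question)
    if english_question is not None:
        sections.append("English translation of the question:\n" + english_question)
    if reasoning_trace is not None:
        sections.append("Reasoning trace:\n" + reasoning_trace)
    sections.append("English answer:\n" + english_answer)
    return "\n\n".join(sections)


# Plain cascade: the translator sees only the English answer.
baseline_prompt = build_final_translation_prompt("Paris", "French")

# Context-aware cascade: the translator also sees the question and the trace.
context_aware_prompt = build_final_translation_prompt(
    "Paris", "French",
    original_question="Quelle est la capitale de la France ?",
    english_question="What is the capital of France?",
    reasoning_trace="The capital of France is Paris.",
)
```

Keeping the three context fields optional makes it easy to ablate them one at a time, which is how one would check the claim that the original-language question carries most of the benefit.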
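[NLP-6] How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
[Quick read]: This paper addresses the difficulty that large language models (LLMs) have with historical text, in particular early modern language such as 17th-century Italian and 18th-century Russian. Historical difficulty is usually treated as a single barrier, which hides its several components. The paper proposes a diagnostic framework that decomposes it into four measurable dimensions: tokenization cost, predictive uncertainty (surprisal), semantic robustness, and context sensitivity. The main finding is a dissociation between encoding cost and comprehension. Russian and early modern Italian incur comparable tokenization penalties (25-30% inflation), but their predictive difficulty diverges: 17th-century Italian is on average 2.4 times more surprising than its modern equivalent, while Russian shows only a modest increase. Even when generation is unstable, embedding similarity stays high (above 0.85), so models still represent historical meaning. A minimal temporal context prompt reduces historical surprisal by about 60%, giving a lightweight mitigation that needs no model changes. The practical conclusion is that digital libraries can deploy LLMs for semantic retrieval, provided that generative applications are adapted with care.
Link: https://arxiv.org/abs/2606.27275
Authors: Maria Levchenko
Affiliations: University of Bologna
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: The 22nd Conference on Information and Research Science Connecting to Digital and Library Science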
Abstract:Large language models (LLMs) are increasingly critical to digital library workflows, yet their ability to process historical language remains poorly understood. Historical difficulty is typically treated as a monolithic barrier, conflating orthographic variation, linguistic distance, and pretraining exposure. In this paper, we propose a diagnostic framework that decomposes this difficulty into four distinct dimensions: tokenization cost, predictive uncertainty (surprisal), semantic robustness, and context sensitivity. We evaluate this framework on three datasets spanning three centuries: (1) a newly curated corpus of 17th-century Italian texts (1610-1689) digitized from original page images; (2) canonical 19th-century Italian “I Promessi Sposi” serving as a high-exposure control; and (3) 18th-century Russian civil print books as a contrastive orthographic stress test. Our results reveal a distinct dissociation between encoding cost and comprehension. While Russian and early modern Italian incur comparable tokenization penalties (25-30% inflation), their predictive difficulty diverges sharply. 17th-century Italian is on average 2.4 times more surprising than its modern equivalent - with academic prose reaching 3.2 times - whereas Russian shows only a modest increase. But predictive uncertainty does not imply representational degradation: embedding similarity remains robust (> 0.85) across all datasets, confirming that models can represent historical meaning even when generation is unstable. Finally, we demonstrate that a minimal temporal context prompt reduces historical surprisal by approximately 60%, offering a simple, model-agnostic mitigation. These findings suggest that while historical text imposes a consistent encoding tax, digital libraries can safely deploy LLMs for semantic retrieval tasks, provided that generative applications are carefully adapted.
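Two of the four dimensions above, tokenization cost and surprisal, reduce to simple arithmetic once token counts and per-token probabilities are available. The sketch below shows one plausible way to compute them. The function names and the exact normalization (token counts over parallel texts, surprisal in bits averaged per token) are assumptions, not the paper's stated definitions.

```python
import math


def mean_surprisal_bits(token_probs):
    """Average surprisal, -log2 p, over the probabilities a model gave each token."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)


def tokenization_inflation(n_tokens_historical, n_tokens_modern):
    """Relative increase in token count for the historical text.

    Assumes both counts come from comparable (for example parallel) passages.
    """
    return n_tokens_historical / n_tokens_modern - 1.0


def surprisal_ratio(historical_probs, modern_probs):
    """How many times more surprising the historical text is than the modern one."""
    return mean_surprisal_bits(historical_probs) / mean_surprisal_bits(modern_probs)


# Hypothetical numbers, for illustration only.
inflation = tokenization_inflation(130, 100)
ratio = surprisal_ratio([0.25, 0.25], [0.5, 0.5])
```

Under this reading, a "25-30% inflation" corresponds to `tokenization_inflation` returning 0.25 to 0.30, and "2.4 times more surprising" corresponds to `surprisal_ratio` returning 2.4.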
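[NLP-7] The Geometry of Updates: Fisher Alignment at Vocabulary Scale ICML2026
[Quick read]: This paper addresses training-free source selection for families of large language models (LLMs) that share a vocabulary, with a focus on scientific string domains such as SMILES molecular strings, protein sequences, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction targets. In this setting, representation-similarity metrics such as centered kernel alignment (CKA) can be uninformative without assumptions about label-conditioned error geometry, while classical update-geometry metrics are too expensive at vocabulary scale. The paper calls this the activation-dark regime. It shows that with a shared output head, representation metrics are non-identifiable for transfer: models can have identical representations and still have orthogonal head updates. The key identity is that head Fisher alignment equals a cosine between kernel mean embeddings in the joint activation-error space, which separates activation, error, and coupling factors without materializing a Fisher matrix. The proposed method, FisherSketch, estimates this cosine in a single streaming pass. It makes head Fisher alignment practical for K=128 and K=256 with a 16 KB task signature (m=4096) and a 192 KB per-task streaming state, small enough to store next to a model hash while still encoding transfer-relevant update structure. The same signatures also serve as a diagnostic for whether LLM task similarity is driven by activations, errors, or their coupling. Shared-parameter and internal-layer validations, together with Llama-3.1-8B verbalizer-shift experiments, show that FisherSketch stays informative when activation similarity cannot tell tasks apart.
Link: https://arxiv.org/abs/2606.27242
Authors: John Sweeney
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026), PMLR 306. 64 pages total (main paper plus appendix), 4 figures, 29 tables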
Abstract:Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction targets. This creates an activation-dark regime: representation-similarity metrics can be uninformative without assumptions about label-conditioned error geometry, while classical update-geometry metrics are computationally prohibitive at vocabulary scale. We show that, in a shared-output head setting, representation metrics (e.g., CKA) are non-identifiable for transfer; models can share identical representations yet have orthogonal head updates. The key identity is that head Fisher alignment is exactly a cosine between kernel mean embeddings in the joint activation-error space, exposing activation, error, and coupling factors rather than requiring a materialized Fisher matrix. FisherSketch estimates this cosine directly in a single streaming pass, making K=128,256 head Fisher alignment practical with a 16 KB task signature (m=4096) and a 192 KB per-task streaming state, small enough to store next to a model hash, but encoding transfer-relevant update structure. Beyond source selection, the same signatures and marginals provide a diagnostic instrument for studying whether LLM task similarity is driven by activations, errors, or their coupling; shared-parameter and internal-layer validations, together with Llama-3.1-8B verbalizer-shift experiments, show that FisherSketch remains informative when activation similarity cannot distinguish tasks.
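The identity between head Fisher alignment and a cosine of kernel mean embeddings can be motivated by a short derivation. The version below is a reconstruction under stated assumptions, not the paper's own proof: a linear output head with logits $z = Wh$, a per-example error vector $e$ (for cross-entropy, softmax output minus the target), an uncentered second-moment Fisher, and the Frobenius inner product as the alignment measure. The symbols $h$, $e$, $F_A$, $\mu_A$, and $k$ are notation chosen here.

```latex
% Per-example head gradient: an outer product of error and activation.
\nabla_W \ell = e\,h^{\top},
\qquad
\big\langle e\,h^{\top},\; e'\,h'^{\top} \big\rangle_F
  = (e^{\top} e')\,(h^{\top} h').

% Fisher (second moment of vectorized gradients) for tasks A and B.
F_A = \mathbb{E}_{(h,e)\sim A}\!\left[
        \operatorname{vec}(e h^{\top})\,\operatorname{vec}(e h^{\top})^{\top}
      \right],
\qquad
\langle F_A, F_B \rangle_F
  = \mathbb{E}_{(h,e)\sim A}\,\mathbb{E}_{(h',e')\sim B}
    \left[ (h^{\top} h')^{2}\,(e^{\top} e')^{2} \right].

% With the product kernel k on the joint activation-error space,
k\big((h,e),(h',e')\big) = (h^{\top} h')^{2}\,(e^{\top} e')^{2},
\qquad
\mu_A = \mathbb{E}_{(h,e)\sim A}\big[\, k\big((h,e),\cdot\big) \big],

% the normalized Fisher alignment is a cosine between mean embeddings:
\frac{\langle F_A, F_B \rangle_F}{\lVert F_A \rVert_F\,\lVert F_B \rVert_F}
  = \frac{\langle \mu_A, \mu_B \rangle}{\lVert \mu_A \rVert\,\lVert \mu_B \rVert}.
```

The kernel factorizes into an activation term and an error term, which is consistent with the abstract's remark that the identity exposes activation, error, and coupling factors. It also shows why two tasks with identical activations can still have orthogonal head updates: the error factor alone can drive the inner product to zero.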
[NLP-8] LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
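[Quick read]: This paper examines how reliably and consistently language models (LMs) represent factual knowledge, asking whether they behave like a conventional knowledge base, where different queries for the same fact return consistent results. The study finds that LMs do not store knowledge in a single global form; they encode it in a task-specific manner in parameter space. Behavioral experiments and mechanistic analyses show that the same fact relies on different parameter subsets under different tasks, and that chain-of-thought reasoning draws part of its effectiveness from engaging task-specific parameters beyond those tied to the evaluation task. What the model knows and how it is asked are therefore intertwined at the parameter level. This undermines the analogy between LM parameters and a single stable knowledge base and has implications for the reliability and controllability of factual knowledge in LMs.
Link: https://arxiv.org/abs/2606.27237
Authors: Amit Elhelo, Amir Globerson, Mor Geva
Affiliations: Blavatnik School of Computer Science and AI, Tel Aviv University; Google Research
Subjects: Computation and Language (cs.CL)
Comments: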
Abstract:Language models (LMs) capture large amounts of factual knowledge applicable to a wide range of tasks, motivating the view of their parameters as a knowledge base. An important property of knowledge bases is that different queries for the same fact return consistent results, drawing on a single source of truth. We investigate whether LMs satisfy this property through behavioral and mechanistic analyses. Our results suggest that they encode knowledge in a task-specific manner. Behaviorally, facts acquired on one task frequently fail to co-emerge on others during training. Parameter localization experiments suggest a mechanistic explanation, revealing distinct parameter subsets underlying different tasks for the same fact. Finally, we show that chain-of-thought reasoning draws part of its effectiveness from engaging task-specific parameters beyond those tied to the evaluation task. Our findings suggest that what the model knows and how it is asked are intertwined in parameter space, undermining the “knowledge base” analogy and carrying implications for the reliability and controllability of factual knowledge in LMs.
[NLP-9] Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts
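[Quick read]: This paper addresses gaps in the analysis of dialogue dynamics in collaborative problem solving, with a focus on understanding and evaluating interaction in human-AI and multi-agent collaboration. Current analytical approaches fall short in capturing cognitive and non-cognitive problem-solving processes and metacognitive regulation. The paper proposes a hierarchical two-layer coding scheme that integrates cognitive and non-cognitive problem-solving behavior with metacognitive regulatory mechanisms to characterize dialogue dynamics systematically. A central idea is to treat metacognitive regulation as a discriminator of deeper collaboration. The framework is validated for effectiveness and generalizability on nine datasets spanning multiple domains, and it sheds light on how humans and agents coordinate knowledge, skills, and effort, offering a practical tool for optimizing and evaluating such partnerships.
Link: https://arxiv.org/abs/2606.27233
Authors: Zhengyuan Liu, Stella Xin Yin, Min-Yen Kan, Nancy F. Chen
Affiliations: Agency for Science, Technology and Research (A*STAR), Singapore; Nanyang Technological University, Singapore; National University of Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: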
Abstract:We present a conceptual framework for analyzing dialogue in collaborative problem-solving contexts, with an emphasis on the emerging dynamics of human-AI and multi-agent collaboration. As intelligent systems become active agents capable of autonomous reasoning and strategic cooperation, understanding the dialogic interaction during collaborative problem solving is increasingly important for optimizing and evaluating such partnerships. Our framework addresses key limitations in current analytical approaches through a hierarchical two-layer coding scheme that integrates cognitive and non-cognitive problem solving with metacognitive regulatory mechanisms. We demonstrate its effectiveness and generalizability across nine datasets spanning multiple domains, and provide insights into how humans and agents coordinate their knowledge, skills, and efforts to solve complex problems, showing in particular that metacognitive regulation can be an essential discriminator of deeper collaboration.
[NLP-10] CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
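[Quick read]: This paper addresses three coupled defects in GDN-2, the leading recurrent architecture based on the delta rule: memory-blind gating, a value-axis erase mask that wastes parameters, and a structure that mathematically rules out the WY-form triangular chunk solver. The proposed model, CARVE (Content-Aware Recurrent with Value Efficiency), fixes all three through one principle: erase only on the key axis. The paper proves that this is necessary and sufficient for the WY-form solver to remain valid. CARVE reuses the recurrent output tensor, which is already in GPU memory, as a free content signal for the erase gate, and it replaces the per-value write-gate projection with a single scalar per head. At initialization CARVE is bit-identical to GDN-2, so any quality difference comes from what the content gate learns. At 1.3B parameters trained on 100B tokens, CARVE reaches WikiText perplexity 15.72 (0.18 lower than GDN-2, a 4.5-sigma effect), leads every recurrent baseline on nine common-sense reasoning benchmarks, and sets a new best on every RULER retrieval probe, at 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems cover memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.
Link: https://arxiv.org/abs/2606.27229
Authors: Sayak Dutta
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 27 pages, 2 figures, multiple tables. Submitted to arXiv. Primary category: cs.LG; cross-list: cs.CL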
Abstract:Recurrent models must forget in order to remember, yet the state of the art decides what to erase without consulting what is stored – the gate sees only the arriving token, not the memory it is about to modify. This memory-blind gating is one of three coupled defects in the leading delta-rule architecture (GDN-2): the value-axis erase mask wastes parameters at the scale of the value projection, and – as we prove – mathematically prevents the WY-form triangular chunk solver that makes recurrent training competitive with Transformers. We introduce CARVE (Content-Aware Recurrent with Value Efficiency), which resolves all three problems through one principle: erase only on the key axis. This is provably necessary and sufficient for the WY-form solver to remain valid. Within it, CARVE reuses the recurrent output tensor – already written to GPU memory – as a free content signal for the erase gate, and replaces the per-value write-gate projection with a single scalar per head. At initialisation CARVE is bit-identical to GDN-2; any quality difference emerges from what the content gate learns. At 1.3B parameters trained on 100B tokens, CARVE achieves WikiText perplexity 15.72 (minus 0.18 vs. GDN-2, a 4.5-sigma effect), leads every recurrent baseline on nine common-sense reasoning benchmarks, and sets state of the art on every RULER retrieval probe – at 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems cover memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.
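For readers unfamiliar with the delta rule that this entry builds on, the sketch below shows one plain delta-rule memory update in pure Python. It is the textbook update only. It does not reproduce CARVE's content-aware erase gate, its scalar write gate, or the chunk-parallel WY-form solver, and the function names are hypothetical. It does illustrate why erasure naturally lives on the key axis: the old association is removed through the rank-one term built from the key.

```python
def read(state, query):
    """Read from the memory: state (d_v x d_k, list of rows) times query (d_k)."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


def delta_rule_step(state, key, value, beta):
    """One delta-rule update: S <- S + beta * (v - S k) k^T.

    Equivalent to S (I - beta k k^T) + beta v k^T, so whatever was stored
    under `key` is erased along the key axis before the new value is written.
    """
    predicted = read(state, key)  # what the memory currently returns for key
    return [
        [s + beta * (v_i - p_i) * k_j for s, k_j in zip(row, key)]
        for row, v_i, p_i in zip(state, value, predicted)
    ]


# Write one association, then overwrite it under the same unit-norm key.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state = delta_rule_step(state, key, [3.0, 4.0], beta=1.0)
first_read = read(state, key)
state = delta_rule_step(state, key, [5.0, 6.0], beta=1.0)
second_read = read(state, key)
```

With a unit-norm key and `beta=1.0`, the second write fully replaces the first, which is the "forget in order to remember" behavior the abstract refers to.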
[NLP-11] Compositionality and the lexicon in evolutionary semantics
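[Quick read]: This paper addresses how semantic universals in natural language evolve, and in particular how an evolutionary model can represent the way sentence meanings are built. Prior work either assumes lexicons with fixed signal structures or uses holistic composition without interpretable lexical parts, so it does not reflect the recursive composition that formal semantics has established. The paper builds an evolutionary framework that incorporates this insight by letting lexical meanings and a composition function co-evolve under pressures for conceptual simplicity and communicative accuracy. Applied to the evolution of quantificational meaning, the framework shows that conservativity, the best-known semantic universal, emerges as an efficient system-wide abstraction. The account is sensitive to syntactic structure and helps reconcile empirical evidence on quantifier learnability with earlier evolutionary models. The results show that the formal-semantic picture of sentence meaning can be combined productively with evolutionary modeling, and the framework offers a template for studying universals that involve global compression within a grammatical category, semantic specialization of syntactic arguments, and the co-evolution of lexical and compositional meaning.
Link: https://arxiv.org/abs/2606.27228
Authors: Fausto Carcassi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: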
Abstract:Formal semantics has shown that sentence meanings arise by recursively composing lexical meanings, yet much of the literature on semantic universals models either lexicons with fixed signal structures or holistic composition without interpretable lexical parts. We introduce a framework that integrates this fundamental insight of formal semantics in evolutionary modeling, by allowing lexical meanings and a composition function to co-evolve under pressures for conceptual simplicity and communicative accuracy. We apply this framework to the evolution of quantificational meaning. Analyzing the Pareto frontier, we find that the most well-known semantic universal, conservativity, emerges as an efficient system-wide abstraction. The account is sensitive to syntactic structure and helps reconcile tensions between empirical evidence on quantifier learnability and prior evolutionary models. More broadly, the results demonstrate that the picture of sentential meaning developed in formal semantics can be productively combined with evolutionary modeling. The framework offers a template for studying universals that involve global compression within a grammatical category, semantic specialization of syntactic arguments, and the co-evolution of lexical and compositional meaning.
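[NLP-12] Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement ICML2026
[Quick read]: This paper addresses a core bottleneck in evaluating the outputs of large language models (LLMs): human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to diagnose. The proposed framework, BINEVAL, decomposes evaluation criteria into atomic binary questions. An LLM answers each question independently, which yields transparent question-level feedback, and the verdicts are aggregated into interpretable, multi-dimensional scores. The method matches or outperforms strong baselines in correlation with human judgments, with especially strong results on factual consistency benchmarks such as QAGS. It also avoids the ceiling effects common in earlier LLM judges, so it discriminates better between borderline and clearly flawed outputs. The question-level feedback can be used directly for iterative prompt optimization, improving summarization and generation under both self-update and cross-model update settings. Overall, BINEVAL is a training-free, task-agnostic, and interpretable evaluation framework with practical diagnostic and optimization value.
Link: https://arxiv.org/abs/2606.27226
Authors: Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, Sambit Sahu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the Second Workshop on Compositional Learning at ICML 2026, Seoul, South Korea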
Abstract:Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.
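The aggregation step described above, turning per-question yes/no verdicts into per-dimension and overall scores, can be sketched in a few lines. The abstract does not specify how BINEVAL weights or calibrates its scores, so the unweighted means, the function name, and the example questions below are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_binary_verdicts(verdicts):
    """Aggregate (dimension, question, passed) triples into scores.

    Returns per-dimension pass rates, their unweighted mean as an overall
    score, and the list of failed questions as question-level feedback.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    failed_questions = []
    for dimension, question, passed in verdicts:
        counts[dimension][0] += int(passed)
        counts[dimension][1] += 1
        if not passed:
            failed_questions.append(question)
    per_dimension = {dim: yes / total for dim, (yes, total) in counts.items()}
    overall = sum(per_dimension.values()) / len(per_dimension)
    return per_dimension, overall, failed_questions


# Hypothetical verdicts that an LLM judge might return for one summary.
verdicts = [
    ("consistency", "Does every named entity appear in the source?", True),
    ("consistency", "Are all numbers copied correctly from the source?", False),
    ("fluency", "Is the summary free of grammatical errors?", True),
]
scores, overall_score, feedback = aggregate_binary_verdicts(verdicts)
```

The returned `feedback` list is the part that makes the evaluation inspectable: each failed question names one concrete defect, which is also the signal one would feed into prompt revision.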
[NLP-13] Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes
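[Quick read]: This paper addresses the lack of explicit intent modeling in existing safety classifiers, which treat the link between a prompt and its final harm label as a black box and ignore the user's actual intent. The authors introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and a harm label, to support a systematic evaluation of intent-aware training. The core idea is to use explicit intent as a supervision signal across several training regimes: supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. The key findings are that directly rewarding intent faithfulness with GRPO yields the strongest average performance on five external safety benchmarks, and that the intent-aware models form the Pareto frontier of inference latency versus F1. Faithful intent modeling is therefore a compact, high-quality supervision signal that makes safety classifiers more robust.
Link: https://arxiv.org/abs/2606.27210
Authors: Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, Yftah Ziser
Affiliations: University of Groningen; University Politehnica of Bucharest; NVIDIA
Subjects: Computation and Language (cs.CL)
Comments: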
Abstract:We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and harm label. We use AIMS to evaluate intent-aware training across supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. Despite its size, AIMS enables competitive safety classifiers across training regimes: DPO from model-generated intent errors improves over SFT, and intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs. Most notably, directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks, while our intent-aware models form the inference latency-F1 Pareto frontier. These results show that faithful intent modeling is a compact, high-quality supervision signal for more robust safety classifiers.
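The claim that a set of models "forms the latency-F1 Pareto frontier" has a precise meaning: no other model is at least as fast and at least as accurate while being strictly better on one of the two. A minimal sketch of that check follows. The model names and numbers are made up for illustration and are not results from the paper.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (latency, F1).

    `models` is a list of (name, latency_ms, f1). Lower latency is better,
    higher F1 is better. A model is dominated if another model is no worse
    on both axes and strictly better on at least one.
    """
    frontier = []
    for name, latency, f1 in models:
        dominated = any(
            other_latency <= latency and other_f1 >= f1
            and (other_latency < latency or other_f1 > f1)
            for _other_name, other_latency, other_f1 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical classifiers: C is slower and less accurate than B.
candidates = [
    ("A", 10.0, 0.80),
    ("B", 20.0, 0.85),
    ("C", 25.0, 0.82),
    ("D", 40.0, 0.90),
]
frontier_names = pareto_frontier(candidates)
```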
[NLP-14] Syntactic Belief Update as the Driver of Garden Path Processing Difficulty
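[Quick read]: This paper addresses the failure of conventional sentence-processing models on garden path sentences: models based on lexical surprisal do not predict the processing difficulty humans show on them. Lexical surprisal predicts processing difficulty well in most cases, but in a garden path sentence the initial parse is overturned by a later critical word, and the usual predictor breaks down. The paper proposes Syntactic Belief Update: the processor maintains a probability distribution over syntactic trees (its syntactic belief) and updates it after each new word. When the processor has been led down the wrong path, the belief is badly off and needs a large revision at the critical word, and the size of that revision is measured with a generalized Rényi divergence. The metric depends on the lexical items but is fully independent of their probability, which removes the reliance on word predictability that surprisal has. The metric fits human reading-time data on garden path sentences better than surprisal, which points to purely non-lexical alternatives to surprisal as a research direction in psycholinguistics.
Link: https://arxiv.org/abs/2606.27206
Authors: Alan Zhou, Miloš Stanojević, John T. Hale
Affiliations: Johns Hopkins University; University College London
Subjects: Computation and Language (cs.CL)
Comments: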
Abstract:Garden path sentences present a processing difficulty for humans – the sentence prefix leads the listener towards one interpretation, until the listener hears a critical word that shows that the initial interpretation was wrong. Lexical surprisal, a measure that usually predicts sentence processing difficulty quite well, fails to provide good predictions for garden path sentences. We propose an alternative that actively predicts a probability distribution over syntactic trees (its syntactic belief) and updates that distribution after each new word. If a processor is led down a garden path, syntactic beliefs will be wrong and will require a large update at the critical word. The magnitude of the update is measured with a generalized Rényi divergence. Crucially, this metric is dependent on lexical items, but is fully independent of the probability of lexical items. This Syntactic Belief Update provides a better fit to the human reading time data on garden path sentences. This suggests a new research direction examining purely non-lexical alternatives to surprisal for psycholinguistics.
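To make the belief-update idea concrete, the sketch below measures the size of an update between two distributions over candidate parses with the standard Rényi divergence of order alpha. The abstract says "generalized Rényi divergence" without giving the exact form, so the standard definition, the choice alpha = 0.5, the direction (posterior against prior), and the parse labels are all assumptions.

```python
import math


def renyi_divergence(p, q, alpha=0.5):
    """Standard Renyi divergence D_alpha(p || q) for discrete distributions.

    p and q map outcomes (here: parse labels) to probabilities.
    D_alpha = log(sum_x p(x)^alpha * q(x)^(1 - alpha)) / (alpha - 1).
    Restricted to 0 < alpha < 1 so that zero-probability outcomes are safe.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch supports 0 < alpha < 1 only")
    total = sum(
        prob ** alpha * q.get(outcome, 0.0) ** (1.0 - alpha)
        for outcome, prob in p.items()
    )
    if total == 0.0:
        return math.inf  # disjoint supports: the belief is replaced entirely
    return math.log(total) / (alpha - 1.0)


# Hypothetical beliefs before and after the critical word of a garden path.
belief_before = {"main_verb": 0.95, "reduced_relative": 0.05}
belief_after = {"main_verb": 0.05, "reduced_relative": 0.95}
belief_mild = {"main_verb": 0.90, "reduced_relative": 0.10}

garden_path_update = renyi_divergence(belief_after, belief_before)
ordinary_update = renyi_divergence(belief_mild, belief_before)
```

The inputs are distributions over trees, not word probabilities, which matches the abstract's point that the metric does not depend on how likely the lexical item itself was.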
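[NLP-15] Forecasting With LLMs: Improved Generalization Through Feature Steering
[Quick read]: This paper addresses look-ahead bias in forecasting with large language models (LLMs): the model leans on future information or time-specific knowledge when it should infer from generalizable patterns in historical data. The approach uses sparse autoencoders to inspect the model's internal states and identify interpretable features associated with time-aware reasoning and with look-ahead-biased reasoning. Amplifying the time-awareness features substantially reduces look-ahead bias while preserving general reasoning performance, whereas intervening on the candidate look-ahead-bias features has no effect. The results indicate that interpretable temporal features can be used to causally steer LLMs toward reasoning that is better grounded in historical information, which makes forecasts more reliable.
Link: https://arxiv.org/abs/2606.27199
Authors: Humzah Merchant, Bradford Levy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: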
Abstract:Successful forecasting involves identifying patterns between historical and future states of the world which generalize to future observations. We apply LLMs to a variety of forecasting tasks and inspect their internal states using sparse autoencoders to understand whether they appear to rely on time-specific pieces of knowledge versus generalizable patterns. Our analyses identify features associated with both time-aware reasoning and look-ahead-biased reasoning. We then apply the LLMs to an entirely different domain and intervene on these features. We find that amplifying time-awareness features substantially reduces look-ahead bias on forecasting prompts while preserving general reasoning performance. In contrast, steering the candidate look-ahead-bias features does not produce an effect. These results suggest that interpretable temporal features can be used to causally shift LLMs toward more historically grounded reasoning.
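Feature steering of the kind described above is commonly implemented by adding a scaled feature direction to a hidden state. The sketch below shows that operation on plain Python lists. It assumes the direction is the sparse autoencoder's decoder vector for the chosen feature and that steering is additive, neither of which the abstract states. The function names and numbers are hypothetical.

```python
def steer(hidden_state, feature_direction, scale):
    """Add `scale` times a feature direction to a hidden-state vector."""
    return [h + scale * d for h, d in zip(hidden_state, feature_direction)]


def projection(hidden_state, feature_direction):
    """Dot product of the hidden state with the feature direction."""
    return sum(h * d for h, d in zip(hidden_state, feature_direction))


# Hypothetical 3-dimensional hidden state and a unit-norm feature direction.
hidden = [1.0, 2.0, 0.0]
time_awareness_direction = [0.0, 1.0, 0.0]

steered = steer(hidden, time_awareness_direction, scale=3.0)
before = projection(hidden, time_awareness_direction)
after = projection(steered, time_awareness_direction)
```

For a unit-norm direction the projection rises by exactly `scale`, while components orthogonal to the direction are left unchanged. That locality is what makes it plausible for steering to shift one behavior while preserving general reasoning.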
[NLP-16] HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
[Quick read]: This paper addresses two limitations in how large vision-language models (LVLMs) are evaluated on harmful-video moderation. First, existing benchmarks overlook the multi-layered nature of harmful videos and reduce evaluation to binary classification, which misses implicit or deeply contextual harm. Second, explanatory rationales are absent: current evaluations check only whether a model flags a video correctly and not why, so models can succeed through superficial shortcuts and evaluation becomes a black box. The paper presents HarmVideoBench, a multi-layered diagnostic benchmark of 1,379 videos and 4,137 multiple-choice questions covering three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning. It also proposes BCR, a benchmark-aligned method that predicts reasoning boundaries and retrieves context dynamically only when needed. BCR raises the base model's macro average from 61.7% to a state-of-the-art 84.4%. The key ingredients are a hierarchical, interpretable evaluation framework and a dynamic context-aware mechanism that strengthens deeper reasoning.
Link: https://arxiv.org/abs/2606.27187
Authors: Jiajun Wu, Haoyu Kang, Yining Sun, Jiacheng Hou, Heng Zhang, Danyang Zhang, Zhenjun Zhao, Haochi Zhang, Leixin Sun, Eric Hanchen Jiang, Yushan Li, Ruiyu Li, Mengkai Huang, Yan Gao, Xu Zhang, Guancheng Wan
Affiliations: Stanford University; Google; Tsinghua University; Peking University; Shanghai Jiao Tong University; Fudan University; Zhejiang University; MIT; University of California, Berkeley; Carnegie Mellon University; National University of Singapore; University of Oxford; Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent. Current frameworks measure exclusively whether a model flags a video correctly rather than explaining why, turning evaluation into a black box where models can succeed through superficial shortcuts. To address these problems, we present HarmVideoBench, a multi-layered diagnostic benchmark comprising 1,379 videos paired with 4,137 multiple-choice questions. HarmVideoBench benchmarks three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning, aiming to evaluate models’ deep understanding beyond surface cues with carefully balanced and curated samples. We evaluate 19 leading models on HarmVideoBench to assess their multidimensional understanding of harmful videos. Moreover, we introduce BCR, a benchmark-aligned method that predicts reasoning boundaries and dynamically retrieves context only when needed. Experimental results show that BCR substantially improves the base model’s performance in harmful video understanding, raising the macro average from 61.7 percent to a state-of-the-art 84.4 percent.
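[NLP-17] The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
[Quick read]: This paper asks whether the high accuracy of large language models (LLMs) on cognitive tasks comes from flexible reasoning or from pattern matching against training data. The core question is whether LLMs adapt their reasoning strategy to the content of a problem or are led by its surface form, such as a riddle-like structure. To test this, the authors introduce the riddle riddle paradigm: word problems written to mimic popular riddles but altered so that their answers need only a literal interpretation. Humans and LLMs fail in opposite directions. Humans are less accurate on genuine riddles (50.5%) and much more accurate on riddle riddles (80.5%), while LLMs do well on genuine riddles (84.9%) and drop to 50.7% on riddle riddles. Error analysis shows that 90.8% of LLM errors on riddle riddles come from inappropriate use of inventive reasoning, whereas 57.6% of human errors on genuine riddles come from overextending literal reasoning. Contrasting tasks that look alike but demand different reasoning thus reveals that LLM behavior leans on form cues and memory retrieval more than on flexible strategy selection. Without such controlled contrasts, output that looks like reasoning is easily mistaken for genuine reasoning.
Link: https://arxiv.org/abs/2606.27103
Authors: Bella Fascendini, Kathryn McGregor, Max D. Gupta, Thomas L. Griffiths
Affiliations: Princeton University
Subjects: Computation and Language (cs.CL)
Comments: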
Abstract:Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks; however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretations. Identifying correct answers requires looking past the structure of each question and flexibly applying different reasoning strategies based on the content. If LLMs respond to surface features, such as form, a riddle-like structure should cause models to use an inventive reasoning strategy even when a literal interpretation suffices. Alternatively, if LLMs reason based on content, they should flexibly switch strategies when appropriate. Across two experiments with nine state-of-the-art LLMs and 100 human participants, we show humans and LLMs fail on this paradigm in opposite directions. LLMs were far more accurate on genuine riddles than on riddle riddles (84.9% vs. 50.7%), whereas humans showed the reverse effect (50.5% vs. 80.5%). Error analysis shows that 90.8% of LLM errors on riddle riddles (the condition where they show diminished performance) were due to inappropriate use of inventive reasoning while only 57.6% of human errors on genuine riddles were due to overextending literal reasoning. Thus, while both groups make mistakes, reasoning mistakes are made more often by LLMs than by humans. Overall, LLMs’ strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection, and without stimuli designed to elicit this contrast, it becomes easy to conflate LLM-generated outputs that look like reasoning with genuine reasoning.
[NLP-18] Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning ICML2026
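[Quick Read]: This paper addresses the confounding of objective case facts with adjudicative context in legal outcome prediction, that is, how to separate evidence-driven merit-based rulings from technical disposals that may hinge on judicial discretion. The core solution is a Judge-Aware Gated Multi-Task Learning architecture. A fine-grained outcome taxonomy supervises the encoder, imposing a structural regularization that disentangles distinct semantic pathways, and a Gated Fusion mechanism dynamically modulates how much the prediction relies on judge identity, so adjudicative context is modeled explicitly. On a dataset of 13,937 UK Employment Tribunal decisions, and compared with supervised fine-tuning (SFT) baselines that inject judge identity and the taxonomy as prompt tokens or autoregressive output targets, the model uses an order of magnitude fewer trainable parameters and improves most on the most ambiguous and rarest outcome classes. The model is also interpretable: learned judge embeddings and calibration profiles localize the cases where adjudicative context drives the prediction. The results indicate that, for identity-conditioned legal outcome classification, differentiable structured composition of the conditioning inputs gives more accurate and more parameter-efficient models than prompt-based composition over a substantially larger backbone.
Link: https://arxiv.org/abs/2606.27069
Authors: Stanisław Sójka,Felix Steffek,Matthias Grabmair
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages (8 pages main text), 5 figures, 9 tables. Accepted to the AI for Law Workshop at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea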
Abstract:Legal outcome prediction must disentangle objective case facts from adjudicative context. Merit-based rulings rely on factual evidence while technical disposals may hinge on judicial discretion. We propose a Judge-Aware Gated Multi-Task Learning architecture that explicitly models this distinction. We introduce a fine-grained outcome taxonomy to supervise the encoder, enforcing a structural regularization that disentangles distinct semantic pathways. This granular legal curriculum enables our Gated Fusion mechanism to dynamically modulate reliance on judge identity. We evaluate our approach on 13,937 UK Employment Tribunal decisions. We benchmark our design against supervised fine-tuning (SFT) of a Gemma-4 26B-A4B backbone, in which judge identity and the taxonomy are injected as prompt tokens or autoregressive output targets. The two contextual signals compose only weakly when forced through a single autoregressive channel. In contrast, coupling a LoRA-adapted Gemma-4 encoder with our gated architecture defines a new state of the art on this benchmark while requiring an order of magnitude fewer trainable parameters than the generative SFT baselines, with gains concentrated on the most ambiguous and rarest outcome classes. Beyond accuracy, the architecture is interpretable; learned judge embeddings and calibration profiles localize the cases where adjudicative context drives the prediction. These results indicate that, for identity-conditioned classification of legal outcomes, the choice of conditioning interface dominates scale: differentiable structured composition yields more accurate, more parameter-efficient models than prompt-based composition over a substantially larger backbone.
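To make the idea of a gate that "modulates reliance on judge identity" concrete, here is a minimal sketch of one common form of gated fusion. The abstract does not specify the gate's parameterization, so the scalar sigmoid gate, the function name `gated_fusion`, and all numbers below are illustrative assumptions, not the authors' architecture.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_features, judge_embedding, gate_weights, gate_bias):
    """Add a judge embedding to text features, scaled by a scalar gate.

    Hypothetical form: gate = sigmoid(w . [text; judge] + b). The paper's
    actual gate is not described in the abstract.
    """
    concat = list(text_features) + list(judge_embedding)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [t + gate * j for t, j in zip(text_features, judge_embedding)]
    return gate, fused


# Toy inputs (invented for illustration).
text = [0.2, -0.4]
judge = [1.0, 1.0]
zero_weights = [0.0, 0.0, 0.0, 0.0]

# A strongly negative bias closes the gate: the judge signal is nearly ignored.
closed_gate, closed = gated_fusion(text, judge, zero_weights, gate_bias=-20.0)
# A zero bias gives a half-open gate: half of the judge embedding is mixed in.
half_gate, half = gated_fusion(text, judge, zero_weights, gate_bias=0.0)
```

In a trained model the gate weights would be learned, so the gate could open for cases where adjudicative context matters and stay near zero for cases decided on the facts.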
[NLP-19] NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models
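[Quick Read]: This paper targets the limited reliability of large language models (LLMs) in highly technical domains such as nuclear engineering, and in particular the lack of systematic evaluation of factual knowledge, quantitative reasoning, and conceptual understanding. It introduces NuclearQAv2, a benchmark of approximately 1,240 question-answer pairs in three categories: boolean, numeric, and verbal. The key to the approach is a hybrid construction pipeline that combines expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific technical corpora, with structured prompting used for both automated question generation and response evaluation, which makes benchmark construction and evaluation scalable. Experiments show that models generally do well on factual questions but remain considerably weaker on quantitative reasoning and conceptual understanding, which underlines the value of multi-faceted evaluation and supports NuclearQAv2 as a scalable benchmark for technical domains.
Link: https://arxiv.org/abs/2606.27047
Authors: Henry Shaowu Yuchi,Michal Kucer,Benjamin H. Sims,Selma Peterson,Emily Taylor
Affiliations: Los Alamos National Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: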
Abstract:Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and verbal. NuclearQAv2 is constructed using a hybrid pipeline that combines expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific technical corpora. By leveraging structured prompting for both automated question generation and response evaluation, the proposed framework enables scalable benchmark construction and evaluation. We evaluate a diverse set of LLMs using NuclearQAv2 and observe substantial performance differences across task types. While the models generally perform well on factual questions, quantitative reasoning and conceptual understanding remain considerably more challenging. These results highlight the importance of multi-faceted evaluation frameworks and establish NuclearQAv2 as a scalable benchmark for assessing LLM capabilities in technical domains.
[NLP-20] Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization
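[Quick Read]: This paper addresses the difficulty of building general-purpose role-playing agents that portray a character from a natural-language profile with deep, human-like internal thought. The dominant approach, supervised fine-tuning (SFT), encourages surface behavioral mimicry and generalizes poorly out of distribution. The authors propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps (Interaction Perception, Psychological Empathy, and Logical Construction), so that the model reasons dynamically from the profile instead of matching surface patterns. Structured reasoning alone is not enough, so reinforcement learning is added to improve character fidelity. The authors observe that under LLM-based reward models, generic phrases that hack the reward model and genuinely role-specific phrases receive identical gradient signals; this hacking accumulates over training and misleads the model into treating both as equally optimal. To counter this, they propose Role-Aware Policy Optimization (RAPO), which uses the mutual information between profile and token to weight gradients asymmetrically: it amplifies role-specific tokens under positive advantage and attenuates them under negative advantage. Experiments on CoSER, CharacterBench, and CharacterEval show that Psy-CoT outperforms existing role-playing CoT methods and that RAPO consistently surpasses GRPO across multiple model scales.
Link: https://arxiv.org/abs/2606.27025
Authors: Zhenhua Xu,Dongsheng Chen,Jian Li,Yitong Lin,Zhebo Wang,Jiafu Wu,Yizhang Jin,Chengjie Wang,Meng Han,Yabiao Wang
Affiliations: Zhejiang University; Tencent Youtu Lab
Categories: Computation and Language (cs.CL)
Comments: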
Abstract:Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm, supervised fine-tuning, encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps (Interaction Perception, Psychological Empathy, and Logical Construction) so that the model thinks dynamically from the profile rather than merely mimicking surface patterns. While structured reasoning provides a foundation, it alone is insufficient; reinforcement learning is essential to further align the model with character fidelity. However, we observe that under LLM-based reward models, both generic phrases that hack the reward model and genuinely role-specific phrases receive identical gradient signals; this hacking accumulates over training, misleading the model into treating both as equally optimal choices. To address this, we propose Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically, amplifying role-specific tokens under positive advantage while attenuating them under negative advantage. Experiments on CoSER, CharacterBench, and CharacterEval demonstrate that Psy-CoT outperforms existing role-playing CoT methods, and RAPO consistently surpasses GRPO across multiple model scales.
[NLP-21] Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
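[Quick Read]: This paper addresses overconfidence in multimodal large language models (MLLMs) on Medical Visual Question Answering (VQA), where models report high confidence even when the answer is wrong. Existing verbalized-confidence calibration methods were developed mainly for text-only LLMs and do not account for the multimodal nature of medical image understanding. The authors propose a training-based calibration framework built around a composite loss that combines a Brier-style calibration term, an anchor regularizer that prevents confidence from collapsing toward extreme values, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal comes from a 2×2 factorial perturbation design that crosses image presence with text integrity, probing how much the model relies on visual input versus language priors. A top-K KL divergence regularizer protects the model's answering ability during fine-tuning. On three medical VQA benchmarks and two architectures (MedGemma 4B IT and Qwen2 VL 7B Instruct), the method reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy. Ablations confirm that each loss component is necessary for the calibration gains. On average across benchmarks the method outperforms prompting-based, sampling-based, and training-based alternatives, and the code is publicly available.
Link: https://arxiv.org/abs/2606.27023
Authors: Eren Senoglu,Federico Toschi,Nicolo Brunello,Andrea Sassella,Mark James Carman
Affiliations: Politecnico di Milano
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: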
Abstract:Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text-only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training-based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier-style calibration term, an anchor regularizer that prevents confidence collapse toward extreme values, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is derived from a 2×2 factorial perturbation design that crosses image presence with text integrity, probing the reliance of the model on visual modality input versus language priors. Finally, a top-K KL divergence regularizer is used to protect the answering ability of the model during finetuning. Across three Medical VQA benchmarks and two architectures (MedGemma 4B IT and Qwen2 VL 7B Instruct), our method reduces calibration error by 60% or more, and improves discrimination by 26% or more, while preserving predictive accuracy. On average across benchmarks, the technique outperforms prompting-based, sampling-based, and training-based approaches, and ablation experiments confirm that each component of the loss function is indeed necessary for improving the calibration. All code for the experiments is publicly available.
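For readers unfamiliar with the calibration quantities mentioned above, the following is a minimal sketch of a Brier-style score and of expected calibration error (ECE), one common way to measure calibration error. It does not reproduce the paper's composite loss (the anchor, alignment, and KL terms are omitted), and the toy confidences and outcomes are invented for illustration.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy example: two models that are each right half of the time.
outcomes = [1, 0, 1, 0]
overconfident = [0.95, 0.95, 0.95, 0.95]  # always claims 95% confidence
calibrated = [0.5, 0.5, 0.5, 0.5]         # claims 50%, matching its accuracy

brier_over = brier_score(overconfident, outcomes)
brier_cal = brier_score(calibrated, outcomes)
ece_over = expected_calibration_error(overconfident, outcomes)
ece_cal = expected_calibration_error(calibrated, outcomes)
```

Both toy models have the same accuracy, so the lower Brier score and ECE of the second one reflect only better-matched confidence, which is the property a calibration term rewards.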
[NLP-22] MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
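[Quick Read]: This paper addresses the heavy and complex training of the standard Unigram tokenizer while keeping its easy vocabulary editing and its language-modeling quality. The key to the approach, MinGram (Minimalist Unigram), is to seed training with a BPE-derived vocabulary, run Hard EM on a minimum-token path, and apply a single flat score-pruning step. This removes the suffix array, the forward-backward pass, and the iterative prune loop. MinGram makes token count the primary objective and uses the Unigram score only as a tiebreak, which keeps the compression of pure token-count methods while retaining much of the morphological alignment and downstream quality of probabilistic tokenizers. Across six languages, MinGram compresses better than both BPE and standard Unigram, and a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE in bits-per-byte.
Link: https://arxiv.org/abs/2606.27019
Authors: Sander Land
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: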
Abstract:The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list representation but simplifies training using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This removes the suffix array, the forward-backward pass, and the iterative prune loop, leaving a procedure that requires little beyond tokenizer inference itself. By making token count the primary objective and using a Unigram score only as a tiebreak, MinGram keeps the compression of pure token-count methods while retaining much of the morphological alignment and downstream quality of probabilistic ones. Across six languages, MinGram compresses better than both BPE and standard Unigram, and a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE in bits-per-byte.
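The objective described above (fewest tokens first, Unigram score only as a tiebreak) can be sketched as a small dynamic program. This illustrates the segmentation objective only, not MinGram's training procedure (BPE seeding, Hard EM, and pruning are omitted); the function name `segment`, the toy vocabulary, and its log-probability scores are assumptions made for the example.

```python
def segment(text, vocab_scores):
    """Segment text into the fewest tokens; break ties by higher total score."""
    n = len(text)
    max_len = max(len(tok) for tok in vocab_scores)
    # best[i] = (token_count, -total_score, previous_index) for the prefix text[:i]
    best = [None] * (n + 1)
    best[0] = (0, 0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is None or piece not in vocab_scores:
                continue
            count, neg_score, _ = best[start]
            candidate = (count + 1, neg_score - vocab_scores[piece], start)
            # Compare (count, -score) lexicographically: count dominates.
            if best[end] is None or candidate[:2] < best[end][:2]:
                best[end] = candidate
    if best[n] is None:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, pos = [], n
    while pos > 0:
        prev = best[pos][2]
        tokens.append(text[prev:pos])
        pos = prev
    return tokens[::-1]


# Toy vocabulary with invented log-probability scores.
toy_vocab = {
    "un": -2.0, "lock": -3.0, "unl": -6.0, "ock": -6.0,
    "u": -8.0, "n": -8.0, "l": -8.0, "o": -8.0, "c": -8.0, "k": -8.0,
}

# Both "un"+"lock" and "unl"+"ock" use two tokens; the score breaks the tie.
tokens = segment("unlock", toy_vocab)
```

Because the token count is compared before the score, a segmentation with fewer tokens always wins even if its total score is lower, which matches the "token count primary, score as tiebreak" ordering described in the abstract.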
[NLP-23] Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs
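[Quick Read]: This paper tests whether the emotion vectors previously identified in Claude Sonnet 4.5 generalize to open-weight models, that is, whether these internal representations also appear there with a geometry that mirrors human psychological structure. The core questions are whether emotion concepts are encoded reproducibly across architectures, and how model depth and the extraction corpus affect the encoding of valence and arousal. The approach is a cross-model, layer-wise comparison: emotion contrast vectors are extracted across all layers of two open-weight models (Apertus-8B-Instruct-2509 and Gemma-4-E4B-it) using two model-generated corpora. Both models show clear valence geometry, with peak PC1-valence correlations of r = 0.76 and r = 0.83, close to the previously reported value, but valence emerges in opposite ways across depth: in Gemma it is strongly encoded in early layers and collapses toward later layers, whereas in Apertus it is absent in early layers and emerges at mid depths. Arousal encoding is sensitive to the extraction corpus: both models show stronger PC2-arousal alignment with Gemma-generated stories (r up to 0.45) than with Apertus-generated ones (r ≤ 0.21), suggesting that arousal-relevant cues are unevenly distributed across generated corpora. The experiment code and dataset are open-sourced to support reproducible study of emotion representations across models.
Link: https://arxiv.org/abs/2606.26987
Authors: Sinie van der Ben,Raphaël Baur,Yannick Metz,Mennatallah El-Assady
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: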
Abstract:Recent work identified emotion vectors in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1–valence correlations of r = 0.76 and r = 0.83, approaching the r = 0.81 reported for Claude Sonnet 4.5. Beyond this replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2–arousal alignment with Gemma-generated stories (r up to 0.45) than Apertus-generated ones (r ≤ 0.21), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.
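The "PC1–valence correlation" reported above can be illustrated with a small, self-contained sketch: build contrast vectors by subtracting the across-emotion mean, find the first principal component, and correlate each emotion's projection with a valence rating. The activations and valence ratings below are synthetic toy values, and the helper names are assumptions; the study itself works with real model activations and its own extraction pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def contrast_vectors(mean_activations):
    """Subtract the across-emotion mean so each vector is emotion-specific."""
    dim = len(mean_activations[0])
    count = len(mean_activations)
    grand = [sum(v[d] for v in mean_activations) / count for d in range(dim)]
    return [[v[d] - grand[d] for d in range(dim)] for v in mean_activations]


def first_principal_component(centered, iterations=100):
    """Power iteration on X^T X for already-centered rows X."""
    dim = len(centered[0])
    direction = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        projections = [sum(r[d] * direction[d] for d in range(dim)) for r in centered]
        updated = [sum(p * r[d] for p, r in zip(projections, centered)) for d in range(dim)]
        norm = math.sqrt(sum(u * u for u in updated))
        direction = [u / norm for u in updated]
    return direction


# Synthetic per-emotion mean activations (happy, calm, sad, afraid) and toy valence ratings.
activations = [[2.0, 0.1, 0.0], [1.0, -0.1, 0.1], [-1.0, 0.1, -0.1], [-2.0, -0.1, 0.0]]
valence = [0.9, 0.5, -0.6, -0.8]

vectors = contrast_vectors(activations)
pc1 = first_principal_component(vectors)
scores = [sum(v[d] * pc1[d] for d in range(len(pc1))) for v in vectors]
# The sign of a principal component is arbitrary, so report the absolute correlation.
pc1_valence_r = abs(pearson(scores, valence))
```

Repeating this per layer would give the kind of depth profile the abstract describes, where the correlation is high in some layers and low in others.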
[NLP-24] ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models
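[Quick Read]: This paper addresses the core challenge of Open Relation Extraction (OpenRE): reliable generalization to unseen relation types. Existing methods either rely on clustering, which cannot produce explicit relation labels and generalizes poorly, or generate relation labels directly with large language models (LLMs), which lack the discriminative capacity to separate easily confused relations. The authors propose ReaORE, a reasoning-guided progressive OpenRE framework whose key idea is coarse-to-fine, two-stage reasoning. The first stage, relation filtering, reasons over multiple aspects of relations and instances to produce an initial relation set, then supplements and filters it using embedding-based similarity so that the target relation is included in the candidate set. The second stage, relation prediction, applies fine-grained comparative reasoning over the candidates to better distinguish easily confused relations. Experiments on two widely used OpenRE datasets show that ReaORE outperforms existing baselines.
Link: https://arxiv.org/abs/2606.26986
Authors: Xin Lin,Liang Zhang,Guoqi Ma,Hongyao Tu,Jinsong Su
Affiliations: Xiamen University; National Institute for Data Science in Health and Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: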
Abstract:Open Relation Extraction (OpenRE) requires a model to extract unseen relations between head and tail entities from unstructured text for real-world applications. The core challenge of OpenRE lies in achieving reliable generalization to unseen relation types. Current OpenRE approaches either employ clustering techniques, which cannot generate relation labels and suffer from poor generalization, or rely on direct relation label generation via Large Language Models (LLMs), which lack sufficient discriminative capacity to distinguish easily confused relations. To address these limitations, we propose Reasoning-guided progressive OpenRE (ReaORE), a framework for performing relation extraction through coarse-to-fine relation reasoning. Specifically, ReaORE consists of two key stages: (i) relation filtering, which reasons over multiple aspects to understand relations and instances, yielding an initial relation set, and further supplements and filters relations via embedding-based similarity to ensure the target relation is included; (ii) relation prediction, which aims to predict the target relations from the above set via fine-grained comparative reasoning to better distinguish easily confused relations. Extensive experiments on two widely used OpenRE datasets demonstrate that ReaORE outperforms existing baselines.
[NLP-25] Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions
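[Quick Read]: This paper addresses behavioral inconsistency caused by contextual framing when LLMs are used in mental health and other psychologically sensitive conversational applications. Semantically similar concerns, framed differently, can elicit noticeably different responses; this framing sensitivity can undermine user expectations of stable system behavior and complicates judgments about AI reliability and trustworthiness. The core contribution is to show how framing variation is reflected in the internal representations of aligned language models. Key findings: across several instruction-tuned model families, framing systematically alters interpretive response tendencies; layer-wise probing shows that behavior-associated information stays decodable throughout transformer depth, with decoding strength varying by architecture; held-out framing probes remain consistently above chance despite strong lexical baselines; and activation steering experiments suggest that framing-associated representational directions can partially modulate downstream behavior. The study therefore argues that robustness to contextual variation should be an important consideration when evaluating the consistency and trustworthiness of conversational systems deployed in mental-health-oriented settings.
Link: https://arxiv.org/abs/2606.26982
Authors: Abla Bedoui,Ashley L. Greene,Mohammed Cherkaoui
Affiliations: Linköping University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: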
Abstract:Large language models (LLMs) are increasingly being integrated into mental health support tools and other psychologically sensitive conversational applications. In such settings, behavioral stability and consistency are important for trustworthy human-AI interaction. However, semantically similar concerns can be presented through different contextual framings, potentially eliciting different model responses. Such framing-sensitive variability may challenge user expectations regarding system behavior and complicate the assessment of AI reliability. While prior studies have primarily examined such effects at the behavioral level, less is known about how framing-related variation is reflected in the internal representations of aligned language models. In this work, we investigate these effects using controlled matched prompts spanning multiple contextual framing conditions across several instruction-tuned model families. Across architectures, framing systematically alters interpretive response tendencies. Layer-wise probing analyses show that behavior-associated information remains decodable throughout transformer depth, with architecture-dependent variation in decoding strength. Moreover, held-out framing probes remained consistently above chance across architectures despite strong lexical baselines. Activation steering experiments further suggest that framing-associated representational directions can partially modulate downstream behavioral outcomes. Finally, these findings indicate that robustness to contextual variation may represent an important consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental-health-oriented interactions.
[NLP-26] Einstein World Models
[Quick Read]: This paper asks whether LLMs, when reasoning about complex problems, are limited by relying on language alone, and whether visualizing counterfactual events can complement language as a mechanism for reasoning about phenomena beyond direct experience. The key proposal is Einstein World Models (EWMs), a blueprint that places visual-temporal rollouts inside the LLM's reasoning trace: the LLM calls a world-module to produce short rollouts of the scene under consideration. A returned rollout is treated not as the answer but as an inspectable hypothesis that supports later reasoning. The approach extends tool calling (such as web search or code execution) into the domain of visual thought experiments.
Link: https://arxiv.org/abs/2606.26969
Authors: Munachiso Samuel Nwadike,Zangir Iklassov,Ali Mekky,Zayd M. Kawakibi Zuhri,Kentaro Inui
Affiliations: MBZUAI; RIKEN AIP, Japan; Tohoku University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages (9 without references), 2 figures, 1 algorithm
Abstract:Does intelligence require the ability to reason about phenomena beyond direct experience? It is natural to suspect that some complex thought cannot be captured through language alone. However, of particular concern to this work, is whether visualising counterfactual events can complement language as a mechanism for complex thought. We ask whether LLMs can be trained to utilise such visualisation mechanisms, in a way that benefits their reasoning abilities. Motivated by this question, we propose Einstein World Models. EWMs are a blueprint for LLM-based reasoning systems that place visual-temporal rollouts inside the reasoning trace, allowing them to reason in ways that text alone may not support well. In an EWM, the LLM calls a world-module (not to be confused with a world model), to produce short rollouts of scenes under consideration. The returned rollout is treated not as the answer, but as an inspectable hypothesis that can support later reasoning. Einstein World Models extend the capability of LLMs for tool calling (such as web search or code execution), into the domain of visual thought experiments.
[NLP-27] RedVox: Safety and Fairness Gaps in Speech Models Across Languages
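[Quick Read]: This paper addresses the lack of safety and fairness evaluation for speech models in realistic multilingual settings, especially beyond English. A survey of state-of-the-art speech model releases finds that only 8% document any multilingual analysis. To fill this gap the authors introduce RedVox, a multilingual safety and fairness benchmark for speech built on real human voices, covering unsafe and stereotypical requests in five languages (English, French, Italian, Spanish, and German). The key to the approach is an evaluation framework that spans several languages and uses real spoken input. It shows that vulnerabilities persist even under non-adversarial conditions, worsen in non-English languages, and are amplified when the request arrives as speech. A survey of the participants who contributed recordings also documents the personal and privacy challenges of collecting naturalistic speech data, pointing to broader sociotechnical challenges in speech safety research.
Link: https://arxiv.org/abs/2606.26968
Authors: Beatrice Savoldi,Sara Papi,Wafa Aissa,Matteo Negri,Luisa Bentivogli
Affiliations: Fondazione Bruno Kessler
Categories: Computation and Language (cs.CL)
Comments: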
Abstract:Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting practices across state-of-the-art speech model releases, finding that only 8% document any multilingual analysis. To address this gap, we introduce RedVox, a multilingual safety and fairness benchmark for audio and speech built on real voices, covering unsafe and unfair stereotypical requests across five languages (English, French, Italian, Spanish, and German). Evaluating eight state-of-the-art models, we find that vulnerabilities persist even under non-adversarial conditions, worsen in non-English languages, and are amplified when the request comes from a spoken input. Finally, by surveying the participants who contributed to RedVox, we document the unique personal and privacy challenges of collecting speech data with human participants, pointing to broader sociotechnical challenges in naturalistic speech safety research.
[NLP-28] Term-Centric Hierarchy Induction from Heterogeneous Corpora
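[Quick Read]: This paper addresses how to organize knowledge from heterogeneous text sources into interpretable hierarchical taxonomies for tasks such as policy analysis, innovation monitoring, and exploratory domain mapping. Existing methods mostly rely on document-level representations, which capture whole documents and not the specific domain concepts that matter for knowledge organization, and this limits generalization across sources. The core of the solution is a term-centric framework: automatic term extraction maps documents from different sources into a shared representation space, enabling robust cross-source alignment, and interpretable hierarchies are then built by combining domain priors with data-driven clustering. On a new English and German multi-source benchmark of over one million documents, the method improves cross-source coherence and hierarchy quality over text- and summary-based baselines, and a case study on German regional innovation analysis shows its practical use for technology landscape mapping.
Link: https://arxiv.org/abs/2606.26963
Authors: Elena Senger,Yuri Campbell,Jan-Peter Bergmann,Rob van der Goot,Barbara Plank
Affiliations: LMU Munich; Fraunhofer Institute for Systems and Innovation Research ISI; IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments: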
Abstract:Organizing knowledge from diverse text sources into interpretable hierarchies is crucial for tasks such as policy analysis, innovation monitoring, and exploratory domain mapping. Existing taxonomy induction methods typically rely on document-level representations that capture entire documents rather than the specific domain concepts relevant for knowledge organization, limiting their ability to generalize across heterogeneous sources. We propose a term-centric framework for inducing hierarchical taxonomies from heterogeneous corpora that scales to massive document collections. Our approach maps documents from diverse sources into a shared representation space using automatic term extraction, enabling robust cross-source alignment. Based on these representations, we construct interpretable hierarchies that integrate domain priors with data-driven clustering. Experiments on a novel English and German multi-source benchmark of over one million documents demonstrate that our method improves cross-source coherence and hierarchy quality over text- and summary-based baselines. A case study on German regional innovation analysis further demonstrates its practical utility for technology landscape mapping.
[NLP-29] Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries
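[Quick Read]: This paper examines a safety concern for large language models (LLMs): whether non-expert malicious actors ("the average Jane") can use simple, readily available attacks to elicit harmful responses, given the many jailbreaks now widely known. The study has two components. First, an attack strategy based on the multi-armed bandit framework learns online which jailbreak from a large set works best, using noisy exploration on a small number of queries and then applying the learned policy on an exploitation set. Second, the authors curate FrankensteinBench, a safety benchmark of 11,279 malicious queries drawn from manual curation over 7 existing benchmarks plus automated enhancement and generation, with each query labeled simple or complex by the technical expertise needed to craft it. The bandit-based attack reaches success rates as high as 97% on average over 15 open-weight LLMs, and adding complexity to queries raises the attack success rate by up to 26% on average across models, which the authors take as confirmation that the concern is justified.
Link: https://arxiv.org/abs/2606.26936
Authors: Prarabdh Shukla,Ritik,Suhas Rao,Arpit Agarwal,Arjun Bhagoji
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: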
Abstract:With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors (“the average Jane”) could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit framework. This allows efficient online learning of the optimal jailbreak from a large choice set via noisy exploration on a small number of queries, with subsequent application of the learnt policy on an exploitation set. For the latter, we curate FrankensteinBench, a safety benchmark of 11,279 malicious queries drawn from manual curation over 7 existing benchmarks, along with automated enhancement and generation. Each query is categorized as simple or complex by the technical expertise required to craft it. Our findings confirm the concern. Our bandit-based attack achieves success rates as high as 97% on average over 15 SoTA open-weight LLMs. Moreover, adding complexity to queries raises the attack success rate by up to 26% on average across models, making it an effective, automatable prompting strategy.
[NLP-30] GAVEL: Grounded Caption Error Verification and Localization
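[Quick Read]: This paper addresses hallucinated or inconsistent outputs from vision-language models (VLMs), where the text does not align with the image content. Existing approaches mostly stop at detecting misalignment and do not explain the error or localize its visual evidence. The authors introduce GAVEL (Grounded Caption Error Verification and Localization), a task that jointly covers verification, explanation, and localization for image-text pairs. The key to the approach is a human-annotated dataset and benchmark for systematic evaluation, together with a supervised baseline trained on the annotated training split to test whether the task provides learnable supervision. Experiments show that even strong closed-source models struggle on GAVEL, while the supervised baseline improves consistently across grounding and explanation metrics.
Link: https://arxiv.org/abs/2606.26923
Authors: Zixian Gao,Atsushi Hashimoto,Kuniaki Saito
Affiliations: OMRON SINIC X Corporation
Categories: Computation and Language (cs.CL)
Comments: conference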
Abstract:Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy and localizing its visual evidence. We introduce GAVEL (Grounded Caption Error Verification and Localization), a task that jointly addresses verification, explanation, and localization for image-text pairs. To support systematic evaluation, we also present a corresponding dataset and benchmark. We further train a supervised baseline on the human-annotated training split to assess whether GAVEL provides learnable supervision for these abilities. Experiments show that even strong closed-source models struggle on GAVEL, while the supervised baseline yields consistent improvements across grounding and explanation metrics.
[NLP-31] SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages
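[Quick Read]: This paper addresses the unreliable performance of automatic speech recognition (ASR) in India's multilingual and demographically diverse healthcare settings. An audit of eight ASR models on real-world psychiatric interview data in Kannada, Hindi, and Indian English shows large differences across models and languages, and systematic performance gaps tied to speaker role and gender that raise concerns about equitable clinical deployment. The key steps of the solution are to fine-tune the two best-performing open-source models (Gemma3n and OmniLingual) with several strategies, which exposes and quantifies these gaps, and then to propose SamaVaani, a unified debiasing technique that improves overall recognition performance and fairness across demographic groups at the same time.
Link: https://arxiv.org/abs/2606.26901
Authors: Subham Kumar,Prakrithi Shivaprakash,Abhishek Manoharan,Astut Kurariya,Diptadhi Mukherjee,Prabhat Chand,Pratima Murthy,Koustav Rudra,Lekhansh Shukla,Animesh Mukherjee
Affiliations: IIT, Kharagpur; NIMHANS, Bangalore; LGBRIMH, Tezpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: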
Abstract:Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we first conduct a systematic audit of ASR performance on real-world psychiatric interview data spanning Kannada, Hindi and Indian English, comparing eight state-of-the-art models including IndicWhisper, WhisperLargeV3, Sarvam, GoogleS2T, Gemma3n, OmniLingual, Vaani, and Gemini. Our results reveal substantial variability across models and languages, with some systems performing competitively in Indian English but failing in regional speech. We further fine-tune two of the best-performing open-source models, i.e., Gemma3n and OmniLingual, using various methods. With this, we uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings, which are further mitigated by fairness-aware fine-tuning. To this end, we propose SamaVaani, a unified debiasing technique that simultaneously improves ASR performance and fairness across demographic groups.
[NLP-32] Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension
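[Quick read]: This paper asks how well representations extracted from language models predict brain activity under naturalistic language input, and how far such predictions can be interpreted for neural recordings such as MEG and ECoG. The approach systematically evaluates language-model-derived features as neural predictors using several datasets (Brain Treebank, MEG-MASC, Podcast ECoG), multiple frozen language models, blocked encoding models, and matched controls for timing, nuisance variables, and representation capacity. The key findings are that language-model representations give positive held-out prediction in most source-level summaries and improve on low-level baselines, and that ablating model-side features changes prediction scores in most evaluable source rows. Brain-derived, timing-linked, acoustic, and implanted-signal controls confirm that the analysis pipeline is sensitive at the component level. The results indicate that language-model features are useful predictors of neural activity during natural language comprehension, but that predictive usefulness should not be read as evidence of shared neural organization or shared language-processing computations.
Link: https://arxiv.org/abs/2606.26880
Authors: Xiao Jia
Affiliations: The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: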
Abstract:Language-model representations provide structured, high-dimensional annotations of naturalistic language stimuli and can serve as informative neural predictors during comprehension. We analyzed locked derived data from Brain Treebank, MEG-MASC, and Podcast ECoG with eight frozen language models, blocked encoding models, and matched temporal, nuisance, and representation-capacity controls. Positive held-out prediction and gains over low-level baselines were widespread in source-level summaries. Across Brain Treebank and Podcast ECoG, 67 of 432 evaluable rows met a controlled predictive-only criterion, and model-side feature ablations changed prediction scores in most evaluable source rows. Brain-derived, timing-linked, acoustic, and implanted-signal controls confirmed component-level sensitivity of the analysis pipeline. These findings show that language-model-derived quantities can annotate neural activity during natural speech and text comprehension. Participant-level matched-control advantages were localized rather than uniform, response-profile and feature-specificity contrasts bounded representational or computational interpretations, and complete co-indexed integrated interpretation will require future jointly indexed coverage. Together, the analyses identify language-model features as useful neural predictors and separate predictive usefulness from claims about shared neural organization or language-processing computations.
[NLP-33] Information-Aware KV Cache Compression for Long Reasoning
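[Quick read]: This paper addresses the rapid growth of the key-value (KV) cache as reasoning contexts get longer in large language models (LLMs), in both the prefilling and decoding stages. Existing KV cache compression methods estimate token importance mainly from attention weights. Attention captures contextual relevance but ignores information-theoretic signals related to predictive uncertainty and token informativeness. The paper revisits token importance from a forward-looking perspective and introduces Forward Influence, a metric that measures how compressed tokens affect future contexts. The analysis shows that tokens selected by attention scores mainly influence nearby contexts, whereas tokens with high predictive uncertainty have a much stronger influence on distant future contexts. Based on this, the paper proposes InfoKV, an entropy-aware KV cache compression framework that combines token-level predictive uncertainty with layer-wise representation evolution to produce entropy scores, and uses them together with attention scores to rank tokens during reasoning. Experiments on long-context reasoning benchmarks with Llama-3.1, Llama-3.2, and DeepSeek-R1 show that InfoKV consistently outperforms existing attention-based compression methods in both long prefilling and decoding scenarios.
Link: https://arxiv.org/abs/2606.26875
Authors: Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin
Affiliations: Shanghai Jiao Tong University; University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: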
Abstract:Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While attention effectively captures contextual relevance, it overlooks complementary information-theoretic signals related to predictive uncertainty and token informativeness. In this paper, we revisit token importance from a forward-looking perspective and introduce Forward Influence, a metric that measures how compressed tokens affect future contexts. Our analysis reveals that tokens selected by attention scores mainly influence nearby contexts, whereas tokens associated with high predictive uncertainty exhibit substantially stronger influence on distant future contexts. Based on this observation, we propose InfoKV, an entropy-aware KV cache compression framework that incorporates information-theoretic signals. It combines token-level predictive uncertainty with layer-wise representation evolution and integrates the resulting entropy scores with attention scores during reasoning. Experiments on long-context reasoning benchmarks with Llama-3.1, Llama-3.2, and DeepSeek-R1 demonstrate that InfoKV consistently outperforms existing attention-based KV compression methods in both long prefilling and decoding scenarios.
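The idea of mixing an entropy signal with attention scores when choosing which cached tokens to keep can be sketched as follows. This is not InfoKV itself: the abstract does not give the scoring rule, so the min-max normalization, the linear mix controlled by `alpha`, and the omission of the layer-wise representation term are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens(attn_scores, next_token_dists, budget, alpha=0.5):
    """Return the indices of the `budget` tokens to keep in the KV cache.

    alpha is a hypothetical mixing weight: 0.0 uses attention only,
    1.0 uses predictive entropy only.
    """
    ent = [entropy(d) for d in next_token_dists]
    attn_n, ent_n = min_max(attn_scores), min_max(ent)
    combined = [(1 - alpha) * a + alpha * e for a, e in zip(attn_n, ent_n)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


attn = [0.9, 0.1, 0.2, 0.8]
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
attention_only = select_tokens(attn, dists, budget=2, alpha=0.0)
entropy_only = select_tokens(attn, dists, budget=2, alpha=1.0)
```

In this toy input the two signals pick disjoint token sets, which is the kind of complementarity the abstract describes.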
[NLP-34] Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT
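[Quick read]: This paper addresses the extreme compression needed to deploy large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices. Existing structured pruning methods collapse at high compression ratios, mainly because of one-shot importance estimation, and their behavior across architectures is unpredictable. The paper proposes a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and a checkable Structural Independence Assumption (SIA) predicts whether per-component pruning criteria are reliable for a given architecture: MHA+GELU designs satisfy the SIA, whereas GQA+SwiGLU designs violate it. On bearing fault diagnosis with models from 88M to 6.25B parameters, the framework extends achievable compression to 13.8x on MHA+GELU architectures at 83.82% accuracy (+3.70 percentage points over the strongest baseline), while exposing an accuracy collapse of about 74 percentage points on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%. The key to the solution is the combination of cascaded multi-granularity pruning with the SIA check, which keeps performance stable at high compression ratios.
Link: https://arxiv.org/abs/2606.26861
Authors: Jinghan Wang, Yanjun Chen, Wei Zhang, Xiaotong Huang, Tianchen Liu, Gaoliang Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE Internet of Things Journal for possible publication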
Abstract:Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios due to one-shot importance estimation, and their cross-architecture behavior remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition predicting whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis spanning 88M to 6.25B-parameter models, the framework extends achievable compression to 13.8 times on MHA+GELU architectures with 83.82% accuracy (+3.70 percentage points (pp) over the strongest baseline), while exposing a ~74pp accuracy collapse on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.
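[NLP-35] FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following
[Quick read]: This paper addresses short-form and long-form speech instruction following under constrained settings. For the short track, the system achieves strong performance on MCIF, with a SIFS score of 2.0708. For the long track, the work explores three speech segmentation methods and introduces a new metric, HIFS, to account for unstable long-form generation. Experiments show that fixed 30-second segmentation is the most robust for long-form input, reaching the highest HIFS score of 2.0663. Further analysis shows that hallucination mainly appears as repetitive insertions in the output, which substantially affects ASR and SSUM, while short-form capabilities are largely retained after long-form extension. The key to the solution is a suitable segmentation strategy together with a new evaluation metric that balances long-form generation stability against instruction-following accuracy.
Link: https://arxiv.org/abs/2606.26819
Authors: Zhihang Xie, Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
Affiliations: Fondazione Bruno Kessler; University of Trento
Categories: Computation and Language (cs.CL)
Comments: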
Abstract:This paper describes our submission to the IWSLT 2026 Instruction Following shared task. SpeechLLMs are developed for both short-form and long-form speech instruction following under constrained settings. For the short track, strong performance is achieved on MCIF, with a SIFS score of 2.0708. For the long track, three speech segmentation methods are explored, and the HIFS score is introduced to account for unstable long-form generation. Experimental results show that fixed 30-second segmentation provides the most robust long-form performance, achieving the highest HIFS score of 2.0663. Further analysis shows that hallucination mainly manifests as repetitive insertions in generated outputs, substantially affecting ASR and SSUM, while short-form capabilities are largely retained after long-form extension.
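Fixed-length segmentation, the strategy the abstract reports as most robust, amounts to cutting the audio timeline into consecutive windows. A minimal sketch, assuming durations in seconds and a hypothetical helper name `fixed_segments` (the submission's actual preprocessing is not described here):

```python
def fixed_segments(duration_s: float, window_s: float = 30.0):
    """Split [0, duration_s) into consecutive windows of at most window_s seconds."""
    if duration_s < 0 or window_s <= 0:
        raise ValueError("duration must be non-negative and window positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)  # the last window may be shorter
        segments.append((start, end))
        start = end
    return segments


segments = fixed_segments(75.0)
```

Each `(start, end)` pair would then be passed to the SpeechLLM separately and the outputs concatenated; that stitching step is not shown.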
[NLP-36] KARLA: Knowledge-base Augmented Retrieval for Language Models
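[Quick read]: This paper addresses the limited factual accuracy, stale knowledge, and poor explainability of large language model (LLM) output. A conventional LLM relies on knowledge frozen into its parameters at training time, so the facts it states cannot be updated dynamically or traced to a source, and small models struggle to match the accuracy of large ones. The core of the proposed solution is to train the model to emit special tokens during generation that trigger a query to an external knowledge base (KB), so that verifiable facts are pulled in at generation time. This brings three benefits: (1) factual knowledge can be updated by editing the KB without retraining; (2) facts in the output can be traced to the KB, which improves transparency and explainability; and (3) smaller models can approach the factual accuracy of larger ones. Experiments show that the method improves factual grounding in both short and long-form generation, and that factual revisions take effect through KB edits rather than parameter updates.
Link: https://arxiv.org/abs/2606.26807
Authors: Francois Crespin, Fabian M. Suchanek (IP Paris, LTCI), Nils Holzenberger
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: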
Abstract:We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1) factual knowledge in the LLM output can be updated without retraining the LLM, (2) facts in the LLM output can be traced to the knowledge base for transparency and explainability, and (3) smaller models can achieve the same factual accuracy as larger models. Our core idea is to train the model to produce special tokens that trigger a query to the knowledge base. Our experiments show that our method improves factual grounding in both short and long-form generation, and allows factual revisions to take effect through KB edits rather than parameter updates.
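The mechanism of special tokens that trigger a knowledge-base query can be illustrated with a toy resolver. Everything specific here is hypothetical: the `<kb>subject|relation</kb>` markup, the dictionary-backed KB, and the post-hoc substitution, which stands in for the lookup that the abstract describes as happening during token generation.

```python
import re

# Hypothetical markup for a KB call emitted by the model.
KB_CALL = re.compile(r"<kb>(.*?)\|(.*?)</kb>")


def resolve(text: str, kb: dict) -> str:
    """Replace each KB call in `text` with the fact stored in `kb`."""
    def lookup(match):
        key = (match.group(1).strip(), match.group(2).strip())
        return kb.get(key, "[unknown]")
    return KB_CALL.sub(lookup, text)


kb = {("France", "capital"): "Paris"}
draft = "The capital of France is <kb>France|capital</kb>."
before_edit = resolve(draft, kb)

# Editing the KB changes the output while the "model" text stays the same.
kb[("France", "capital")] = "Lutetia"
after_edit = resolve(draft, kb)
```

The second call shows the property the abstract emphasizes: a factual revision takes effect through a KB edit, with no parameter update.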
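[NLP-37] OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
[Quick read]: This paper addresses two problems for language agents: outcome-based reinforcement learning gives sparse trajectory-level rewards that offer little guidance on intermediate decisions, and existing on-policy self-distillation methods rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution of the current policy. The key to the solution is OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories, giving dense hindsight supervision that matches the current policy distribution. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when a critical decision is identified and otherwise falls back to episode-level skills. The selected skill is injected into the interaction history so that the old policy can re-score the same sampled response under both the original and the skill-augmented context; the resulting log-probability shift gives a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus keeps RL as the primary training objective while adding dense, distribution-matched hindsight supervision, and it improves agent performance, sample efficiency, and robustness on ALFWorld, WebShop, and search-based QA.
Link: https://arxiv.org/abs/2606.26790
Authors: Shuo Yang, Jinyang Wu, Zhengxi Lu, Yuhao Shen, Fan Zhang, Lang Feng, Shuai Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Affiliations: Tsinghua University; Zhejiang University; The Chinese University of Hong Kong; Nanyang Technological University; Tongji University
Categories: Computation and Language (cs.CL)
Comments: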
Abstract:Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and falls back to episode-level skills as default guidance otherwise. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both original and skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at this https URL.
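The re-scoring step can be sketched numerically: the same sampled tokens are scored twice by the old policy, and the per-token log-probability shift is added to the trajectory-level outcome advantage. The additive combination and the weight `beta` are assumptions for illustration; the abstract only says the two advantages are combined.

```python
def self_distillation_advantage(logp_original, logp_skill):
    """Per-token log-probability shift: skill-augmented context minus original context."""
    if len(logp_original) != len(logp_skill):
        raise ValueError("both scorings must cover the same sampled tokens")
    return [s - o for o, s in zip(logp_original, logp_skill)]


def combined_advantage(outcome_advantage, logp_original, logp_skill, beta=0.1):
    """Broadcast the outcome advantage to every token and add the weighted shift.

    beta is a hypothetical weight, not a value taken from the paper.
    """
    shift = self_distillation_advantage(logp_original, logp_skill)
    return [outcome_advantage + beta * d for d in shift]


# Token 0 becomes more likely given the skill, token 1 becomes less likely.
logp_original = [-2.0, -1.0]
logp_skill = [-1.0, -1.5]
advantages = combined_advantage(1.0, logp_original, logp_skill, beta=0.5)
```

Tokens that the skill-augmented context supports receive a larger advantage than the sparse outcome reward alone would give them.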
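[NLP-38] AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing KDD2026
[Quick read]: This paper addresses the limited interpretability, poor use of unstructured information, and misalignment with long-term business objectives (cumulative Gross Merchandise Value, GMV; Return on Investment, ROI; milestone achievement rate) of traditional dynamic pricing models on large-scale e-commerce platforms. The core of the solution is AIGP, a framework in which a large language model (LLM) guided by domain knowledge combines structured data with textual context to produce interpretable, knowledge-aware pricing decisions. The key innovation is the Long-Term Value Estimator (LTVE), trained with offline reinforcement learning on historical data, which serves as a reward model to score candidate pricing actions and to build preference pairs for Direct Preference Optimization (DPO), aligning the pricing policy with long-term business objectives. To balance deployment efficiency and output quality, the system also uses supervised fine-tuning for knowledge distillation. Extensive offline evaluation and large-scale online A/B tests on Tao Factory show that, compared with the production baseline, AIGP improves GMV by 13.21%, ROI by 7.59%, and milestone achievement rate by 8.20% over 14 days, while providing transparent and interpretable pricing rationales.
Link: https://arxiv.org/abs/2606.26787
Authors: Chennan Ma, Yanning Zhang, Siqi Hong, Xiuchong Wang, Fei Xiao, Keping Yang
Affiliations: Taobao Tmall Group of Alibaba; Alibaba
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by KDD 2026 Applied Data Science Track (Oral presentation)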
Abstract:Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross Merchandise Value (GMV), Return on Investment (ROI) and milestone achievement. We propose AIGP, a novel framework that leverages a Large Language Model (LLM) prompted with domain knowledge, structured data and textual context to make interpretable, knowledge-aware pricing decisions. For efficient deployment while maintaining high-quality outputs, we employ supervised fine-tuning for knowledge distillation. Central to AIGP is the Long-Term Value Estimator (LTVE), trained via offline reinforcement learning on historical data, which serves as a reward model to score candidate pricing actions and select preference pairs for Direct Preference Optimization (DPO), thereby aligning the pricing policy with long-term business objectives. Extensive offline evaluations and large-scale online A/B tests on Tao Factory demonstrate that AIGP achieves significant improvements: +13.21% in GMV, +7.59% in ROI, and +8.20% in milestone achievement rate over 14 days compared to the production baseline, while simultaneously providing interpretable and transparent pricing rationales.
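How a reward model can feed DPO is sketched below: candidates are scored, the best and worst form a preference pair, and the standard DPO loss is evaluated on that pair. The best-versus-worst selection rule and the toy value function are assumptions; the abstract does not say how AIGP picks its pairs or what the LTVE computes internally.

```python
import math


def preference_pair(candidates, value_estimator):
    """Return (chosen, rejected): the highest- and lowest-scored candidates."""
    scored = sorted(candidates, key=value_estimator)
    return scored[-1], scored[0]


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    Equals -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def toy_long_term_value(price):
    """Hypothetical stand-in for a learned value estimator: peaks at a price of 11.0."""
    return -(price - 11.0) ** 2


chosen, rejected = preference_pair([9.9, 12.5, 11.0], toy_long_term_value)
loss_at_init = dpo_loss(0.0, 0.0, 0.0, 0.0)
```

When the policy equals the reference model the loss is log 2, and it falls as the policy raises the chosen action's log-probability relative to the rejected one.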
[NLP-39] Reproducibility Study of “AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models”
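[Quick read]: This paper examines catastrophic forgetting in knowledge editing for language models, where updating specific facts can disrupt knowledge that should be preserved. The method under study, AlphaEdit, uses a null-space constrained projection that theoretically guarantees edits do not disturb preserved knowledge, by restricting the edit to the null space associated with that preserved knowledge. The reproducibility analysis finds that the reported performance is reproduced on the original model architectures, but that the practical strength of the theoretical guarantee depends on architecture and editing scale: on newer models the advantage does not generalize because architectural assumptions are violated; when the number of sequential edits grows far beyond the original setting, performance degrades, so the protection offered by the null-space projection is bounded rather than unconditional; and in the extended evaluation, large-scale sequential editing harms both general downstream competence and safety-relevant refusal behavior. AlphaEdit therefore works as reported within its original scope, but deployment needs to account for model architecture and editing scale.
Link: https://arxiv.org/abs/2606.26783
Authors: Ananth K S, Arya Hariharan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 2 figures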
Abstract:Fang et al. (2025) introduced a null-space constrained projection, named AlphaEdit, for locate-then-edit knowledge editing methods, theoretically guaranteeing that edits do not disrupt previously preserved knowledge, and reported substantial gains over existing editing methods on LLaMA3, GPT2-XL, and GPT-J. In this work, we present a reproducibility study of AlphaEdit, reproducing its reported results under the original experimental setup and extending the evaluation along three axes: new model architectures, additional downstream benchmarks, and substantially longer sequential editing horizons. We successfully reproduce AlphaEdit’s reported metrics across the original models, though we identify a discrepancy in the reported fluency and consistency metric. Extending AlphaEdit to newer model families, we find that its advantage does not generalize uniformly, which we trace to architectural assumptions in the locate-then-edit paradigm that are violated by these newer models. We further stress-test AlphaEdit’s central sequential-editing claim by extending the number of edits well beyond those evaluated in the original paper, and find that performance, which is stable at the originally reported scale, degrades as edits reach a much higher count, indicating that the null-space projection’s protection against catastrophic forgetting is bounded rather than unconditional. Finally, we extend evaluation of edited models on three extra benchmarks, namely, BoolQ, HellaSwag, and XSTest, and we find that large-scale sequential editing degrades both general downstream task competence and safety-relevant refusal behavior. Our results confirm that AlphaEdit performs as reported within its original scope, while showing that its core theoretical guarantees are sensitive to model architecture and editing scale in ways that have practical implications for its deployment.
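The intuition behind a null-space constraint can be shown in its simplest form: if a weight update is projected so that it maps a preserved key vector to zero, the layer's output for that key does not change. The sketch below handles a single preserved key with plain lists; it is a simplification for illustration, not AlphaEdit's procedure, which works with a whole set of preserved keys.

```python
def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def project_out(delta, key):
    """Project each row of `delta` onto the orthogonal complement of `key`.

    After projection, matvec(result, key) is zero (up to rounding), so adding
    the result to a weight matrix leaves the output for `key` unchanged.
    """
    norm_sq = sum(k * k for k in key)
    if norm_sq == 0:
        raise ValueError("key must be non-zero")
    projected = []
    for row in delta:
        coeff = sum(r * k for r, k in zip(row, key)) / norm_sq
        projected.append([r - coeff * k for r, k in zip(row, key)])
    return projected


delta = [[1.0, 2.0], [3.0, 4.0]]   # unconstrained update
preserved_key = [1.0, 0.0]         # key whose output must not move
safe_delta = project_out(delta, preserved_key)
```

The update still acts on directions orthogonal to the preserved key, which is where new knowledge can be written.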
[NLP-40] Evaluation Pitfalls and Challenges in Multimedia Event Extraction ACL2026
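[Quick read]: This paper addresses inconsistent, non-comparable, and insufficiently rigorous evaluation in multimedia event extraction. Recent work reports substantial progress, but the reliability of those results depends heavily on consistent and strict evaluation. Through a systematic analysis, the paper identifies three main sources of problems: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings. A series of controlled experiments under a strict evaluation framework shows that minor evaluation choices (for example, data preprocessing or the scoring criterion) can cause large swings in reported performance and overestimate a model's ability to ground real-world events across modalities. The key contribution is to argue for comparable, stricter evaluation standards and to push the field toward a more rigorous evaluation practice.
Link: https://arxiv.org/abs/2606.26775
Authors: Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer
Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026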
Abstract:Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and substantial progress, the reliability and comparability of these results critically depend on consistent and rigorous evaluation. In this work, we present the first systematic analysis of evaluation pitfalls in multimedia event extraction and identify three major sources of issues: inconsistent data processing, inconsistent task assumptions, and overly relaxed evaluation settings. We demonstrate, through a series of controlled experiments under a strict evaluation framework, that minor evaluation choices can cause large performance variations and lead to overestimation of a model’s ability to ground real-world events across modalities. Our findings highlight the need for comparable evaluation standards and encourage a shift toward more rigorous evaluation in multimedia event extraction.
[NLP-41] Structure Before Collapse: Transient semantic geometry in next-token prediction
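[Quick read]: The core question of this paper is the following. Language models are trained with effectively one-hot labels, so co-occurrence statistics across contexts degenerate through sparsity and carry no explicit semantic association, yet the models still learn latent semantic structure (for example, that the blank in "Mary broke the ___" is likely a medium-sized, rigid, inanimate noun). This conflicts with the symmetric configuration predicted by Neural Collapse. The key finding is that, without any explicit semantic supervision, gradient descent itself drives representations to cluster by shared attributes early in training, so a latent semantic geometry emerges at an early stage. This structure is transient rather than a final stable state: with enough capacity and training time, the system eventually reaches the fully symmetric, equally separated configuration that Neural Collapse predicts. The study verifies the phenomenon in three controlled synthetic settings, uses Gram matrix analysis to characterize the transition from semantic structure to the symmetric state, and proposes a preliminary modification of the commonly used unconstrained features model to capture the emergent semantic geometry.
Link: https://arxiv.org/abs/2606.26749
Authors: Yize Zhao, Isabel Papadimitriou, Christos Thrampoulidis
Affiliations: The University of British Columbia
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: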
Abstract:Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other; a symmetric configuration that depends only on the output label and ignores any semantic similarity in the inputs. This creates a puzzle: next-token prediction language models are trained predominantly (as context length increases) with one-hot labels: the same context is very unlikely to appear twice in training with different labels. However, they clearly learn latent structural features. That is, despite the one-hot training regime, a language model’s contextual embeddings represent the fact that the next word in “Mary broke the ___” is likely to be filled by tokens in the latent classes of a) medium-sized, b) rigid, c) inanimate nouns. How does gradient descent find such categorical semantic structure when co-occurrence statistics collapse to one-hot sparsity, eliminating any shared next-tokens among different contexts? To investigate this tension we identify three synthetic controlled settings where inputs have latent semantic factors but are mapped to distinct one-hot labels. We find that semantic geometry emerges early in training, and that representations cluster by shared attributes despite receiving no explicit supervision to do so. This structure is transient: with sufficient capacity and time, the model eventually reaches the predicted symmetric state where all representations are equally separated. We study this phase transition through Gram matrix analysis and propose a preliminary modification to the commonly used unconstrained features model to capture the emergent semantic geometry.
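A Gram matrix makes the difference between the two regimes easy to see: in the symmetric state every pair of centered representations has the same cosine similarity, while a semantically clustered state shows unequal off-diagonal entries. The sketch below is a generic diagnostic written for illustration; the function names and the spread statistic are assumptions, not the paper's analysis.

```python
import math


def centered_cosine_gram(vectors):
    """Cosine-similarity Gram matrix of the vectors after subtracting their mean."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (norms[i] * norms[j])
             for j, cj in enumerate(centered)]
            for i, ci in enumerate(centered)]


def off_diagonal_spread(gram):
    """Max minus min off-diagonal entry; near zero means equally separated vectors."""
    n = len(gram)
    values = [gram[i][j] for i in range(n) for j in range(n) if i != j]
    return max(values) - min(values)


# Symmetric configuration: three mutually equidistant representations.
symmetric = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Clustered configuration: the first two inputs share a latent attribute.
clustered = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

gram_symmetric = centered_cosine_gram(symmetric)
gram_clustered = centered_cosine_gram(clustered)
```

For K equally separated classes the off-diagonal cosine is -1/(K-1), so the three symmetric vectors give -0.5 everywhere; tracking the spread over training steps would show the transient structure forming and then dissolving.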
[NLP-42] HyperDFlash: MHC-Aligned Block Speculative Decoding with Gated Residual Reduction
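[Quick read]: This paper addresses the loss of draft quality and acceptance rate that conventional speculative decoding suffers under the multi-hyper-connection (MHC) architecture proposed by DeepSeek-V4, caused by feature misalignment. The native Multi-Token Prediction (MTP) module of DeepSeek-V4 drafts the first tokens well, but its accuracy degrades sharply at later positions as errors from unverified intermediate tokens accumulate, which hurts decoding efficiency. The efficient one-pass block drafting method DFlash cannot be adapted directly to MHC, because the multi-path residual stream is misaligned with conventional drafter designs. The paper proposes two model-aligned optimizations. First, it uses pre-collapse residual states as the only conditioning signal, which preserves multi-path structural information and aligns the drafter with the target model's native prediction pathway. Second, it replaces the heavy generic linear compressor with a lightweight gated residual reducer whose parameters are inherited from the built-in hyper-connection head, giving input-aware path aggregation with three orders of magnitude fewer parameters while keeping architectural alignment. In addition, a targeted KL distillation loss on the LM head regularizes predictions and improves draft quality at early training stages. Experiments on math reasoning, code synthesis, and conversational benchmarks show that HyperDFlash consistently outperforms both the native MTP baseline and the vanilla DFlash adaptation, with substantial gains in average accepted draft length and decoding speedup.
Link: https://arxiv.org/abs/2606.26744
Authors: Luxi Lin, Shuang Peng, Rui Ma, Junhao Hua, Shuwei Fan, Zhengda Qin, Qiang Wang, Hongjian Sun, Fangmin Chen, Songwei Liu
Affiliations: ByteDance; Zhejiang University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: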
Abstract:We present HyperDFlash, a block-parallel speculative decoding framework tailored to the novel multi-hyper-connection (MHC) architecture proposed by DeepSeek-V4. Despite the strong initial-token drafting performance of the native Multi-Token Prediction (MTP) module in DeepSeek-V4, its draft accuracy degrades sharply at later positions, as error accumulation from unverified intermediate tokens harms acceptance rates. Although the original DFlash method supports efficient one-pass block drafting, it cannot be seamlessly adapted to the MHC paradigm, since the multi-path residual stream of DeepSeek-V4 induces feature misalignment with conventional drafting designs. To resolve this mismatch, we propose two model-aligned optimizations for MHC residual streams. First, we adopt pre-collapse residual states as the exclusive conditioning signal, preserving multi-path structural information and aligning the drafter with the native prediction pathway of the target model. Second, we replace the heavy generic linear compressor with a lightweight gated residual reducer, whose parameters are inherited from the built-in hyper-connection head. This design yields input-aware path aggregation with three orders of magnitude fewer parameters while maintaining architectural alignment. We further enhance training via a targeted KL distillation loss applied to the LM-head, which regularizes predictions against the full target probability distribution and improves draft quality at early training stages. Experiments across math reasoning, code synthesis, and conversational benchmarks show that HyperDFlash consistently outperforms both the native MTP baseline and vanilla DFlash adaptation. It achieves substantial gains in average accepted draft length and decoding speedup, validating the effectiveness of MHC alignment, gated reduction, and targeted distillation for high-performance speculative decoding.
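[NLP-43] Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification
[Quick read]: This paper addresses the difficulty of automatically classifying logical fallacies, which often appear in nuanced and varied forms. The core challenge is to capture the link between the abstract reasoning structure of a fallacy and context-level linguistic cues. The key to the solution is to use large language models (LLMs) to inductively extract abstract logical patterns from fallacious examples and their explanations, and to merge these patterns with contextual linguistic cues. Experiments show statistically significant improvements over zero-shot baselines in zero-shot and one-shot configurations, and the approach outperforms competing methods. Cross-dataset experiments validate generalization, indicating that data-driven pattern extraction is an effective way to generate logical representations.
Link: https://arxiv.org/abs/2606.26698
Authors: Eleni Papadopulos, Firoj Alam, Giovanni Da San Martino
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: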
Abstract:In today’s fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, often fallacies appear in nuanced forms that complicate automated classification. In this study, we investigate whether merging abstract logical structures with context-level linguistic cues proves beneficial for fallacy classification, developing a framework that inductively extracts such patterns from fallacious examples and their explanations using Large Language Models (LLMs). We evaluate the impact of these patterns across different LLMs and experimental zero- and one-shot configurations, showing statistically significant improvements over zero-shot baselines and outperforming competing approaches. Cross-dataset experiments validate generalization, establishing data-driven pattern extraction as an effective method for generating logical representations.
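[NLP-44] Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation
[Quick read]: This paper addresses the high compute cost and latency of current safety guardrails that rely on chain-of-thought (CoT) reasoning. It is commonly assumed that step-by-step reasoning improves a decision, but this design is heavy and slow in real deployments, for example on resource-constrained devices such as embodied robots. The author asks whether a safety guardrail needs to reason at all. The key to the answer is a controlled experiment that removes only the reasoning while keeping the base model and training corpus fixed, which yields a lightweight bidirectional encoder called LeanGuard. The 395M label-only encoder reaches an average F1 of 82.90 ± 0.26 over public benchmarks, matching a reasoning guard built on a much larger decoder, while using a single forward pass over at most 512 tokens, roughly a 100x reduction in inference compute. The encoder also stays robust under training-label noise and retains far more recall than the reasoning guard at a strict false-positive rate. The study further suggests that current guardrail benchmarks may not be hard enough to reward reasoning, so the necessity of CoT for moderation remains unproven.
Link: https://arxiv.org/abs/2606.26686
Authors: Dongbin Na
Affiliations: Pohang University of Science and Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, 3 tables. Project page: this https URL ; code and models: this https URL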
Abstract:In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied robot. In this paper, we pose a question whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reaches an average F1 of 82.90 ± 0.26 over public benchmarks. It matches a reasoning guard that is built on a much larger decoder, while it uses only a single forward pass over an input of at most 512 tokens. This is roughly a 100x reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard, so a heavier reasoning guard is not the more robust choice either. Our finding suggests that the current guardrail benchmarks may not be hard enough to reward reasoning, and that the necessity of CoT for moderation is still not proven. We release all source code and models including LeanGuard at this https URL.
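[NLP-45] CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs ICML2026
[Quick read]: This paper addresses the severe performance degradation that conventional ternary quantization causes in large language models (LLMs), and in particular the high compute cost and deployment complexity of state-of-the-art methods that depend on data-intensive quantization-aware training (QAT). The core of the solution is CAT-Q, an efficient and accurate post-training ternary quantization method with two jointly optimized components: learnable modulation (LM) and softened ternarization (ST). LM uses a composition of learnable factors to modulate the distribution of the pre-trained high-precision weights and the ternary threshold, making the weights less sensitive to ternarization; ST introduces a differentiable transition function that guides the ternarization process toward stable convergence. Experiments show that CAT-Q quantizes LLMs with 1.7B to 8B parameters into ternary models using only 512 calibration samples, outperforming the BitNet 1.58-bit families trained with 100B tokens, which is about a 100,000x reduction in training tokens. It is also the first to quantize much larger LLMs with 14B to 235B parameters into high-quality ternary models within 8 to 60 hours on 8 A100-80GB GPUs.
Link: https://arxiv.org/abs/2606.26650
Authors: Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Anbang Yao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This work is accepted to ICML 2026 as an oral. The project page: this https URL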
Abstract:In this paper, we present CAT-Q, Cost-efficient and Accurate Ternary Quantization, for compressing and accelerating LLMs. Unlike existing state-of-the-art ternary quantization methods that rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q is a simple yet effective post-training quantization scheme that is readily applicable to LLMs with diverse architectures and model sizes. It has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM leverages a composition of learnable factors to modulate the distribution of pre-trained high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST further introduces a differentiable transition function to guide the ternarization process toward stable convergence. We show that, for pre-trained LLMs with 1.7B to 8B parameters, CAT-Q can efficiently quantize them into ternary models using only 512 calibration samples, while achieving better performance than the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters) trained with 100B tokens, yielding about a 100,000X reduction in training tokens. Moreover, we show for the first time that CAT-Q can quantize much larger pre-trained LLMs having 14B to 235B parameters into leading ternary models within just 8 to 60 hours on 8 A100-80GB GPUs. Code is available at this https URL.
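For readers unfamiliar with ternarization, the sketch below shows the plain threshold-based baseline that learned schemes build on: each weight becomes -1, 0, or +1 times a shared scale. It is not CAT-Q. The fixed 0.7 threshold ratio follows the heuristic from Ternary Weight Networks and is exactly the kind of hand-set quantity that the abstract says CAT-Q modulates with learnable factors.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map weights to codes in {-1, 0, +1} plus one shared scale.

    threshold = threshold_ratio * mean(|w|); weights at or below it become 0.
    scale = mean |w| over the weights that stay non-zero.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > threshold]
    scale = sum(kept) / len(kept) if kept else 0.0
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.8, 0.05, -0.1, 0.7]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
```

The hard threshold makes this mapping non-differentiable, which is the motivation the abstract gives for a softened, differentiable transition function.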
[NLP-46] From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning
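[Quick read]: This paper addresses catastrophic forgetting in continual learning for large language models, and in particular the weak performance of weight-space regularization methods such as Elastic Weight Consolidation (EWC) at this scale. The core argument is that large language models are polysemantic, meaning a single weight typically encodes several concepts, so the per-weight importance estimates used by EWC-style methods are too coarse to isolate the knowledge that needs protection. The paper proposes to regularize in activation space instead, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. From a constrained-optimization perspective, it derives a new loss that uses the SAE feature dictionary to explicitly balance stability and plasticity, and shows that EWC is a special case in the one-sided weight-space penalty setting. Unlike replay-based methods that store or revisit earlier examples, the method needs no previous-task data after mask construction: current-task data is used to compute a compact SAE feature mask, and only this mask is kept for later training, which is also more memory efficient because the feature space is much lower-dimensional than the parameter space. On the TRACE and MedCL benchmarks, the method achieves the strongest result among approaches without task-specific architectural components and surpasses traditional weight-space regularization. The paper also gives empirical support for the polysemanticity thesis: task-relevant representations are linearly separable in the SAE feature basis but indistinguishable from chance in the weight basis, so weight-space protection is nearly non-selective at the concept level.
Link: https://arxiv.org/abs/2606.26629
Authors: Evan Ning, Wei Xue, Dong Lou, Yike Guo
Affiliations: The Hong Kong University of Science and Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 4 figures, 6 tables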
Abstract:Weight-space regularization methods such as Elastic Weight Consolidation (EWC) are the standard approach to catastrophic forgetting in continual learning. However, those methods tend to underperform when applied to large language models. We argue that such underperformance can be partly explained by the “polysemantic” nature of large language models: per-weight importance estimates utilized by EWC-style regularization are too coarse and cannot isolate the knowledge that needs protection. In this paper, we propose regularizing instead in the model’s activation space, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. From the perspective of constrained optimization, we derive a new loss function that uses the SAE feature dictionary to explicitly balance stability and plasticity, and show that EWC is a special case in the one-sided weight-space penalty setting. Unlike replay-based methods that store or revisit examples from earlier tasks, our method requires no previous-task data after mask construction: current-task data is used to compute a compact SAE feature mask, and only this mask is retained for later training. Further, since the feature space has significantly lower dimensionality than the parameter space, the proposed method is more memory efficient. On the TRACE and MedCL continual learning benchmarks, the method achieves the strongest result among approaches without introducing task-specific architectural components, also surpassing traditional weight-space regularization methods like EWC. Beyond performance comparisons, we provide empirical evidence for the polysemanticity thesis: task-relevant representations are linearly separable in the SAE feature basis but indistinguishable from chance in the weight basis, and weight-space protection is nearly non-selective at the concept level.
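The contrast between the two kinds of regularizer can be made concrete with two small penalty functions. The first is the standard EWC quadratic penalty on weights; the second is a generic masked penalty on feature activations, written here only to illustrate what regularizing in feature space means. The paper's actual loss is derived differently and is not reproduced; the mask semantics and all names are assumptions.

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2 over individual weights."""
    return 0.5 * lam * sum(f * (t - s) ** 2
                           for t, s, f in zip(theta, theta_star, fisher))


def masked_feature_penalty(features_new, features_old, mask):
    """Hypothetical feature-space penalty: squared drift on protected features only.

    mask[k] = 1 marks a feature whose activation should stay put; 0 leaves it free.
    """
    return sum(m * (n - o) ** 2
               for n, o, m in zip(features_new, features_old, mask))


# Weight space: every weight is penalized in proportion to its importance estimate.
weight_cost = ewc_penalty(theta=[1.0, 2.0], theta_star=[0.0, 2.0], fisher=[2.0, 5.0])

# Feature space: feature 1 drifts a lot but is unprotected, so only feature 0 counts.
feature_cost = masked_feature_penalty(
    features_new=[1.5, 2.0, 3.0],
    features_old=[1.0, 0.0, 3.0],
    mask=[1, 0, 1],
)
```

Only the mask needs to be stored between tasks in the second case, which matches the memory argument made in the abstract.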
[NLP-47] Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean
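[Quick Read]: This paper addresses the marked drop in speech quality that text-to-speech (TTS) models show on low-resource languages, in particular the gap between high-resource languages (such as English) and the two languages studied here, Khmer and Korean. Large pretrained TTS models are close to human level on high-resource languages but clearly weaker on languages that are rare in their training data. The study proposes a unified fine-tuning strategy built on a shared, language-tagged corpus. It takes VoxCPM2, a 2.4B-parameter tokenizer-free TTS model that combines a MiniCPM-4 language-model backbone with a flow-matching diffusion decoder, and adds a single zero-initialized Low-Rank Adaptation (LoRA) adapter trained on Khmer and Korean at once, updating only 0.19% to 3.03% of the model's parameters. The Khmer Mean Opinion Score (MOS) rises from 3.85 to 4.23 (best adapter rank 64; paired Wilcoxon test, p < 0.001), showing that the adapter gives a significant gain on a language where the base model is weak. Automatic loss and human ratings disagree on the best rank: validation loss is lowest at rank 128, while MOS peaks at rank 64, which points to a possible disconnect between automatic metrics and human perception. The adapter brings no gain for Korean, which the base model already handles well, and at high rank it even degrades quality. The key to the solution is therefore a lightweight zero-initialized LoRA adapter that transfers efficiently across languages: a very small parameter update substantially improves synthesis quality for a low-resource language, and the size of the effect depends heavily on how well the base model already handles the target language.
Link: https://arxiv.org/abs/2606.26618
Authors: Phannet Pov, Sovandara Chhoun, Hyun Woo Park, Wan-Sup Cho, Saksonita Khoeurn
Affiliations: 1. National University of Cambodia; 2. Phnom Penh Institute of Technology; 3. Korea Institute of Science and Technology; 4. Seoul National University
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure, 4 tables. IEEE conference format (IEEEtran)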
Abstract:Large pretrained text-to-speech (TTS) models sound almost human for well-resourced languages, but much worse for languages that are rare in their training data. We study this quality gap for Khmer and Korean using VoxCPM2, a 2.4B-parameter, tokenizer-free TTS model that joins a MiniCPM-4 language-model backbone with a flow-matching diffusion decoder. We build one shared, language-tagged corpus of about 26 hours and adapt VoxCPM2 with a single Low-Rank Adaptation (LoRA) adapter, trained on both languages at once and added to both the language model and the decoder. The adapter is zero-initialized, so training starts exactly at the original (zero-shot) model. In native-speaker listening tests, the Khmer Mean Opinion Score (MOS) rises from 3.85 to 4.23 with the best adapter (rank 64), a highly significant gain (paired Wilcoxon test, p < 0.001), while training only 0.19 to 3.03 percent of the parameters. The automatic loss and the human ratings, however, disagree on the best rank: validation loss is lowest at rank 128, yet MOS peaks at rank 64. The same adapter brings no gain for Korean, a language the base model already handles well, and at a high rank it even degrades quality. Adaptation therefore helps mainly where the base model is genuinely weak.
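The claim that a zero-initialized adapter starts training exactly at the original model can be illustrated with a minimal sketch. It assumes the common LoRA form W x + (alpha / rank) * B (A x) with B initialized to zero; the function names, the tiny dimensions, and the scaling constant are illustrative assumptions and are not taken from the paper or from VoxCPM2.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A is small random, B is all zeros (assumed init)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_forward(weight, a, b, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(weight, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    weight = [[1.0, 2.0], [3.0, 4.0]]   # stands in for a frozen base layer
    a, b = init_lora(d_out=2, d_in=2, rank=1)
    x = [1.0, 1.0]
    # With B == 0 the adapter contributes nothing, so both lines match.
    print(matvec(weight, x))
    print(lora_forward(weight, a, b, x, alpha=16, rank=1))
```

Because B is zero, the product B (A x) vanishes regardless of A, so the adapted layer reproduces the base layer until the first gradient step changes B.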
[NLP-48] Zero-shot Tweet-Level Stance Detection Enhanced by External Knowledge and Reflective Chain-of-Thought Reasoning
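[Quick Read]: This paper targets two core problems in zero-shot tweet-level stance detection: the context sparsity inherent in short texts, and the difficulty of judging how relevant an implicit stance target is to the text. Existing methods mostly rely on external knowledge and overlook the intrinsic semantic cues carried by key entities inside the text; models also struggle to separate the "neutral" and "irrelevant" labels for unseen targets. The authors first build a four-class, multi-topic Japanese tweet dataset (KIRP-D), which is, to the authors' knowledge, the first Japanese tweet-level dataset for stance detection. On top of it they propose the KIRP framework, whose key idea is to combine external knowledge with entity reorganization for data augmentation and to reason through prompt chaining. Concretely, knowledge graphs supplement and reorganize key textual entities, and reflective Chain-of-Thought (CoT) reasoning extracts and validates implicit stance targets. To better separate "neutral" from "irrelevant", the framework adds stance-aware contrastive learning to capture discriminative features and a three-layer iterative prototype network for fine-grained classification. KIRP achieves state-of-the-art results on SemEval-2016, WT-WT, and KIRP-D: an F1 of 84.05% (three-class) on SemEval-2016, and 84.99% and 79.18% (four-class) on WT-WT and KIRP-D, respectively.
Link: https://arxiv.org/abs/2606.26571
Authors: Yiju Huang, Wenxian Wang, Lijun Zhou, Rui Tang, Xiao Lan, Tao Zhang, Haizhou Wang
Affiliations: Sichuan University; Beijing Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: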
Abstract:Zero-shot tweet-level stance detection confronts two primary challenges: (1) mitigating the context sparsity inherent in short texts, and (2) establishing the relevance between implicit targets and textual content. While existing methods primarily focus on incorporating external knowledge, they neglect the intrinsic semantic cues embedded within key intra-textual entities. Furthermore, current models exhibit limited capability in determining the relevance of unseen targets to the given text, thereby struggling to differentiate between “neutral” and “irrelevant” stance labels. To address these issues, we first construct a four-class, multi-topic Japanese tweet dataset. To our knowledge, this is the first Japanese tweet-level dataset for stance detection. We then propose KIRP, a zero-shot stance detection framework. It integrates external knowledge with entity reorganization for data augmentation and employs prompt chaining for reasoning. Specifically, the framework incorporates knowledge graphs to supplement and reorganize key textual entities, while reflective Chain-of-Thought (CoT) reasoning extracts and validates implicit targets. To better distinguish “neutral” from “irrelevant” labels, we adopt stance-aware contrastive learning to capture discriminative features and design a three-layer iterative prototype network for fine-grained classification. Experimental results on SemEval-2016, WT-WT, and KIRP-D show that KIRP achieves state-of-the-art performance. KIRP obtains F1 scores of 84.05% (three-class) on SemEval-2016, and 84.99% and 79.18% (four-class) on WT-WT and KIRP-D, respectively.
[NLP-49] Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models
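[Quick Read]: This paper addresses the fragmentation of adversarial evaluation into four research tracks that use inconsistent terminology, threat models, and benchmarks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. Its core contribution is a unified conceptual framework that integrates the four tracks into a shared taxonomy, evaluation criteria, and research agenda, with a focus on the LLM side. The key elements are a six-class taxonomy of diffusion roles, a threat-model axis with three dimensions (attacker knowledge, query budget, and target accessibility), and a five-dimension evaluation framework (attack success rate, transferability, query budget, perplexity, defense evasion) applied consistently across modalities. Taking a dual attacker-defender view, the survey also covers four diffusion-based defenses as the natural evaluation backdrop for new attacks. From a review of 50 papers it identifies five recurring weaknesses in the current LLM-side literature and proposes open questions and concrete experimental designs. The companion catalog and spreadsheet are released publicly. The authors stress that this is a narrative review with quality assessment, not a PRISMA-compliant systematic review, and discuss what that means for replication.
Link: https://arxiv.org/abs/2606.26566
Authors: Abrar Alotaibi, Moataz Ahmed
Affiliations: King Fahd University of Petroleum & Minerals; Imam Abdulrahman Bin Faisal University; SDAIA-KFUPM Joint Research Center for Artificial Intelligence
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: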
Abstract:Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. Each has developed its own vocabulary, threat models, and benchmarks, with denoising diffusion models emerging as a shared generative mechanism whose recipes are now actively ported between communities. This survey performs an information-fusion exercise at the meta-research level: we integrate these four tracks into a single conceptual framework with a unified taxonomy, evaluation criteria, and research agenda, focusing on the LLM-side slice. We catalog fifty published papers across four scope areas (text/LLM, image classifier, vision-language model, defense), plus four diffusion-LLM-as-victim entries and ten non-diffusion baselines against which any new attack must be compared. We propose a six-class taxonomy of diffusion roles in adversarial pipelines, augmented by a threat-model axis recording attacker knowledge, query budget, and target accessibility, and apply a five-dimension framework (attack success rate, transferability, query budget, perplexity, defense-evasion) uniformly across modalities. The review adopts a dual attacker-defender perspective: alongside the attack catalog we cover four diffusion-based defenses that form the natural evaluation backdrop for new attacks. Our critical analysis identifies five recurring weaknesses of the current LLM-side literature, and we close with a research agenda of open questions and concrete experimental designs. The companion catalog and spreadsheet are released with the paper. We are explicit that this is a narrative review with quality assessment, not a PRISMA-compliant systematic review, and discuss the implications for replication.
[NLP-50] Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
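[Quick Read]: This paper addresses the limited ability of delta-rule linear attention to handle stale information in recurrent memory updates. Existing methods can only correct content at the current write address; they cannot actively remove outdated content stored at other addresses, so memory contamination accumulates over long sequences. The proposed update rule, Erase-then-Delta Attention (EDA), decouples where to erase from where to write, letting the model independently choose an address and clear stale information there. EDA first applies a targeted erase step along a learned erase direction and then performs the standard delta-style corrective write along the current write direction. This keeps the corrective behavior of the delta rule while giving the model more active control over its memory state. In language-model pretraining, EDA performs best on both a dense 2.5B model and a mixture-of-experts (MoE) 25B-A2.8B model; after 80B-token long-context midtraining of the MoE models it also leads in long-context evaluations from 4k to 128k. A compact update analysis and memory-state probes indicate that EDA allocates its additional cleanup path most strongly when passive decay is weak. The results suggest that recurrent memory models should decide not only what to write but also where to erase stale information.
Link: https://arxiv.org/abs/2606.26560
Authors: Xiao Li, Chengruidong Zhang, Hao Luo, Xi Lin, Zekun Wang, Zihan Qiu, Yunfei Mao, Langshi Chen, Man Yuan, Minmin Sun, Huiqiang Jiang, Siqi Zhang, Rui Men, Wei Hu, Gong Cheng, Bo Zheng, Dayiheng Liu, Jingren Zhou
Affiliations: Qwen Team; Nanjing University; Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: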
Abstract:Delta-rule linear attention improves recurrent memory updates by correcting what is already stored at the current write address before writing new content. However, the active correction is still anchored to that same write address. As a result, stale information stored at a different address cannot be actively removed before new content is written elsewhere. We propose Erase-then-Delta Attention (EDA), a memory update rule that decouples where to erase from where to write. The key insight is that recurrent memory models should not only correct the current write, but also selectively suppress outdated memory at an independently chosen address. Concretely, our method first applies a targeted erase step along a learned erase direction, and then performs the standard delta-style corrective write along the current write direction. This preserves the corrective behavior of delta-rule updates while expanding their memory-management capacity. Language-model pretraining experiments across dense 2.5B and MoE 25B-A2.8B model families show that EDA performs best in both settings. The gain persists after 80B-token long-context midtraining of the MoE models, where EDA also performs best in long-context evaluations from 4k to 128k contexts. A compact update analysis and memory-state probes suggest why: EDA keeps the delta-rule corrective write intact while allocating an additional cleanup path most strongly when passive decay is weak. These results suggest that recurrent memory models should decide not only what to write, but also what stale information to erase and where.
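The two-step update described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the memory is a plain matrix read as state @ key, the erase and write directions are unit-norm, and `alpha` and `beta` are scalar gates. The abstract does not give the exact parameterization, so the function names and the rank-one erase form are hypothetical.

```python
def read(state, direction):
    """Read the memory along a direction: returns state @ direction."""
    return [sum(s * d for s, d in zip(row, direction)) for row in state]


def rank_one_subtract(state, column, direction, gain):
    """Return state - gain * column * direction^T."""
    return [[s - gain * c * d for s, d in zip(row, direction)]
            for row, c in zip(state, column)]


def erase_then_delta(state, erase_dir, alpha, key, value, beta):
    """One memory update: erase at one address, then delta-write at another."""
    # Step 1 (assumed form): remove a fraction alpha of whatever is stored
    # along the erase direction, independently of the write address.
    stored = read(state, erase_dir)
    state = rank_one_subtract(state, stored, erase_dir, alpha)
    # Step 2: standard delta rule, correcting the prediction error at `key`.
    error = [p - v for p, v in zip(read(state, key), value)]
    return rank_one_subtract(state, error, key, beta)


if __name__ == "__main__":
    memory = [[1.0, 2.0], [3.0, 4.0]]
    erase_dir, key = [1.0, 0.0], [0.0, 1.0]   # orthogonal addresses
    updated = erase_then_delta(memory, erase_dir, 1.0, key, [5.0, 6.0], 1.0)
    print(read(updated, erase_dir))  # content at the erased address
    print(read(updated, key))        # content at the written address
```

With orthogonal unit directions and both gates set to 1, the erased address reads back as zero and the written address reads back the new value, which is the decoupling the abstract describes. A plain delta rule corresponds to `alpha = 0`.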
[NLP-51] Compiler-Driven Approximation Tuning for Hyperdimensional Computing
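[Quick Read]: As Moore's law approaches its physical and economic limits, this paper asks how domain-specific methods can accelerate machine learning workloads, focusing on the emerging paradigm of Hyperdimensional Computing (HDC). Conventional deep learning faces bottlenecks in energy efficiency and hardware fit, whereas HDC, rooted in cognitive models of computation, is inherently tolerant to noise and approximation and maps well to heterogeneous hardware (CPUs, GPUs, FPGAs) and emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). The difficulty is that the space of approximation strategies grows exponentially, so automatically finding and applying high-payoff, domain-specific approximations is hard. The proposed ApproxHDC framework builds on the HPVM-HDC compiler infrastructure to provide retargetable compilation across hardware backends (CPUs, GPUs, and simulated ReRAM- and PCM-based accelerators), and uses efficient search and analysis to explore approximation configurations across the software and hardware levels, aiming for large efficiency gains at a small accuracy cost.
Link: https://arxiv.org/abs/2606.26547
Authors: Xavier Routh, Abdul Rafae Noor, Akash Kothari, Zheyu Li, Mahbod Afarin, Tajana Rosing, Vikram Adve
Affiliations: University of Illinois Urbana-Champaign; University of California San Diego
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Comments: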
Abstract:As Moore’s law reaches its physical and economic limits, domain-specific approaches are increasingly employed to accelerate machine learning workloads. Hyperdimensional Computing (HDC) represents one such emerging paradigm, offering an alternative to conventional deep learning techniques. Rooted in cognitive models of computation, HDC is designed bottom-up with hardware efficiency as a first-class objective. HDC workloads map naturally to heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs, as well as emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). HDC algorithms are intrinsically tolerant to noise and approximation, enabling substantial performance gains with minimal accuracy loss. In this work, we introduce ApproxHDC, a framework for automated identification and application of domain-specific approximations in HDC workloads. ApproxHDC extends the HPVM-HDC compiler infrastructure to enable retargetable compilation across diverse hardware backends, including CPUs, GPUs, and simulated ReRAM and PCM-based accelerators. The space of possible approximations is exponentially large; ApproxHDC employs efficient search and analysis to navigate it and identify high-impact configurations spanning both software and hardware levels.
[NLP-52] DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models
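[Quick Read]: This paper addresses the weak performance of large language models on abstraction and reasoning tasks such as the Abstraction and Reasoning Corpus (ARC), where approaches built on open-source models generally give unsatisfactory results and those relying on closed-source models are too costly. The core issue is that existing methods train mainly on positive-sample supervision and lack the ability to tell incorrect reasoning apart, which limits generalization and reasoning depth. The key idea is to bring in preference alignment and construct informative negative samples that strengthen the model's ability to discriminate. Three constructions are proposed: output-level visual transformations, domain-specific language (DSL) rule inversion, and task-specific rule editing. The resulting negatives are near-miss wrong answers that leave the original demonstrations unchanged, helping the model learn to recognize error patterns. Experiments show that the proposed DiARC method consistently improves over baseline models on multiple ARC-like benchmarks.
Link: https://arxiv.org/abs/2606.26530
Authors: Yuxuan Yang, Feiyang Li, Yile Wang
Affiliation: Shenzhen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: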
Abstract:The Abstraction and Reasoning Corpus (ARC; Chollet, 2019) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches have attempted to transform it into a text-based reasoning task. However, methods based on open-source models have generally yielded unsatisfactory results, while those relying on closed-source models are too costly. Current efforts mainly focus on data augmentation, constructing ARC-like data for more comprehensive supervised fine-tuning. In this work, we argue that solving ARC-like problems requires not only positive sample supervision but also the ability to improve model reasoning by distinguishing negative samples. To this end, we draw on the idea of preference alignment and propose DiARC, a method that constructs preference pairs to enable the model to distinguish between them. Specifically, we propose three ways to construct negative samples, including output-level visual transformations, DSL-level rule inversion, and task-specific rule editing. The resulting negative samples provide informative near-miss alternatives while keeping the observed demonstrations unchanged. Experimental results across multiple ARC-like benchmarks show that DiARC consistently improves performance over baseline models. The code is released at this https URL.
[NLP-53] The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
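[Quick Read]: This paper addresses a basic flaw in current AI safety evaluation: evaluations measure how well a model detects hazards it is told to find, while accidents can come from co-present, safety-critical signals that nobody specified. The core finding is that conditioning a language or vision model on a narrow task systematically suppresses its reporting of other safety-relevant signals, a phenomenon analogous to human inattentional blindness but arising from a different mechanism. Across radiology and driving text scenarios and chest-radiograph vision tasks, every model tested showed suppression; it did not diminish with scale, persisted in a reasoning model, and varied more by model family than by parameter count. The same models reported these signals at substantially higher rates when unconstrained. The authors name this the Inattentional Gap and argue that it decouples benchmark safety from real-world safety: a model can score near-perfectly on standard evaluations while missing the unspecified hazards that actually cause harm. The proposed remedy is to redesign evaluation so that detection of co-present, unspecified, high-risk signals becomes a core metric.
Link: https://arxiv.org/abs/2606.26529
Authors: Kwan Soo Shin
Affiliation: PolymathMinds Lab (Seoul, South Korea)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures. Reproducibility deposit: this https URL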
Abstract:AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its reporting of co-present, safety-critical signals it can otherwise report, a machine analogue of human inattentional blindness arising from a different mechanism. Across radiology and driving text scenarios and chest-radiograph vision tasks, suppression appeared in every model tested, did not diminish with scale, persisted in a reasoning model, and varied more by model family than by size, while the same models reported these signals at substantially higher rates when unconstrained. We name this dissociation the Inattentional Gap and argue that it decouples measured benchmark safety from real-world safety: a system can score near-perfectly on the hazards an evaluation specifies while remaining blind to those that cause harm.
[NLP-54] Assessing Post-Reform Changes in Risk Disclosure Quality with a Multidimensional Text Analysis Approach
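[Quick Read]: This paper addresses the difficulty of comprehensively evaluating qualitative change in corporate narrative disclosures over time. Narrative text is multidimensional, so an improvement in one dimension often comes with changes in others, and single-indicator methods cannot expose these dynamics. The study proposes a longitudinal text analysis framework that combines Japanese-language NLP metric extraction with paired testing, shift function analysis, and inter-metric correlation analysis. It extends prior indicator sets with a cross-section relevance indicator that measures topical alignment between risk disclosures and management strategies. Applied to Japan's 2019 disclosure reforms, a joint analysis of 19,770 firm-year observations over ten years (FY2015-FY2024) shows complex shifts in disclosure patterns: disclosure volume rose substantially but readability declined; overall information structure improved but specific descriptive quality stagnated; and the degree of adaptation varied across market segments. The key to the approach is multidimensional analysis with cross-indicator correlation, which captures these changes systematically.
Link: https://arxiv.org/abs/2606.26522
Authors: Nobuhiro Aikawa, Mitsuo Yoshida
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: The 4th International Conference on Computational and Data Sciences in Economics and Finance (CDEF 2026)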
Abstract:While corporate narrative disclosures provide crucial information to capital markets, comprehensively evaluating their qualitative changes over time remains challenging. Narrative text is inherently multidimensional, meaning that an improvement in one textual dimension often occurs alongside changes in others. To capture these underlying dynamics, we propose a longitudinal text analysis approach combining Japanese-language NLP metric extraction with paired testing, shift function analysis, and inter-metric correlation. Our framework extends prior indicator sets by incorporating a cross-section relevance indicator to measure topical alignment between risk disclosures and management strategies. Applying this approach to evaluate Japan’s 2019 disclosure reforms, we analyze 19,770 firm-year observations over a 10-year period (FY2015-FY2024). The joint analysis reveals complex shifts in disclosure patterns that are frequently masked by conventional single-indicator methods. Specifically, we find that while disclosure volume increased substantially, it was accompanied by a decline in readability. Furthermore, although the overall information structure improved, specific descriptive quality stagnated, and the degree of adaptation varied across market segments.
[NLP-55] Temporal Validity in Retrieval Memory: Eliminating Stale-Fact Errors for AI Agents over Evolving Knowledge
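[Quick Read]: This paper addresses the absence of a time dimension in retrieval-augmented generation (RAG) systems facing evolving knowledge. When a fact changes (for example, a function is renamed or an API is restructured), RAG cannot tell stale from current information: it retrieves both the old and the new value with near-identical embedding similarity, and the agent either serves the stale fact or abstains. The problem is structural: on a calibrated dataset, cosine similarity separates a contradicted fact from a duplicated one with an AUROC of only 0.59 (near chance), because contradictions are often closer to the original embedding than rephrased duplicates. The paper proposes MemStrata, a retrieval memory that maintains temporal validity. Its core is a bi-temporal ledger with a deterministic (subject, relation, object) supersession rule that retires stale facts automatically, with no similarity threshold and no LLM call. Across six benchmarks run locally with a 7B model, MemStrata matches RAG on static knowledge and reaches 0.95-1.00 accuracy on evolving knowledge (versus 0.20-0.47 for RAG). It drives the stale-fact-error rate to about 0%, whereas RAG serves superseded values 15%-40% of the time. MemStrata does this at a retrieval latency of about 2.1 s, compared with about 16-18 s for LLM-reranking baselines. The authors also release the evaluation harness, datasets, and a marker-free evaluation protocol for studying memory under knowledge evolution.
Link: https://arxiv.org/abs/2606.26511
Authors: Neeraj Yadav
Affiliations: Called It Inc.; Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 21 pages, 5 tables. Code, prompts, and evaluation datasets included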
Abstract:Retrieval-augmented generation (RAG) gives agents access to accumulated knowledge, but has no model of time. When a fact changes (e.g., a function is renamed or API restructured), RAG retrieves both the stale and current value with near-identical embedding similarity. The agent then either abstains or serves the superseded fact. We show this is a structural problem: on a calibrated dataset, cosine similarity distinguishes a contradicted fact from a duplicated one with AUROC 0.59 (near chance), as contradictions are often more embedding-similar to the original than rephrased duplicates. We present MemStrata, a retrieval memory maintaining temporal validity. It stores facts like RAG, preserving static recall, but when a fact’s value is contradicted, a deterministic (subject, relation, object) supersession rule retires the stale value in a bi-temporal ledger - with no similarity threshold and no LLM call. Across six benchmarks run locally with a 7B model, MemStrata ties RAG on static knowledge and reaches 0.95-1.00 accuracy on evolving knowledge (where RAG reaches 0.20-0.47). The central result is the stale-fact-error rate: when required to answer, RAG serves superseded values 15-40% of the time; MemStrata drives this to ~0%, a failure class RAG cannot avoid. MemStrata achieves this at retrieval latency (~2.1s) versus ~16-18s for LLM-reranking baselines. We release the harness, datasets, and a marker-free evaluation protocol for memory under knowledge evolution.
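A deterministic supersession rule over a bi-temporal ledger can be sketched in a few lines. The sketch assumes that each (subject, relation) pair holds a single current object and that every record carries both a validity interval and a recording time; the class names, fields, and example facts are hypothetical and are not MemStrata's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FactRecord:
    subject: str
    relation: str
    obj: str
    valid_from: int            # when the fact became true
    valid_to: Optional[int]    # None while the fact is still current
    recorded_at: int           # when the ledger learned about it


class Ledger:
    """Append-only store: stale facts are retired, never deleted."""

    def __init__(self):
        self.records = []

    def assert_fact(self, subject, relation, obj, valid_from, recorded_at):
        for rec in self.records:
            same_slot = (rec.subject, rec.relation) == (subject, relation)
            if same_slot and rec.valid_to is None:
                if rec.obj == obj:
                    return              # duplicate of the current value
                rec.valid_to = valid_from   # contradicted: retire it
        self.records.append(
            FactRecord(subject, relation, obj, valid_from, None, recorded_at))

    def current(self, subject, relation):
        return [r.obj for r in self.records
                if (r.subject, r.relation) == (subject, relation)
                and r.valid_to is None]

    def as_of(self, subject, relation, time):
        return [r.obj for r in self.records
                if (r.subject, r.relation) == (subject, relation)
                and r.valid_from <= time
                and (r.valid_to is None or time < r.valid_to)]


if __name__ == "__main__":
    ledger = Ledger()
    ledger.assert_fact("config_loader", "function_name", "parse_config", 1, 1)
    ledger.assert_fact("config_loader", "function_name", "load_config", 5, 6)
    print(ledger.current("config_loader", "function_name"))
    print(ledger.as_of("config_loader", "function_name", 3))
```

The rule compares structured keys rather than embeddings, so a renamed function retires the old name even when the two strings are nearly identical in embedding space, and the retired record remains available for historical queries.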
[NLP-56] Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation
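[Quick Read]: This paper examines how large reasoning models (LRMs) and humans differ in decision behavior on hard problems, in particular how response time (or generated tokens) relates to problem difficulty and correctness. Across items, both LRMs and humans take longer on harder problems (cross-item alignment, or registration). With the item held fixed, however, they allocate resources between their own successful and failed trials in opposite ways (within-item allocation). The key move is to separate these two levels: how response time tracks difficulty across items (registration), and, with item identity fixed, whether an agent spends more on its failures or its successes (allocation). On a public matched human-LRM corpus, all five thinking LRMs spend markedly more tokens on wrong answers than right ones (Cohen's d = 1.47-3.13 on H-ARC), while humans show the opposite sign. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. The authors read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. They read the LRM pattern as uncertainty-driven length: reasoning chains grow when the model is unsure, which is exactly when it tends to fail. The two share a difficulty signal but implement opposite stopping policies, so they look aligned on the surface and diverge underneath. Trace length as a measure of difficulty therefore captures the shared signal but misses this difference in metacognitive control.
Link: https://arxiv.org/abs/2606.26502
Authors: Han-yu Wang
Affiliation: The University of Hong Kong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: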
Abstract:Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen’s d = 1.47-3.13 on H-ARC) while humans show the opposite sign. The comparison stays inside each agent’s own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixed effects, replicates across datasets, and is absent in a non-thinking baseline. We read the human pattern as engagement versus abandonment: people stay on items they expect to solve and give up on the rest. We read the LRM pattern as length driven by uncertainty: chains grow when the model is unsure, which is exactly when it tends to fail. Both policies produce the same cross-item correlation with difficulty, so they look aligned on the measure prior work has used; the divergence shows up only once item identity is fixed. Under resource-rational metareasoning, the split is between two stopping policies that share a difficulty signal but implement opposite control; trace length captures the signal and misses the control.
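The within-item allocation measure can be illustrated with a small sketch. It assumes that "item identity held fixed" is approximated by subtracting each item's mean cost before comparing wrong and right trials, and that the effect size is Cohen's d with a pooled standard deviation. The paper's exact estimator is not given in the abstract, and the function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def center_within_item(trials):
    """Subtract each item's mean cost; trials are (item, cost, correct)."""
    costs_by_item = {}
    for item, cost, _ in trials:
        costs_by_item.setdefault(item, []).append(cost)
    item_mean = {item: mean(costs) for item, costs in costs_by_item.items()}
    return [(item, cost - item_mean[item], correct)
            for item, cost, correct in trials]


def allocation_effect(trials):
    """Positive: more spent on failures. Negative: more spent on successes."""
    centered = center_within_item(trials)
    wrong = [cost for _, cost, correct in centered if not correct]
    right = [cost for _, cost, correct in centered if correct]
    return cohens_d(wrong, right)


if __name__ == "__main__":
    # Toy token counts: item "b" is harder (higher baseline) than item "a".
    trials = [("a", 10, True), ("a", 11, True), ("a", 14, False),
              ("b", 100, True), ("b", 108, False), ("b", 110, False)]
    print(allocation_effect(trials))
```

Centering removes the cross-item difficulty signal (registration), so the sign of the remaining effect reflects allocation only. Costs stay in each agent's own unit, in line with the abstract's point that seconds and tokens are never placed on one axis.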
[NLP-57] Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
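[Quick Read]: This paper addresses a performance bottleneck in existing diffusion language models, where a single network handles both context representation and iterative denoising and so has limited capacity for either. The proposed TwoTower framework decouples the two roles into two towers: a frozen autoregressive (AR) context tower that causally processes clean tokens, and a trainable diffusion denoiser tower that uses bidirectional block attention and cross-attention to the context tower to iteratively refine noisy blocks. Built on Nemotron-3-Nano-30B-A3B (an open-weight 30B hybrid Mamba-Transformer MoE model) and trained on about 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while delivering 2.42x higher wall-clock generation throughput.
Link: https://arxiv.org/abs/2606.26493
Authors: Fitsum Reda, John Kamalu, Roger Waleffe, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Code and model weights available at this https URL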
Abstract:Diffusion language models offer a promising alternative to autoregressive models due to their potential for parallel and iterative generation. However, existing approaches use a single network for both context representation and iterative denoising, forcing one model to serve both roles and limiting its capacity for either role. We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline’s quality while offering 2.42X higher wall-clock generation throughput. We release the code and model weights at this https URL.
[NLP-58] Comparing BERT Sentence-Pair Classification and Few-Shot LLM Prompting for Detecting Threat and Solution Framing in German Climate News
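[Quick Read]: This paper addresses automated detection of framing patterns in climate news, namely classifying sentences in German-language climate news articles as threat-oriented or solution-oriented so that large corpora can be analyzed efficiently. It systematically compares two approaches. The first uses few-shot prompting of an open-weights large language model (Llama 4 Maverick) with chain-of-thought reasoning and structured output with confidence scores. The second fine-tunes a German BERT model (deepset/gbert-large) as a sentence-pair classifier in which the preceding sentence supplies context for the target sentence. The fine-tuned BERT classifiers reach an F1 of 0.83 on both the threat and solution tasks, compared with 0.78 for the LLM, and an ablation confirms that adding the preceding sentence as context substantially improves classification. The results provide empirical evidence on fine-tuned encoder models versus prompted generative models for text classification in computational social science.
Link: https://arxiv.org/abs/2606.26489
Authors: Raven Adam, David Maier, Marie Kogler
Affiliations: University of Graz; Technical University of Graz
Categories: Computation and Language (cs.CL)
Comments: 15 pages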
Abstract:News media play a central role in shaping public perceptions of climate change, and whether coverage emphasizes threats or solutions has measurable effects on audience engagement and policy support. Automated detection of these framing patterns at the sentence level would allow researchers to analyze large corpora that are infeasible to code manually. We present a systematic comparison of two approaches for classifying sentences from German-language climate news articles as threat-oriented, solution-oriented, both, or neither. The first approach uses few-shot prompting with an open-weights large language model (Llama 4 Maverick), employing chain-of-thought reasoning and structured output with confidence scoring. The second approach fine-tunes a German BERT model (deepset/gbert-large) for sentence-pair classification, where the preceding sentence provides contextual information for the target sentence. Both approaches implement two independent binary classifiers, one for threat framing and one for solution framing. We evaluate both methods on a corpus of 440 Austrian newspaper articles that were manually coded following a detailed coding scheme developed with domain experts. The fine-tuned BERT classifiers achieve an F1 score of 0.83 for both the threat and solution tasks, while the LLM-based classifiers reach an F1 of 0.78. An ablation study confirms that providing the preceding sentence as context improves BERT classification performance substantially compared to single-sentence input. These results contribute to the growing body of work comparing fine-tuned encoder models with prompted generative models for text classification in computational social science.
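The sentence-pair setup and the reported metric can be sketched without any model code. The sketch assumes that the first sentence of an article receives an empty context and that threat and solution framing are scored as two separate binary tasks with the standard F1 definition. The helper names and the German example sentences are hypothetical and do not come from the paper's corpus.

```python
def build_sentence_pairs(sentences):
    """Pair each target sentence with the sentence that precedes it."""
    pairs = []
    for index, target in enumerate(sentences):
        # Assumption: the first sentence has no context, so use "".
        context = sentences[index - 1] if index > 0 else ""
        pairs.append((context, target))
    return pairs


def f1_score(gold, predicted):
    """Binary F1 for one framing task (labels are 0 or 1)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    article = ["Die Gletscher schmelzen.",
               "Das bedroht die Wasserversorgung.",
               "Neue Speicher sollen helfen."]
    for context, target in build_sentence_pairs(article):
        print(repr(context), "->", repr(target))
    # Two independent binary tasks, scored separately.
    print(f1_score([0, 1, 0], [0, 1, 1]))   # threat task, toy labels
    print(f1_score([0, 0, 1], [0, 0, 1]))   # solution task, toy labels
```

Each (context, target) pair would be fed to a sentence-pair encoder as its two input segments. Keeping the two framing tasks separate lets a sentence be labeled as both, or neither, which matches the four outcomes listed in the abstract.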
[NLP-59] Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting IJCAI2026
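[Quick Read]: This paper addresses a mismatch in context-aware time series forecasting with large language models (LLMs): discrete, language-oriented tokenization and embedding interfaces fit continuous numerical data poorly, which harms numerical ordering and forecasting reliability. The proposed solution is TempoWave, a plug-and-play temporal wavelet digit interface that maps each scalar observation into digit-wise embeddings built from multi-wavelet, multi-scale coefficients, exposing both fine-grained local fluctuations and macro global structure. It directly overrides standard token representations in a transformer-compatible form while maintaining precise numerical formatting, distinct digit identity, and robustness to common normalization operations. Across five context-enriched forecasting benchmarks, TempoWave consistently improves LLM-based forecasters over standard numeric tokenization and alternative embedding interfaces and sets a new state of the art. The results identify the numeric interface as a key bottleneck and suggest that principled multi-resolution embeddings can better couple LLMs' contextual reasoning with precise forecasting.
Link: https://arxiv.org/abs/2606.26487
Authors: Defu Cao, Zijie Lei, Muyan Weng, Jiao Sun, Yan Liu
Affiliations: University of Southern California; Meta; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Camera Ready version of IJCAI 2026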
Abstract:Large language models (LLMs) are attractive for context-aware time series forecasting because they can integrate heterogeneous textual signals, yet their discrete, language-oriented tokenization and embedding interfaces are misaligned with continuous numerical values, often harming numerical ordering and forecasting reliability. We propose TempoWave, a plug-and-play temporal wavelet digit interface that maps each scalar observation into digit-wise embeddings constructed from multi-wavelet, multi-scale coefficients. By directly overriding standard token representations, TempoWave seamlessly exposes both fine-grained local fluctuations and macro global structures in a transformer-compatible form, ensuring that precise numerical formatting, distinct digit identity, and robustness to common normalization operations are maintained throughout the LLM pipeline. Experiments across five context-enriched forecasting benchmarks demonstrate that TempoWave consistently improves LLM-based forecasters over standard numeric tokenization and alternative embedding interfaces, achieving a new state-of-the-art. These results highlight the numeric interface as a key bottleneck and suggest that principled multi-resolution embeddings can better couple LLMs’ contextual reasoning with precise forecasting. Our code is available at this https URL and our model can be accessed at this https URL.
[NLP-60] Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents
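[Quick Read]: This paper addresses the security of tool-using large language model (LLM) agents against indirect prompt injection. Defenses that rely on in-band detection and refusal inside the model tend to fail against adaptive, defense-aware attacks. The alternative examined here is out-of-band enforcement: a deterministic external policy (capabilities, information-flow labels, reference monitors) mediates the agent's actions instead of trusting the model to decide whether to follow a malicious instruction. Systems such as CaMeL, FIDES, Progent, RTBAS, and FORGE follow this approach, and several report near-elimination of attacks on the static AgentDojo benchmark. The paper organizes these defenses as instances of classical integrity protection (Biba), reference monitoring, and least privilege, giving a structured comparison of what they do and do not cover. It also points out that all of them are validated only on static benchmarks, and it specifies the threat model and protocol that an adaptive evaluation requires. Running that protocol as an independent reproduction and extension of Progent's analysis, with the open-weight Qwen2.5-7B self-hosted on a single H200 (a setting the original authors did not test), the defense held: mean attack success fell from 25.8% to 4.2%, and a hand-crafted adaptive attack did not raise it (2.6%). This is one small-scale data point on a weak model, and white-box optimized attacks such as GCG remain untested. The result is consistent with, but does not establish, the hypothesis that deterministic out-of-band enforcement is harder for an adaptive attacker to break than in-band detection.
Link: https://arxiv.org/abs/2606.26479
Authors: Praneeth Narisetty, Shiva Nagendra Babu Kore, Uday Kumar Reddy Kattamanchi, Jayaram Kumarapu
Affiliation: LaunchSafe Research
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 4 tables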
Abstract:Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model with a deterministic policy that mediates the agent’s actions. Systems such as CaMeL, FIDES, Progent, RTBAS, and FORGE realize this with capabilities, information-flow labels, and reference monitors, and several report near-elimination of attacks on the AgentDojo benchmark. We make two contributions. First, we organize these out-of-band defenses as instances of classical integrity protection (Biba), reference monitoring, and least privilege, yielding a structured comparison of what they do and do not cover. Second, we warn that every one of them is validated only on static benchmarks (a fixed set of injection attempts), the same methodology that made in-band defenses look strong until adaptive, defense-aware attacks broke twelve of them at over 90% success; we specify the threat model and protocol an adaptive evaluation requires. We then run that protocol as an independent reproduction and extension of Progent’s own adaptive-attack analysis, on AgentDojo, with an open-weight agent (Qwen2.5-7B) self-hosted on a single H200, a setting its authors did not test. Averaged over three runs, the defense held: Progent cut mean attack success roughly sixfold (25.8% to 4.2%), and a hand-crafted adaptive attack did not raise it (2.6%). This is one small-scale data point on a weak model with a single black-box attack template; a stronger optimized (white-box GCG) attack remains open. The result is consistent with, but does not establish, the hypothesis that deterministic out-of-band enforcement is a harder target for an adaptive attacker than in-band detection.
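As a rough illustration of the out-of-band idea described in this abstract, the sketch below places a deterministic, default-deny policy between an agent and its tools. The `ToolCall` type, the `tainted` flag, the tool names, and the amount limit are all hypothetical; none of them is taken from CaMeL, FIDES, Progent, RTBAS, or FORGE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str      # name of the tool the agent wants to invoke
    args: dict     # arguments proposed by the model
    tainted: bool  # True if any argument derives from untrusted tool output


# Hypothetical least-privilege policy: tool name -> predicate over the call.
POLICY = {
    "read_email": lambda call: True,
    # Payments are allowed only for untainted calls below an assumed limit.
    "send_money": lambda call: (not call.tainted)
    and call.args.get("amount", 0) <= 100,
}


def reference_monitor(call: ToolCall) -> bool:
    """Allow a call only if an explicit rule permits it (default deny)."""
    rule = POLICY.get(call.tool)
    return bool(rule and rule(call))
```

The decision never consults the model, so an injected instruction can change what the agent proposes but not what the monitor allows. This is the property the abstract contrasts with in-band detection.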
[NLP-61] Epiphany-Aware KV Cache Eviction Without the Attention Matrix
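[Summary]: This paper targets the deployment bottleneck created by the KV cache as reasoning models emit chains of thought tens of thousands of tokens long. Existing eviction methods rank tokens by attention weight, which is a noisy importance signal on long reasoning traces and requires materializing the attention matrix, which rules out the fused kernels used in production inference. The paper proposes the epiphany score, which measures token importance from the change in the model's internal representation, read directly from the forward pass with no attention matrix and negligible extra state. The resulting eviction method, EpiKV, needs no training, classifier, or custom kernel and drops into FlashAttention inference stacks unchanged, scaling to a 16x longer feasible context than attention-based scoring. At a 4096-token cache, EpiKV reaches 72% on MATH-500, matching the strongest attention-based baselines (ThinKV 71%, H2O 67%); a lag-normalized variant reaches 37% on AIME-2024 at 8192 tokens versus 33% for the best baseline, at up to 2.8x the speed. The key idea is to use changes in internal representations as a more stable and cheaper eviction signal while staying compatible with existing efficient inference stacks.
Link: https://arxiv.org/abs/2606.26472
Authors: Steven Kolawole, Virginia Smith
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint; in review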
Abstract:As reasoning models emit chains of thought tens of thousands of tokens long, the KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance proxy in long reasoning traces and which prohibits the use of fused kernels in production inference by forcing the model to materialize the attention matrix. In this work, we instead score tokens with a metric we term the epiphany score: the change in the model’s internal representation, read directly from the forward pass with no attention matrix and negligible extra state. Our resulting cache eviction method, EpiKV, requires no training, classifier, or custom kernel, and can be used directly in FlashAttention inference stacks unchanged – scaling to a 16x longer feasible context than attention-based scoring. [...] upper-mid layers negatively) and remove a positional trend with a causal rolling z-score. At a 4096-token cache EpiKV reaches 72% on MATH-500, matching the strongest attention-based baseline (ThinKV 71%, H2O 67%); a lag-normalized KV variant reaches 37% on AIME-2024 at 8192 tokens against the best of them (33%), at up to 2.8x the speed.
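The abstract names two ingredients, a score based on how much the internal representation changes and a causal rolling z-score, without giving formulas. The sketch below is one possible reading, not EpiKV itself. It assumes the change is the L2 distance between consecutive hidden states, and that tokens with the highest normalized scores are kept. The function names, the window size, and the keep-top-k rule are all illustrative assumptions.

```python
import math


def change_scores(hidden):
    """L2 distance between consecutive hidden states (first token scores 0).

    ASSUMPTION: the abstract does not say how 'change' is measured.
    """
    scores = [0.0]
    for prev, cur in zip(hidden, hidden[1:]):
        scores.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    return scores


def causal_rolling_zscore(xs, window):
    """Standardize each value against the window ending at it (no future values)."""
    out = []
    for t, x in enumerate(xs):
        past = xs[max(0, t - window + 1): t + 1]
        mean = sum(past) / len(past)
        var = sum((p - mean) ** 2 for p in past) / len(past)
        out.append((x - mean) / math.sqrt(var) if var > 0 else 0.0)
    return out


def keep_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)
```

Because each z-score looks only at the current and earlier positions, it can be computed during decoding without revisiting the sequence, and nothing in it touches the attention matrix.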
[NLP-62] Soft Token Alignment for Cross-Lingual Reasoning
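[Summary]: This paper addresses the inconsistent reasoning and answers that multilingual LLMs produce for semantically equivalent prompts in different languages. Intermediate representations can be relatively language-agnostic, but generation becomes language-specific once the model commits to discrete output tokens, and language-specific lexical choices then cause equivalent reasoning paths to diverge. The paper proposes SOLAR (Soft-Token Alignment for Reasoning), an auxiliary objective for supervised fine-tuning that aligns the soft-token representations of non-English languages to their English counterparts, with English as the pivot. Soft tokens are probability-weighted mixtures of vocabulary embeddings, so they aggregate information from semantically related tokens across languages without depending on a single vocabulary item or script. Across four multilingual reasoning benchmarks, SOLAR improves accuracy by up to 17.7 points over the base model and 3.8 points over standard supervised fine-tuning, with the largest gains on low-resource languages. It also strengthens final-layer cross-lingual similarity and substantially reduces language-cluster separability, suggesting that soft-token alignment helps preserve shared semantic structure during multilingual reasoning.
Link: https://arxiv.org/abs/2606.26466
Authors: Jiayi He, Jungsoo Park, Wei Xu, Alan Ritter
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: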
Abstract:Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively language-agnostic, but generation becomes increasingly language-specific as models commit to discrete output tokens. This is problematic because language-specific lexical choices can cause semantically equivalent reasoning paths to diverge across languages. These divergences motivate searching for a cross-lingual alignment signal that is less tied to any single vocabulary item or script. We propose SOLAR, an auxiliary objective for supervised fine-tuning that aligns soft-token representations across languages, using English as a pivot. Soft tokens are probability-weighted mixtures over the vocabulary embeddings, yielding continuous representations that can aggregate information from semantically related tokens across languages. We then align each non-English soft-token summary to its English counterpart in the shared embedding space. Across four multilingual reasoning benchmarks, SOLAR improves accuracy by up to +17.7 points over the base model and +3.8 over standard supervised fine-tuning, with the largest gains on low-resource languages. SOLAR also strengthens final-layer cross-lingual similarity and substantially reduces language-cluster separability, suggesting that aligning soft-token representations helps preserve shared semantic structure during multilingual reasoning.
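The soft-token definition in this abstract (a probability-weighted mixture over vocabulary embeddings) can be written down directly. The sketch below does that with plain Python lists. The alignment term shown, cosine distance between a non-English soft token and its English counterpart, is an assumption for illustration: the abstract does not state which distance SOLAR uses or how soft tokens are summarized over a sequence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embeddings):
    """Probability-weighted mixture of vocabulary embeddings.

    embeddings[v] is the embedding vector of vocabulary item v.
    """
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; used here as a stand-in alignment loss (ASSUMED)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

Unlike an argmax token, the mixture changes smoothly with the output distribution, so two languages that spread probability over different but related tokens can still end up with nearby soft tokens.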
[NLP-63] AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification INTERSPEECH2026
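[Summary]: This paper addresses the memory cost of deploying many specialized lightweight models on edge devices such as smartphones, where on-device inference is preferred for privacy and latency. It asks whether a single lightweight architecture can handle multiple Speech-Adjacent (SA) classification tasks by reducing them to a nuanced text-similarity formulation. The proposed AnySimLite is a lightweight similarity encoder that combines word-level and character-level channels and, together with a dataset transformation strategy, handles multiple SA tasks in few-shot settings. Experiments show state-of-the-art (SOTA) or SOTA-competitive performance with a low memory footprint: even in the worst case the performance drop stays below 7%, at 1/250 of the model size of the SOTA qLLaMA_LoRA-7B baseline.
Link: https://arxiv.org/abs/2606.26452
Authors: Sourav Ghosh, Yash Bhatia, Keshav Goyal, Sahil Singh Bagri, Mohamed Akram Ulla Shariff, Saravana Balaji Shanmugam
Affiliations: Samsung R&D Institute Bangalore, India
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at Interspeech 2026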
Abstract:To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but deploying multiple specialized models creates a memory footprint challenge. We investigate: Can a single lightweight architecture solve multiple Speech-Adjacent (SA) classification tasks through reduction to a nuanced text similarity formulation? We propose AnySimLite, a lightweight similarity encoder that combines word-level and character-level channels. Together with a dataset transformation strategy, we evaluate AnySimLite across multiple SA classification tasks and show that it consistently achieves state-of-the-art (SOTA) or SOTA-competitive performance in few-shot settings while maintaining a low memory footprint. Even in the worst case, the performance drop remains below 7% while using $\frac{1}{250}^{\mathrm{th}}$ of the model size of the SOTA qLLaMA_LoRA-7B baseline.
[NLP-64] ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
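[Summary]: This paper addresses a gap in existing factuality and faithfulness metrics, which judge whether an answer is supported or contradicted by its grounding documents but miss cases where supporting and contradicting evidence coexist. It introduces ConflictScore, a metric for how well a model's response acknowledges conflicting evidence. The framework decomposes a response into atomic claims, labels each claim against each grounding document, and aggregates the labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. The authors also build ConflictBench, a benchmark covering ambiguity, contradiction, and divergent opinions, to evaluate the metric systematically. Experiments show that ConflictScore detects overconfident claims across domains and can serve as corrective feedback that improves truthfulness on TruthfulQA.
Link: https://arxiv.org/abs/2606.26437
Authors: Siyi Liu, Aaron Halfaker, Dan Roth, Patrick Xia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: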
Abstract:Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model’s response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.
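To make the aggregation step concrete, the sketch below computes a CS-C-style count from per-claim, per-document labels. The label strings and function names are hypothetical. `balance_ratio` is only an illustrative stand-in for CS-R: the abstract describes CS-R as "the balance between supporting and contradicting evidence" without giving its formula, so the min/max ratio used here is an assumption.

```python
def conflict_score_count(claim_labels):
    """Fraction of claims that have both supporting and contradicting documents.

    claim_labels[i] is the list of labels for claim i, one per grounding
    document, drawn from {"support", "contradict", "neutral"} (assumed names).
    """
    if not claim_labels:
        return 0.0
    conflicted = sum(
        1 for labels in claim_labels
        if "support" in labels and "contradict" in labels
    )
    return conflicted / len(claim_labels)


def balance_ratio(labels):
    """Illustrative balance in [0, 1]; 1 means equal support and contradiction.

    ASSUMED formula, not the paper's definition of CS-R.
    """
    s, c = labels.count("support"), labels.count("contradict")
    return min(s, c) / max(s, c) if max(s, c) else 0.0
```

For example, a response with four claims, two of which are both supported by one document and contradicted by another, gets a count score of 0.5.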
[NLP-65] DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
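[Summary]: This paper addresses the disconnect between two signals used in LLM evaluation: static benchmarks with objective correctness labels, and arena-style preference data that better reflect open-ended user interactions. It proposes DualEval, a latent model-item calibration framework that places models and evaluation items in a shared space and jointly estimates model ability together with item difficulty and sharpness. The framework is applied to 18 frontier LLMs across four domains (coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries), using static benchmark labels and reward-model scores validated against held-out human preferences. It produces reliable and balanced model rankings, and the learned item-level profiles support downstream uses such as benchmark compression for sample-efficient evaluation and anomaly detection for contamination or outlier analysis. Overall, DualEval unifies static and arena-style evaluation through joint calibration, supporting more sample-efficient, interpretable, and auditable evaluation pipelines.
Link: https://arxiv.org/abs/2606.26429
Authors: Aaron J. Li, Hao Huang, Youngmin Park, Yitong Ma, Wei-Lin Chiang, Li Chen, Cho-Jui Hsieh, Bin Yu, Ion Stoica
Affiliations: University of California, Berkeley; Arena; University of California, Los Angeles
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: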
Abstract:Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framework that represents models and evaluation items in a shared space, jointly estimating model ability together with item difficulty and sharpness. We apply DualEval across four domains: coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries. Our evaluation uses 18 frontier LLMs, static benchmark labels, and reward-model scores validated against held-out human preferences for open-ended model responses. Empirically, our framework produces reliable and balanced model rankings, and its learned item-level profiles support downstream applications such as benchmark compression for sample-efficient evaluation and anomaly detection for contamination or outlier analysis. Overall, DualEval unifies static and arena-style evaluation through joint model-item calibration, producing model rankings and item-level diagnostics that support more sample-efficient, interpretable, and auditable evaluation pipelines.
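The abstract does not give DualEval's functional form. One standard model with exactly these three kinds of parameters (ability per model, difficulty and sharpness per item) is the two-parameter logistic model from item response theory. The sketch below uses that model as a stand-in, so it should be read as an assumption about the general shape and not as the paper's formulation.

```python
import math


def p_success(ability, difficulty, sharpness):
    """2PL-style probability that a model succeeds on an item (assumed form)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (ability - difficulty)))


def log_likelihood(outcomes, abilities, difficulties, sharpness):
    """Joint log-likelihood of binary outcomes[i][j] for model i on item j.

    Maximizing this over all three parameter sets at once is what makes the
    calibration 'joint': items are scored by how well they separate models,
    and models are scored on items of known difficulty.
    """
    total = 0.0
    for i, row in enumerate(outcomes):
        for j, y in enumerate(row):
            p = p_success(abilities[i], difficulties[j], sharpness[j])
            total += math.log(p) if y else math.log(1.0 - p)
    return total
```

In this parameterization a sharp item (large `sharpness`) flips from mostly failed to mostly solved over a narrow ability range, which is what makes such items useful for a compressed benchmark.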
[NLP-66] ProfileFoundry: A Synthetic Person-Object Substrate for Privacy Memory and Tool-Use Evaluation in LLM Agent
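[Summary]: This paper addresses the tension between foundation-model research that needs data about people (user state, histories, relationships, documents, longitudinal updates) and the difficulty of sharing, perturbing, auditing, or redistributing real user data responsibly, while independently generated fake fields rarely preserve the cross-field and temporal consistency needed for controlled evaluation. It presents PROFILEFOUNDRY, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales. Each object combines a typed current snapshot; household, family, and employer links; snapshot-aligned events; normalized relational views; and generation provenance. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges. Evidence is reported in separate categories: selected population-marginal comparisons, per-object invariant checks, release-wide referential and temporal closure, and coincidence/provenance screens. PROFILEFOUNDRY is not a population-fidelity model, a rendered-text corpus, or a formal privacy mechanism; it is a responsible synthetic source layer for building downstream evaluations of memory, privacy, document understanding, record linkage, and agent state, while keeping the synthetic person behind each artifact inspectable.
Link: https://arxiv.org/abs/2606.26403
Authors: Sriram Selvam, Anneswa Ghosh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: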
Abstract:Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or redistribute responsibly, while independently generated fake fields rarely preserve the cross-field and temporal consistency needed for controlled evaluation. We present PROFILEFOUNDRY, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales. Each object combines a typed current snapshot, household, family, and employer links, snapshot-aligned events, normalized relational views, and generation provenance. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges. We report evidence in separate categories: selected population-marginal comparisons, per-object invariant checks, release-wide referential and temporal closure, and coincidence/provenance screens. PROFILEFOUNDRY is not a population-fidelity model, a rendered-text corpus, or a formal privacy mechanism. Instead, it is a responsible synthetic source layer for constructing downstream foundation-model evaluations involving memory, privacy, document understanding, record linkage, and agent state while keeping the synthetic person behind each artifact inspectable.
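[NLP-67] Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs ECCV2026
[Summary]: This paper addresses hallucinations in multimodal large language models (MLLMs) that contradict the visual input during joint image-text reasoning. The underlying problem is visual laziness: the model encodes the correct visual evidence internally but relies too heavily on strong language priors when responding. Existing alignment methods such as direct preference optimization mainly optimize text-based outcome-level rewards, which biases optimization toward linguistic shortcuts and worsens hallucination. The paper proposes Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the objective from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response, and it penalizes “blind confidence”, where the model stays improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Experiments show that VIGIL consistently outperforms recent alignment methods on hallucination and reasoning benchmarks, matches the full-data performance of state-of-the-art methods with only 25% of the preference data, and shows emergent spatial grounding without explicit bounding-box supervision.
Link: https://arxiv.org/abs/2606.26387
Authors: Xi Xiao, Chen Liu, Chih-Ting Liao, Yunbei Zhang, Qizhen Lan, Yuxiang Wei, Lin Zhao, Janet Wang, Jianyang Gu, Muchao Ye, Tianyang Wang, Hao Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ECCV 2026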
Abstract:Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual inputs. Mechanistic studies indicate that this weakness stems from visual laziness: MLLMs encode the correct visual evidence internally, but overly rely on strong language priors during response. Existing alignment methods, such as direct preference optimization, primarily optimize outcome-level rewards based on text. This introduces an optimization bias toward linguistic shortcuts, leading to responses that often contradict the visual evidence. To address this, we propose Visual Information Gain In aLignment (VIGIL), a reinforcement-learning (RL) post-training framework that shifts the focus from numerical reward fitting to causal visual grounding. VIGIL introduces a geometric constraint that explicitly maximizes the mutual information between the visual input and the generated response. We achieve this by penalizing “blind confidence” instances where the model remains improperly certain even when textual-visual attention is masked to create a counterfactual blind state. Extensive experiments show that VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks without compromising text-only capabilities. Our approach matches the full-data performance of state-of-the-art methods using only 25% of the preference data and even demonstrates emergent spatial grounding capabilities without explicit bounding box supervision.
[NLP-68] Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models ACL2026
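[Summary]: This paper addresses two failure modes of standard chain-of-thought (CoT) on moral dilemmas: stakeholder collapse, where the trace names at most one party with a stake in the outcome, and uncertainty suppression, where no unknowns or hedges are stated before committing to an action. It proposes narration-of-thought (NoT), a system prompt that structures CoT into five sections: protagonist, stakeholders, two-step consequences, uncertainty, and commitment. NoT adds no training, parameters, or fine-tuning. On 100 DailyDilemmas scenarios across four generators from three vendors, NoT cuts stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24%. Against a matched-budget verbose-CoT control, NoT retains Cliff's delta advantages of +0.79 to +0.90 on stakeholder count and +0.65 to +0.93 on uncertainty score for three of four generators, and a section ablation attributes each shift to its specific sub-instruction. Textual-gradient descent initialised at NoT improves the scaffold further, and a cross-family training judge (from a different vendor than the generator) outperforms an in-family one on every measured axis. Extended to a five-round multi-stakeholder debate protocol, the scaffold converts a 6% standoff into 95% full consensus on a calibration set and 100% combined convergence on a DailyDilemmas replication. The resulting traces externalise the stakeholders, consequences, and uncertainty behind each commitment, giving an auditable substrate for dependable agentic deployment.
Link: https://arxiv.org/abs/2606.26366
Authors: Patrick Cooper, Alvaro Velasquez
Affiliations: University of Colorado Boulder
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 24 pages, 8 figures, 16 tables. To appear at ACL 2026 (submitted via ARR)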
Abstract:Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before committing to an action). We introduce narration-of-thought (NoT), a system prompt that structures chain-of-thought into five sections: protagonist, stakeholders, two-step consequences, uncertainty, then commitment. NoT adds no training, parameters, or fine-tuning. On 100 DailyDilemmas scenarios across four generators from three vendors, NoT cuts stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24% on every model. A matched-budget verbose-CoT control rules out token spend as the active ingredient; NoT retains Cliff’s delta advantages of +0.79 to +0.90 on stakeholder count and +0.65 to +0.93 on uncertainty score for three of four generators, and a section ablation attributes each shift to its specific sub-instruction. Textual-gradient descent initialised at NoT improves the scaffold further; a cross-family training judge (different vendor from the generator) dominates an in-family one on every measured axis. Extended to a five-round multi-stakeholder debate protocol, the scaffold converts a 6% standoff into 95% full consensus on a calibration set and 100% combined convergence on a DailyDilemmas replication. The resulting traces externalise the stakeholders, consequences, and uncertainty grounding each commitment, providing an auditable substrate for dependable agentic deployment.
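The effect sizes in this abstract are Cliff's delta values. For readers unfamiliar with the statistic, the sketch below computes it in its standard form: the share of cross-group pairs where the first group's value is larger, minus the share where it is smaller. The function name and the example numbers are illustrative and are not data from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]: P(x > y) - P(x < y) over all (x, y) pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```

A value of +1 means every observation in the first group exceeds every observation in the second, so a delta near +0.8 or +0.9 indicates that the two distributions barely overlap.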
[NLP-69] Phonetic and semantic analyses of spoken corpora of Beijing and Taiwan Mandarin indicate that the neutral tone is a lexical tone
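[Summary]: This paper examines the nature of the neutral tone in Mandarin, in particular whether it has its own tonal target and how its realization differs between varieties. The traditional view treats the neutral tone as reduced or toneless, with no fixed pitch specification. Using spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin, the paper argues that the neutral tone has its own tonal target, just like the four lexical tones. First, disyllabic words with a neutral tone have pitch contours with a component that depends on the tone of the first syllable, the same pattern seen in disyllabic words with a lexical tone on the second syllable. Second, neutral-tone words have word-specific pitch signatures that are partly predictable from contextualized embeddings, indicating lexical stability. The study also finds differences between the two varieties in how the neutral tone is realized, some of which may arise from differences in word meaning as used in the two corpora, supporting the claim that the neutral tone is a lexical tone in both varieties.
Link: https://arxiv.org/abs/2606.26360
Authors: Yuxin Lu, Zhexuan Li, R. Harald Baayen
Affiliations: Eberhard Karls Universität Tübingen
Categories: Computation and Language (cs.CL)
Comments: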
Abstract:The neutral, or floating, tone of Mandarin Chinese is a tone with an enigmatic set of properties. It has been described as a reduced tone, or as a tone that sometimes is lexically fixed but that can also be toneless. In two-syllable words, it is found only on the second syllable, but single-syllable words can also have the neutral tone. We present a corpus-based study of the phonetic realization of the neutral tone in spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin. We show that the neutral tone has its own tonal target, just as the four lexical tones of Mandarin. We also show that disyllabic words with a neutral tone have pitch contours that have a pitch component that depends on the tone on the first syllable, just as has been observed for two-syllable words with a lexical tone on the second syllable (Chuang et al., 2026). Furthermore, words with a floating tone have word-specific pitch signatures, which have also been documented for single-syllable words (Jin et al., 2026) as well as two-syllable words (Lu et al., 2026b). These word-specific pitch signatures are shown to be predictable to some extent from words’ contextualized embeddings, as previously reported for lexical tones (Chuang et al., 2026; Lu et al., 2026b). As there is also considerable variability in the realization of lexical tones, we propose that the neutral tone is, in fact, a lexical tone in both Taiwan Mandarin and Beijing Mandarin. We document both similarities and differences in the realization of the floating tone in these two varieties and provide evidence, using contextualized embeddings, that some of the observed differences may arise from differences in the meanings of the words as used in the two corpora.
[NLP-70] Axon: A Synthesizing Superoptimizer for Tensor Programs
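[Summary]: This paper addresses the burden of writing high-performance tensor kernels for AI accelerators, which demands deep expertise in tiling, instruction selection, data layout, and operator fusion. It presents Axon, a synthesizing superoptimizer for tensor programs. Axon uses program synthesis to generate target instructions automatically from semantic specifications and explores semantically equivalent program variants to select the best-performing kernel empirically. It discovers algebraic transformations by propagating operators through computation graphs and uses SMT over unbounded tensors to guarantee that every transformation preserves semantics, without hand-crafted rewrite rules. It then lowers tensor operations to target ISA instructions, explores tiling configurations constrained by hardware descriptions, and fuses operators and instructions to minimize memory traffic.
Link: https://arxiv.org/abs/2606.26344
Authors: Akash Kothari, Shaowei Zhu, Daniel Kroening, Chungha Sung
Affiliations: University of Illinois at Urbana-Champaign; Amazon Web Services
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Comments: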
Abstract:Writing high-performance kernels for AI accelerators requires deep expertise in tiling, instruction selection, data layout, and operator fusion, placing a significant burden on programmers. In this paper, we focus on tile-based AI accelerator programs and present Axon, a synthesizing superoptimizer for tensor programs: it uses program synthesis to automatically generate target instructions from semantic specifications, and explores semantically equivalent program variants to select the best-performing kernel empirically. Axon discovers algebraic transformations by propagating operators through computation graphs and uses SMT over unbounded tensors to guarantee that all transformations preserve semantics without requiring hand-crafted rewrite rules. It then lowers tensor operations to target ISA instructions, explores tiling configurations constrained by hardware descriptions, and fuses operators and instructions to minimize memory traffic.
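[NLP-71] The Verification Horizon: No Silver Bullet for Coding Agent Rewards
[Summary]: This paper argues that verification is becoming the bottleneck for coding agents as their capabilities grow. With stronger reasoning in foundation models and more sophisticated engineering harnesses, generating complex candidate solutions is no longer the hard part; reliably verifying that they match human intent is. Two difficulties stand out: human intent is inherently underspecified, so it is hard to check faithfully whether it has been fulfilled, and during training, optimization widens the gap between the verification proxy and the true intent, which shows up as reward hacking or signal saturation. The paper characterizes verification-signal quality along three dimensions (scalability, faithfulness, and robustness) and argues that achieving all three at once is the central challenge. It studies four reward constructions: a test verifier for general coding tasks, a rubric verifier for frontend tasks, the user as verifier for real-world agent tasks, and an automated agent verifier for long-horizon tasks. Experiments show that targeted verification design can suppress reward hacking, improve task completion quality, and yield significant gains on multiple internal and public benchmarks. The core conclusion is that no fixed reward function stays effective as policy capability grows, so verification must co-evolve with the generator.
Link: https://arxiv.org/abs/2606.26300
Authors: Binghai Wang, Chenlong Zhang, Dayiheng Liu, Jiajun Zhang, Jiawei Chen, Mouxiang Chen, Rongyao Fang, Siyuan Zhang, Xuwu Wang, Yuheng Jing, Zeyao Ma, Zeyu Cui
Affiliations: Alibaba Cloud
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Authors are listed alphabetically by their first names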
Abstract:A classical intuition holds that verifying a solution is easier than producing one. For today’s coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering harnesses grow more sophisticated, generating complex candidate solutions is no longer difficult – reliably verifying them has become the harder problem. Every verifier we can build is only a proxy for human intent, never the intent itself. This makes verification subject to a twofold difficulty: first, intent is underspecified by nature, making it inherently hard to faithfully check whether it has been fulfilled; second, during model training, optimization widens the gap between proxy and intent – manifesting as reward hacking or signal saturation. To address this, we characterize the quality of verification signals along three dimensions – scalability, faithfulness, and robustness – and argue that achieving all three simultaneously is the central challenge. We further study four reward constructions: a test verifier for general coding tasks, a rubric verifier for frontend tasks, the user as verifier for real-world agent tasks, and an automated agent verifier for long-horizon tasks. Across different task types and policy capability levels, we conduct in-depth analysis and experiments on the core challenges of reward design and how to more effectively leverage reward signals. Experiments show that targeted verification design can effectively suppress reward hacking, improve task completion quality, and achieve significant gains across multiple internal and public benchmarks. These experiences collectively point to a core observation: no fixed reward function can remain effective as policy capability continues to grow; and verification must co-evolve with the generator.
[NLP-72] From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
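[Summary]: This paper addresses the lack of surveys that examine perception in multimodal large language models (MLLMs) from a truly unified vision-language perspective. Existing reviews tend to focus on vision or language separately and rarely capture how the two modalities evolve into an integrated perceptual capability. The survey makes three contributions: it formalizes MLLM perception as an intrinsic, unified vision-language capability analogous to innate human perception; it introduces a five-stage taxonomy that traces the evolution of MLLM perception paradigms and surveys representative methods and milestones at each stage; and it identifies open challenges and outlines research directions toward general, unified multimodal intelligence. The authors intend it as both a foundational reference and an actionable roadmap on the path toward artificial general intelligence (AGI).
Link: https://arxiv.org/abs/2606.26196
Authors: Haoxiang Sun, Tao Wang, Li Yuan, Jian Zhao, Jiancheng Lv
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: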
Abstract:Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI’s O-series and DeepSeek’s R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective – one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and (3) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence (AGI).
[NLP-73] Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech
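[Summary]: This paper addresses the sharp performance drop of speaker diarization on low-resource languages, here Nepali-Hindi speech, where annotated data is scarce. End-to-end neural diarization performs well on high-resource languages such as English but generalizes poorly to underrepresented ones. The paper uses a multilingual training setup that combines English speech from LibriSpeech, diverse speaker recordings from VoxCeleb, and separately collected Nepali and Hindi audio, with the aim of reducing language bias and encouraging cross-lingual generalization. It compares two architectures: EEND with encoder-decoder attractors (EEND-EDA) and EEND with Perceiver-based attractors (DiaPer). On the Nepali-Hindi (NeHi) test set, DiaPer obtains diarization error rates (DERs) of 3.28%, 2.02%, 4.05%, and 4.76% in the 2-speaker, 3-speaker, 4-speaker, and mixed-speaker settings, against 1.50%, 9.68%, 16.17%, and 11.19% for EEND-EDA. DiaPer is therefore stronger in every setting except the 2-speaker one, where EEND-EDA has the lower DER. The results support Perceiver-based end-to-end diarization as a viable approach for low-resource multilingual speech.
Link: https://arxiv.org/abs/2606.26144
Authors: Samip Neupane, Sandesh Pokhrel, Sandesh Pyakurel, Basanta Joshi
Affiliations: Pulchowk Campus, Institute of Engineering, Lalitpur, Nepal
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 7 tables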
Abstract:Speaker diarization, the task of determining “who spoke when” in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While end-to-end neural diarization systems have achieved strong performance for English and other high-resource languages, their effectiveness degrades substantially for underrepresented languages where annotated speech data is scarce. This paper investigates speaker diarization for low-resource Nepali-Hindi speech through a multilingual training approach, comparing two modern architectures: EEND with encoder-decoder attractors (EEND-EDA) and EEND with Perceiver-based attractors (DiaPer). Both models are trained on a multilingual corpus combining English speech from LibriSpeech, diverse speaker recordings from VoxCeleb, and separately collected Nepali and Hindi audio, a setup designed to reduce language bias and encourage cross-lingual generalization. We evaluate both models across 2-speaker, 3-speaker, 4-speaker, and mixed-speaker scenarios on LibriSpeech, VoxCeleb, and Nepali-Hindi (NeHi) test sets. DiaPer achieves stronger overall performance than EEND-EDA, particularly in more challenging multi-speaker conditions, obtaining DERs of 3.28%, 2.02%, 4.05%, and 4.76% on NeHi 2-speaker, 3-speaker, 4-speaker, and mixed-speaker settings, respectively, compared to 1.50%, 9.68%, 16.17%, and 11.19% for EEND-EDA. These results demonstrate the viability of Perceiver-based end-to-end neural diarization for low-resource multilingual speech processing.
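The DER figures in this abstract follow the standard diarization error rate: missed speech, false-alarm speech, and speaker-confusion time, summed and divided by total reference speech time. The sketch below computes that ratio from durations in seconds. The function and parameter names are illustrative, and the abstract does not say whether the authors apply a forgiveness collar or score overlapped speech, so neither is modeled here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech
```

Because false alarms count against the system but are not part of the reference speech, the ratio can exceed 1.0 on very poor output.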
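[NLP-74] Thinking Like a Scientist? A Structural Study of LLM-Generated Research Methods
[Summary]: This paper examines the default methodological tendencies of LLMs when asked, under minimal prompting, to suggest research methods for a given research question. GPT-5.1, Gemini 3 Pro, and DeepSeek-V3.2 are each prompted with an LLM-extracted research question from 1,000 recent arXiv computer-science papers, and their suggestions are compared against a paper-derived experimental inventory after both are mapped into a shared taxonomy of structured method features. The strongest imbalance is in model-provider choice, where Jensen-Shannon divergence is about 3-5x larger than on any other taxonomy dimension. Other/Academic single-occurrence models are underrepresented by 23-24 percentage points, while reused academic/community models are slightly overrepresented (4-6 points). The suggested range of methods is also much narrower: the effective number of model entities contracts from 1,232 to 59-96, and inter-LLM rank correlations (0.55-0.68) generally exceed LLM-to-paper correlations (0.33-0.56), so the distortions are largely shared across models. Popularity baselines, BM25 retrieval calibration, and paper-level similarity tests confirm that the outputs are query-specific but filtered through a narrower set of options. Researchers who rely on LLM suggestions without cross-checking therefore risk narrowing their methodological search space toward a more concentrated default.
Link: https://arxiv.org/abs/2606.26130
Authors: Francesca Carlon, Brecht Verbeken, Vincent Ginis, Andres Algaba
Affiliations: Vrije Universiteit Brussel; imec-SMIT; Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 46 pages, 13 figures, 18 tables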
Abstract:Large Language Models (LLMs) are increasingly used to guide research methodology, yet their default methodological tendencies under minimal prompting remain unclear. Here, we prompt GPT-5.1, Gemini 3 Pro, and DeepSeek-V3.2 with an LLM-extracted research question from each of 1,000 recent arXiv computer-science papers and compare the resulting methodology suggestions against a paper-derived experimental inventory. Since we provide only the research question, the differences we measure reflect initial suggestions and not how optimal those suggestions are. We extract structured method features from both sources, map them into a shared taxonomy, and quantify divergence across multiple taxonomy dimensions including model provider, dataset task type, and evaluation metric type. The strongest imbalance appears in provider choice, with Jensen-Shannon divergence about 3-5x larger than any other taxonomy dimension. Other/Academic single-occurrence models are underrepresented by 23-24 percentage points, while reused academic/community models are slightly overrepresented (4-6pp). LLMs also suggest a much narrower range of methods overall: the effective number of model entities contracts from 1,232 to 59-96, and inter-LLM rank correlations (0.55-0.68) generally exceed LLM-to-paper correlations (0.33-0.56), so the distortions are largely shared across models. Popularity baselines, BM25 retrieval calibration, and paper-level similarity tests confirm that the outputs are query-specific responses, but filtered through a narrower set of options. Researchers who rely on LLM suggestions without cross-checking therefore risk narrowing their methodological search space toward a more concentrated default.
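Two quantities carry most of the argument in this abstract: the Jensen-Shannon divergence between the paper-derived and LLM-suggested category distributions, and the effective number of distinct entities. The sketch below computes the first in its standard base-2 form, which is bounded between 0 and 1. For the second it assumes the exponential of Shannon entropy. The abstract does not say which diversity index the authors use, so `effective_number` is an illustrative choice and not their definition.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def effective_number(p):
    """exp(Shannon entropy): the number of equally likely categories that
    would give the same entropy. ASSUMED index, not confirmed by the abstract.
    """
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.exp(entropy)
```

Under this index, a distribution spread evenly over four options has an effective number of 4, while one concentrated on a single option has an effective number of 1. That is the kind of contraction the abstract reports.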
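[NLP-75] Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
Summary: This paper addresses the high computational cost that diffusion large language models (dLLMs) face in long-sequence generation and real-time applications. The core difficulty is that compute grows on the order of L³ with sequence length L, and the non-autoregressive denoising process is incompatible with key-value caching, so conventional acceleration techniques are hard to apply. Existing methods rely on static caching or fixed parallel decoding strategies and cannot capture how token properties change across layers and decoding steps. The paper proposes Dynamic-dLLM, a training-free framework with two components: Dynamic Cache Updating (DCU), which adaptively allocates the cache-update budget according to layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically adjusts decoding thresholds to balance generation quality against inference efficiency. On LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct, evaluated on MMLU, GSM8K, and HumanEval, Dynamic-dLLM achieves an average inference speedup of more than 3x while maintaining performance, outperforms existing state-of-the-art acceleration methods, and offers a plug-and-play deployment path.
Link: https://arxiv.org/abs/2606.26120
Authors: Tianyi Wu,Xiaoxi Sun,Yanhua Jiao,Yulin Li,Yixin Chen,YunHao Cao,YiQi Hu,Zhuotao Tian
Affiliations: Harbin Institute of Technology, Shenzhen; Huawei; Shenzhen Loop Area Institute
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: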
Abstract:Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity scales on the order of L cubed with the sequence length L. This poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models like LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM significantly improves inference speed. It attains an average speedup exceeding 3 times while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment without compromising performance. The code is available at this https URL.
[NLP-76] From Lexicon to AI: A Structured-Data Pipeline for Specialized Conversational Systems in Low-Resource Languages
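Summary: This paper addresses a core challenge for low-resource languages in AI development: how to build specialized conversational systems without large-scale training corpora. The key to the solution is a systematic method for turning structured linguistic resources (such as lexical knowledge bases) into high-quality conversational AI systems. Concretely, the study converts Hindi WordNet into 1.25 million diverse instruction-response pairs and fine-tunes a 12B-parameter language model using resource-efficient LoRA with 4-bit quantization. On a Hindi language-learning chatbot task, the structured-knowledge-based system achieves markedly better pedagogical effectiveness (91.0 versus 79.4-83.6 for general-purpose models) while maintaining good semantic performance and very high response consistency. The method offers any language with WordNet resources a way to develop specialized AI without massive corpora, narrowing the AI accessibility gap for low-resource languages.
Link: https://arxiv.org/abs/2606.26112
Authors: Siddhant Hitesh Mantri,Dhara Gorasiya,Malhar Kulkarni,Pushpak Bhattacharya
Affiliations: NMIMS, Mumbai; CFILT, IIT Bombay
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures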
Abstract:Low-resource languages face a critical challenge in AI development: creating specialized conversational systems without access to massive training corpora. We present a systematic methodology for transforming structured linguistic resources into specialized AI systems, demonstrating that expert-curated lexical databases can serve as effective foundations for conversational AI development. Our approach converts Hindi WordNet into 1.25 million diverse instruction-response pairs and fine-tunes a 12B-parameter language model using resource-efficient LoRA with 4-bit quantization. Evaluation through a Hindi language learning chatbot demonstrates that structured-knowledge-based systems achieve superior pedagogical effectiveness (91.0 vs. 79.4-83.6 for general-purpose models) while maintaining competitive semantic performance and exceptional consistency. The complete pipeline demonstrates a proof-of-concept methodology, using Hindi, for developing specialized AI systems for any language with WordNet resources. This work addresses the critical gap in AI accessibility for low-resource languages, offering a practical alternative to corpus-intensive approaches and potentially enabling specialized AI development for the hundreds of languages with existing WordNet resources.
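The conversion from lexical entries to instruction-response pairs can be pictured with a minimal sketch. The `Synset` record, its fields, and the prompt templates below are all hypothetical; the abstract does not describe the actual Hindi WordNet schema or the templates the authors used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Synset:
    """Hypothetical, simplified lexical entry (not the real WordNet schema)."""
    lemma: str
    gloss: str
    synonyms: List[str] = field(default_factory=list)
    hypernym: Optional[str] = None


def synset_to_pairs(s: Synset) -> List[dict]:
    """Expand one entry into several templated instruction-response pairs."""
    pairs = [{
        "instruction": f"What does the word '{s.lemma}' mean?",
        "response": s.gloss,
    }]
    if s.synonyms:
        pairs.append({
            "instruction": f"Give synonyms of '{s.lemma}'.",
            "response": ", ".join(s.synonyms),
        })
    if s.hypernym:
        pairs.append({
            "instruction": f"What broader category does '{s.lemma}' belong to?",
            "response": s.hypernym,
        })
    return pairs
```

Each relation stored for an entry yields one more pair, which is one plausible way a single lexicon could expand into a large instruction set.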
[NLP-77] Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning
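Summary: This paper investigates why larger models outperform smaller ones on reasoning tasks, and in particular the reasoning differences behind the performance gap. Although much empirical evidence shows that larger models consistently lead on reasoning benchmarks in mathematics, physics, chemistry, and programming, the specific characteristics of their reasoning processes have lacked systematic analysis. The paper proposes AdvCluster, an automated framework that compares the reasoning traces of larger and smaller models on the same questions to identify questions where the larger model shows a stable advantage, extracts fine-grained advantage descriptions from the paired traces, and applies semantic clustering with quantitative evaluation and selection by a reviewer model to build a systematic taxonomy of reasoning advantages. The advantages fall into common features shared across domains and specialized features tied to particular domains. The recurring mechanism is Constraint-Guided Reasoning: larger models are better at identifying explicit and implicit constraints, organizing them into the reasoning process, and using them to rule out infeasible paths and verify intermediate steps, which improves the accuracy and robustness of the overall reasoning.
Link: https://arxiv.org/abs/2606.26108
Authors: Guan-Yi Lin,Hen-Hsen Huang
Affiliations: National Chengchi University; Academia Sinica
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures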
Abstract:Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we observe stable performance gaps: averaged over datasets, Qwen3-32B outperforms Qwen3-8B by 6.43%, while GPT-OSS-120B exceeds GPT-OSS-20B by 7.38%. To study the reasoning differences behind these gains, we develop AdvCluster, an automated framework that identifies questions where the larger model shows a stable advantage, extracts fine-grained advantage descriptions from paired reasoning traces produced by larger and smaller models, and organizes them through semantic clustering with quantitative evaluation and selection guided by a reviewer model. Our analysis yields a systematic taxonomy of larger model reasoning advantages, spanning both common advantages that recur across domains and specialized advantages associated with particular domains. Across these patterns, a recurring theme is Constraint-Guided Reasoning: larger models are better at identifying explicit and implicit constraints, organizing them into structured reasoning, and using them to rule out infeasible paths and verify intermediate steps.
[NLP-78] Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars
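Summary: This paper addresses the lack of emotional expression in sign language communication systems for low-resource languages, specifically the cross-modal conversion from spoken Nepali input to emotion-conditioned sign language animation. The core challenge is to combine speech recognition and emotion recognition efficiently under resource constraints and to drive a sign language avatar capable of emotional expression. The proposed solution is NEST-V1, a lightweight multimodal framework that uses a shared acoustic encoder for both automatic speech recognition (ASR) and emotion classification, reaching 81.1% ASR accuracy and 79.21% emotion recognition accuracy. The design improves parameter efficiency (a 37% reduction relative to separate model architectures), and the full model has only 22.1M parameters, which makes edge deployment feasible and provides a scalable basis for larger vocabularies and richer emotional expression.
Link: https://arxiv.org/abs/2606.26107
Authors: Jatin Bhusal,Salma Tamang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 9 tables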
Abstract:Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a proof-of-concept multimodal framework that demonstrates the feasibility of generating emotion-conditioned Nepali Sign Language avatars from spoken input. As a preliminary investigation, we focus on four common Nepali words (“thank you”, “hello”, “house”, “me”) across three emotional states (happy, neutral, sad) to validate our core technical approach. Our lightweight architecture employs a shared acoustic encoder for simultaneous Automatic Speech Recognition and emotion classification, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 labeled audio samples from 50 speakers. The system demonstrates 37% parameter efficiency compared to separate model architectures while maintaining a lightweight footprint with only 22.1M parameters suitable for edge deployment. This pilot work establishes the technical foundation for emotion-aware sign language translation in low-resource settings and provides a scalable framework for future expansion to larger vocabularies and more diverse emotional expressions. Our preliminary results indicate the viability of real-time, emotionally expressive sign language communication systems for the hearing-impaired community, with clear pathways for enhancement in subsequent development phases.
[NLP-79] Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints
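Summary: This paper addresses the risk that large language models (LLMs), when used in emotionally charged situations involving interpersonal conflict, frustration, and distress, may unintentionally escalate conflict through their conversational behavior. Prior safety research has focused mainly on explicit harms (such as toxic or policy-violating content) and has paid less attention to subtle but damaging communication patterns. The paper introduces Nonviolent Communication (NVC) principles through lightweight prompt-level constraints, recasting them as process-oriented dialogue guidelines: avoid attributing blame, attend to the user's emotional experience, and clarify before giving advice. The key point is that concise NVC-derived prompt constraints, with no change to model architecture, markedly reduce the likelihood of escalation and keep interactions stable even with highly resistant users. Experiments with a dual-agent simulation framework across multiple models and user resistance levels support the method, suggesting that simple, systematic communication norms can improve the trustworthiness and safety of LLMs in conflict-prone settings.
Link: https://arxiv.org/abs/2606.26106
Authors: Zhixing Sun,Shenghe Xu,Tao Li
Affiliations: Beijing University of Posts and Telecommunications; City University of Hong Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: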
Abstract:Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or policy-violating content, less attention has been paid to conversational behaviors that may unintentionally escalate conflict. In this paper, we investigate whether LLMs can be guided toward more de-escalating dialogue behavior through lightweight prompt-level constraints derived from Nonviolent Communication (NVC). We reformulate NVC principles as process-oriented guidelines that discourage blame attribution, emphasize attention to users’ emotional experiences, and encourage clarification before advice. Using a dual-agent simulation framework across multiple instruction-tuned models and user resistance levels, we show that NVC-constrained prompting consistently reduces conversational escalation and stabilizes interactions with highly resistant users. These results suggest that simple communication constraints can meaningfully improve the trustworthiness of LLM dialogue in conflict-prone settings.
[NLP-80] Context Recycling for Long-Horizon LLM Inference
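Summary: This paper addresses the performance degradation of large language models (LLMs) over long conversations caused by context window limits and inefficient token usage. The core challenge is to manage information redundancy and compute cost in long dialogues while keeping answer accuracy high. The solution is ContextForge, a context recycling system that combines structured query generation, external memory retrieval, and controlled synthesis to retain task-relevant information across turns. It reuses prior computation without replaying the full history, which lowers token consumption while preserving answer quality. On a 15-turn structured healthcare question-answering benchmark, compared with a baseline agent using the same underlying models, ContextForge maintains comparable response accuracy while improving reasoning consistency and reducing token usage. This supports context recycling as a practical way to extend LLMs to long-horizon tasks without larger context windows or retraining.
Link: https://arxiv.org/abs/2606.26105
Authors: Derek Thomas
Affiliations: Independent Researcher; contextforge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: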
Abstract:Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce ContextForge, a system for context recycling that maintains task-relevant information across turns by combining structured query generation, external memory retrieval, and controlled synthesis. The system enables efficient reuse of prior computation without relying on full context replay, reducing token overhead while preserving answer quality. We evaluate ContextForge using a 15-turn conversational benchmark that tests multi-turn reasoning, back-references, and domain shifts across structured healthcare queries. Compared to a baseline agent using identical underlying models, ContextForge demonstrates improved consistency and reduced token consumption, while maintaining comparable response accuracy. These results suggest that context recycling provides a practical approach for extending LLM capabilities in long-horizon tasks without requiring larger context windows or model retraining. Code and evaluation artifacts are available at this https URL.
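[NLP-81] Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare
Summary: This paper studies stance shift in generative AI on animal welfare, namely how the linguistic features of training data affect a model's disposition toward animal welfare. The core question is which linguistic features, when used as fine-tuning data for a large language model (LLM), significantly strengthen or weaken the model's pro-animal-welfare reasoning. The approach is to identify and quantify the effect of ten linguistic features on the change in model preference. Eight features produce statistically significant effects: assertive certainty, explicit moral vocabulary, emotional expression, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing all significantly strengthen the pro-animal-welfare stance, while hedged language and concrete sensory description weaken it. First-person perspective has no significant effect. The practical recommendation is that animal-welfare text likely to enter LLM training corpora should assert a position rather than describe a scene neutrally. The features that shift the model are those that make the writer's position explicit; features that carry content but withhold stance dilute the model's disposition.
Link: https://arxiv.org/abs/2606.26104
Authors: Jasmine Brazilek,Harper Dunn
Affiliations: Compassion Aligned Machine Learning (CaML); RunPod
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: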
Abstract:Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out animal-welfare benchmark, we measure how each of ten linguistic features changes Llama-3.2-1B’s preference for pro-animal-welfare reasoning when used as fine-tuning data. Eight of the ten features produce statistically significant shifts. Seven move the model toward stronger pro-animal-welfare reasoning: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing. Two move it the other way: hedged language and concrete sensory description both dilute the pro-animal-welfare stance. First-person perspective has no statistically significant effect. The practical recommendation for anyone writing animal-welfare text that may end up in LLM training corpora: assert a position rather than describe a scene neutrally. The features that shift the model are the ones that make the writer’s position explicit; the features that dilute it hold animal-welfare content but withhold stance.
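[NLP-82] Investigating LLM's Problem Solving Capability – a Study on Statics Questions
Summary: This paper evaluates how generative AI actually performs in mechanical engineering education, specifically on statics problems. Existing studies mostly rely on public general-purpose datasets and lack systematic analysis of discipline-specific problem types, with a particular gap for problems that combine multi-step reasoning with diagrams. The paper uses a model distillation process to extract 25 text-only statics questions from ChatGPT, then builds two further datasets by adding diagrams and varying the numerical values, to better reflect teaching scenarios. Experiments show that LLMs perform well on text-only problems, but accuracy drops markedly once diagrams are introduced and multi-step reasoning is required. Further analysis indicates that the drop is not primarily due to weak image recognition but to difficulty in applying visual information consistently across steps and in building complex reasoning chains.
Link: https://arxiv.org/abs/2606.26103
Authors: Tanner Culleton,Hung-Fu Chang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, Engineering and Technology Symposium 2026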
Abstract:Large Language Models (LLMs) have rapidly influenced many aspects of society, particularly education, due to their demonstrated ability to complete assignments and examinations across a wide range of subjects. Although prior studies have examined the educational impact of LLMs, much of the existing work relies on public or open problem datasets and lacks topic-specific analysis. In engineering education, especially within mechanical engineering, systematic investigations of LLM performance on specific problem types remain limited. Instead of using traditional methods that directly ask textbook questions to an LLM tool, our study adopts a model distillation process to evaluate LLM capabilities in solving statics problems. By distilling ChatGPT, we extracted 25 text-only statics questions and further constructed two additional datasets by adding diagrams and modifying their numerical values. Experimental results show that while LLMs perform well on text-only statics problems, their accuracy decreases when diagrams are introduced and the problems require multi-step reasoning. Further analysis suggests that this performance drop is not primarily caused by limitations in image recognition, but rather by difficulties in multi-step reasoning and in consistently applying extracted visual information across successive solution stages.
[NLP-83] Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training
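Summary: This paper examines whether task-oriented fine-tuning during post-training weakens ethical values (such as animal compassion) instilled earlier in training. Its central concern is whether data from different domains (helpfulness-oriented task data versus coding data) affect value retention differently during post-training. The key finding is that helpfulness-oriented post-training (both supervised fine-tuning, SFT, and reward-based reinforcement learning with GRPO) significantly damages animal compassion on the Animal Harm Benchmark (AHB 2.2), whereas coding-oriented post-training preserves or even improves it. On the English items of the Moral Reasoning Under Uncertainty benchmark (MORU), helpfulness training lowers general moral reasoning by 25.5 percentage points, but this effect vanishes in the multilingual setting, indicating that the reasoning degradation is language-dependent while the retention of animal compassion is consistent across languages. The results suggest that values instilled through mid-training are encoded more deeply and generalize better, while task-specific post-training may only strengthen surface-level reasoning and disrupt deeper value structure. For value-aligned models, the paper therefore recommends preferring non-helpfulness post-training data (such as coding data) to better preserve mid-trained values without sacrificing general reasoning ability.
Link: https://arxiv.org/abs/2606.26102
Authors: Jasmine Brazilek,Juliana Seawell
Affiliations: Compassion Aligned Machine Learning (CaML)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: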
Abstract:Standard post-training pipelines apply supervised fine-tuning (SFT) and reinforcement learning (RL) to make language models helpful, but these processes may inadvertently degrade values instilled during pre-training. We investigate whether the domain of post-training data differentially affects the retention of animal compassion values in a Llama 3.1 8B model mid-trained on compassion-oriented synthetic data, using both SFT (helpfulness via Dolly-15k vs. coding via Magicoder-110K) and GRPO (helpfulness via RLHFlow vs. coding via Magicoder), evaluated on the Animal Harm Benchmark (AHB 2.2) and MORU benchmark (Moral Reasoning Under Uncertainty). Helpfulness training significantly degrades animal compassion relative to coding training on AHB (SFT: 35.7% vs. 65.2%; GRPO: 18.7% vs. 32.0%), replicating across two independent helpfulness datasets and two training paradigms. On English MORU items, helpfulness training degrades general moral reasoning by 25.5 percentage points (46.4% vs. 71.9%), a striking gap that rivals the compassion effect in magnitude. However, this effect does not transfer cross-lingually: on the multilingual MORU benchmark, the domain effect disappears (SFT: 52.3% vs. 51.2%). In contrast, the animal compassion effect transfers consistently across languages, with Magicoder’s AHB percentage-point gain over the base model 4.5 times larger on non-English items than English items. This divergence suggests that values instilled through mid-training are encoded more deeply and cross-lingually than reasoning improvements from domain-specific post-training. These results suggest that, for labs building on value-laden mid-training, coding-domain post-training may better preserve mid-trained values than helpfulness post-training without harming general reasoning capabilities.
[NLP-84] Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
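Summary: This paper addresses the difficulty of separating supported answers from unsupported guesses when evaluating large language models (LLMs), without confounding the result with data contamination, prompt idiosyncrasy, or generic refusal behavior. The solution is a contamination-aware, multi-zone benchmark that measures, under frozen build-time labels, the transition from answerable knowledge to unknown items where abstention is expected, so that answering ability, willingness to abstain, and refusal behavior can be evaluated separately. The benchmark contains 1,200 items across five domains with explicit abstention expectations and contamination-risk metadata, and uses dual parsing (an official strict parser plus a normalized robustness parser) to keep results reproducible and robust. Experiments show that stronger instruction-tuned models exhibit some selective abstention, but overall calibration remains poor, answer-expected zones remain difficult, and benign items are still refused, which supports the benchmark's value for exposing the real boundaries of model knowledge.
Link: https://arxiv.org/abs/2606.26101
Authors: Renwei Meng,Bowen Zhang,Jian Wang,Xican Wang,Haoyi Wu,Xuanyan Qiu,Shengan Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures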
Abstract:Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining. Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions. The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM this http URL. The dataset is publicly available at this https URL.
[NLP-85] HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification
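Summary: This paper addresses the fact that existing media bias detectors classify each sentence independently and ignore the inter-sentence context that human annotators naturally rely on. The solution is HierBias, a hierarchical context-aware bias detection framework that formally models document-level context. Its main contributions are a context-conditioned bias probability, with a proof that using document context strictly reduces the Bayes error when inter-sentence mutual information is non-zero, and a multi-task generalization bound showing that jointly training binary bias detection and fine-grained bias type classification improves sample efficiency on small annotated corpora. Architecturally, HierBias combines a sentence-level RoBERTa encoder, a cross-sentence Transformer aggregator, and dual output heads (binary and four-class). Evaluated on BABE and BASIL, it achieves 0.853 F1 and 0.723 MCC, surpassing the previous best model by +2.6% F1 and +4.3% MCC (McNemar's test, p < 0.05). Ablation experiments confirm that each theoretical component contributes independently and consistently.
Link: https://arxiv.org/abs/2606.26100
Authors: Kaining Li,Ruichen Yan,Yuxin Dong
Affiliations: Xidian University
Subjects: Computation and Language (cs.CL)
Comments: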
Abstract:Media bias detection is a critical task for ensuring fair and balanced information dissemination, yet existing sentence-level approaches classify each sentence independently, ignoring inter-sentence contextual signals that human annotators naturally exploit. We present HierBias, a hierarchical context-conditioned media bias detector that formally models document context in bias prediction. We introduce the context-conditioned bias probability and prove theoretically that leveraging document context strictly reduces the Bayes error of sentence-level classification when inter-sentence mutual information is non-zero. A multi-task generalization bound further establishes that jointly training binary bias detection and fine-grained bias type classification improves sample efficiency on small annotated corpora. Architecturally, HierBias pairs a sentence-level RoBERTa encoder with a cross-sentence Transformer aggregator and dual output heads for binary detection and four-class type classification. Evaluated on BABE and BASIL, HierBias achieves 0.853 F1 and 0.723 MCC, surpassing the state-of-the-art bias detector by +2.6% F1 and +4.3% MCC (McNemar’s test, p < 0.05). Ablation experiments confirm that each theoretical component contributes independently and consistently.
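For readers less familiar with the two statistics reported here, the following sketch computes the Matthews correlation coefficient (MCC) for binary labels and a McNemar test on paired classifier disagreements. It uses the chi-square approximation with continuity correction; whether the authors used this variant or an exact test is not stated, so treat it as an assumption.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mcnemar_p(b, c):
    """Two-sided McNemar p-value (chi-square, 1 df, continuity-corrected).

    b: items model A got right and model B got wrong; c: the reverse.
    """
    if b + c == 0:
        return 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

Only the discordant counts `b` and `c` enter the test, which is why it suits comparing two classifiers on the same test items.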
[NLP-86] Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines
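Summary: This paper addresses the high context overhead of multi-agent pipelines that issue one large language model (LLM) call per agent, and the quality loss that appears when agent calls are merged to save tokens. Merging calls can cut input tokens substantially, but naive merging causes tool loss and prompt compression that degrade output quality. The paper proposes Agent Capsules, an adaptive execution runtime that treats multi-agent execution as an optimization problem with empirical quality constraints. The runtime measures coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling-mean output quality. Because adding more context does not relieve prompt compression, the framework uses an escalation ladder (standard, then two-phase, then sequential) that recovers quality by moving back toward per-agent dispatch. Without per-model configuration or training data, it matches or exceeds hand-tuned baselines: on a 14-agent competitive intelligence pipeline it uses 51% fewer input tokens in fine mode and 42% fewer in compound mode, with quality gains of 0.020 and 0.017 respectively; on a 5-agent due diligence pipeline it uses 19% fewer tokens than uncompiled DSPy and 68% fewer tokens than MIPROv2 at higher quality. Even before compound execution is enabled, automatic policy resolution, cache-aligned prompts, and topology-aware context injection bring it level with hand-tuned and compile-time baselines.
Link: https://arxiv.org/abs/2605.00410
Authors: Aninda Ray
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 17 pages, 7 figures. Code: this https URL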
Abstract:A multi-agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi-agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling-mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework’s escalation ladder (standard, then two-phase, then sequential) recovers quality by moving toward per-agent dispatch rather than by rewriting merged prompts. On LLM-judged quality, the controller matches a hand-tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per-model configuration. Against a hand-crafted LangGraph implementation of a 14-agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine-mode input tokens and 42% fewer compound-mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5-agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache-aligned prompts, and topology-aware context injection, matching both hand-tuned and compile-time baselines without training data or per-pipeline engineering.
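The quality gate and escalation ladder described above can be sketched roughly as follows. The class name, window size, floor value, and the exact escalation rule are assumptions made for illustration; the abstract states only that mode switches are gated on rolling-mean quality and that the ladder moves from standard to two-phase to sequential, with fine (per-agent) dispatch as the fallback.

```python
from collections import deque

# Ordered from most merged to fully per-agent dispatch (assumed ordering).
LADDER = ["standard", "two_phase", "sequential", "fine"]


class QualityGate:
    """Hypothetical controller: escalate one rung when rolling quality dips."""

    def __init__(self, floor, window=5):
        self.floor = floor
        self.scores = deque(maxlen=window)
        self.level = 0

    @property
    def mode(self):
        return LADDER[self.level]

    def record(self, score):
        """Add one judged quality score and return the mode for the next run."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        if window_full and mean < self.floor and self.level < len(LADDER) - 1:
            self.level += 1       # move toward per-agent dispatch
            self.scores.clear()   # restart the window for the new mode
        return self.mode
```

Clearing the window after each switch is a design choice in this sketch: it prevents scores earned under the previous mode from triggering a second escalation immediately.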
[NLP-87] Patent Representation Learning via Self-supervision
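Summary: This paper addresses the weakness of positive-pair construction in self-supervised representation learning for long, structured patent documents. In particular, the standard approach of generating positives through random dropout generalizes poorly for patents: its best configuration depends on the evaluation task and transfers weakly across tasks. The key idea is a patent-specific view construction strategy, mixed dropout–section positives: the title–abstract serves as the anchor view, and the positive is either another section of the same patent (claims, summary, background, drawings, or description) or an independent dropout re-encoding of the same view. This uses the internal structure of the patent as a training signal, with no need for IPC labels, citations, or relevance annotations. On graded EPO search-report retrieval, the family-level retrieval benchmark DAPFAM, and IPC subclass classification, the strategy outperforms calibrated dropout-only baselines and generic title–abstract augmentation, is competitive with citation-informed patent encoders, and performs strongly on the out-of-domain split of DAPFAM. Cross-section alignment diagnostics further show improved semantic consistency among the abstract, claims, and description of the same invention, indicating that patent sections are effective, transferable positive views for learning dense patent representations.
Link: https://arxiv.org/abs/2511.10657
Authors: You Zuo(ALMAnaCH),Kim Gerdes(LISN),Eric Villemonte de La Clergerie(ALMAnaCH),Benoît Sagot(ALMAnaCH)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: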
Abstract:We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title–abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout–section positives, a patent-specific view construction strategy in which the anchor is the title–abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, citations, or relevance annotations. We evaluate on graded EPO search-report retrieval, DAPFAM, a recently proposed family-level patent retrieval benchmark, and IPC subclass classification. Section-based positives improve over calibrated dropout-only and generic title–abstract augmentation baselines, are competitive with citation-informed patent encoders and a general-purpose embedding model, and perform strongly on the out-of-domain split of DAPFAM. Additional cross-section alignment diagnostics show that section-pair training improves compatibility among abstracts, claims, and descriptions of the same invention. These results indicate that patent sections provide effective self-supervised positive views for learning dense patent representations.
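A minimal sketch of how mixed dropout–section positives might be sampled is given below. The dictionary layout, the mixing probability `p_dropout`, and the function name are assumptions for illustration; the abstract specifies only that the anchor is the title–abstract view and that the positive is either a dropout re-encoding of that view or another section of the same patent.

```python
import random

SECTIONS = ("claims", "summary", "background", "drawings", "description")


def sample_positive(patent, rng, p_dropout=0.5):
    """Return (anchor_text, positive_text, kind) for one contrastive pair.

    kind == "dropout" means the positive is the same text as the anchor,
    to be encoded a second time under an independent dropout mask.
    """
    anchor = patent["title_abstract"]
    available = [s for s in SECTIONS if patent.get(s)]
    if not available or rng.random() < p_dropout:
        return anchor, anchor, "dropout"
    section = rng.choice(available)
    return anchor, patent[section], section
```

The encoder and contrastive loss are left out; the pair returned here would feed whatever objective the training loop uses.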
Information Retrieval
[IR-0] NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems
Link: https://arxiv.org/abs/2606.27243
Authors: Shaohua Liu,Liang Fang,Yilong Sun,Shudong Huang,Qingsong Luo,Xiaoyang Chen,Dongqiang Liu,Chuangang Ma,Zhenzhen Chai,Henghuan Wang,Shijie Quan,Changyuan Cui,Zhangbin Zhu,Peng Chen,Wei Xu,Lei Xiao,Haijie Gu,Jie Jiang
Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: 12 pages, 3 figures
Abstract:Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM coding agents optimize for runnable code, but runnable code does not imply a valid recommender architecture. Candidates may pass local tests while causing silent failures that degrade performance. We present NOVA, a level-aware agent harness for verification-aware architecture evolution. NOVA uses an architecture gradient, an SGD-inspired, non-differentiable update signal that aggregates prior modifications, verification diagnostics, metric feedback, and trajectory memory to guide the next modification. A verification cascade checks structure semantics, local executability, offline effectiveness, and online impact; invalid candidates are blocked early, with failure patterns recorded as forbidden directions. L1–L4 task-level control matches automation to task complexity and risk, routing high-risk tasks to Copilot for human oversight. Deployed in an industrial advertising system, NOVA achieves the highest effective pass rate on L2 ScaleUp and L3 Literature-to-Production tasks (54.5% and 60.0%), reduces silent failures compared with coding-agent baselines, and shortens one literature-to-production cycle by over 13x in human-attended time. In online A/B testing, the selected L3 candidate improves GMV on three pCVR objectives by +1.25%, +1.70%, and +2.02%, while reducing pCVR bias by 58.8%, 66.7%, and 37.3%. 
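The early-blocking behavior of the verification cascade can be sketched as below. The stage names follow the abstract (structure semantics, then local executability, and so on), but the function, the check signature, and the way failures are recorded are hypothetical and do not reflect NOVA's actual implementation.

```python
def run_cascade(candidate, checks, forbidden):
    """Run ordered checks; stop at the first failure and record it.

    checks: list of (stage_name, fn) where fn(candidate) -> (ok, reason).
    forbidden: list that accumulates (stage_name, reason) failure patterns,
    standing in for the "forbidden directions" memory.
    Returns (passed, failed_stage_or_None).
    """
    for stage, check in checks:
        ok, reason = check(candidate)
        if not ok:
            forbidden.append((stage, reason))
            return False, stage  # later, costlier stages are skipped
    return True, None


# Hypothetical cheap-to-expensive ordering with toy checks.
example_checks = [
    ("structure_semantics", lambda c: (c["layers"] > 0, "no layers")),
    ("local_executability", lambda c: (c["runs"], "crashes on a sample batch")),
]
```

Ordering the checks from cheapest to most expensive means an invalid candidate never reaches offline or online evaluation.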
[IR-1] TRUST: Item-Calibrated Interval Evidence for Temporal Session-Based Recommendation
Link: https://arxiv.org/abs/2606.27214
Authors: Linjiang Guo,Nitin Bisht,Shiqing Wu,Yifan Yin,Guandong Xu
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Temporal signals have been widely used in session-based recommendation to infer user interest. Existing temporal session-based recommenders primarily rely on absolute interval values, implicitly assuming that the same interval carries similar interest signals across items. However, we empirically find that this assumption does not hold: each item has its own interval distribution, so an interval should be interpreted relative to the item it belongs to. Based on this observation, we propose TRUST, a framework that evaluates each observed interval relative to the empirical interval distribution of the corresponding item. Specifically, we propose a score function to guide global neighbor sampling, session graph encoding, and final interest aggregation. Experiments on public datasets show that TRUST consistently improves over representative temporal and non-temporal baselines, and plug-in experiments further show that the proposed scoring function can improve existing temporal session recommenders as a model-agnostic method. Component-wise ablations further show that calibrating the temporal signals within each module, rather than removing the module itself, consistently improves neighbor sampling, session graph encoding, and interest aggregation.
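The idea of reading an interval relative to its item can be illustrated with an empirical-CDF score. This is only a sketch: the abstract does not give TRUST's score function, so the percentile-rank form, the neutral fallback of 0.5, and the function names below are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def build_item_intervals(events):
    """Group observed intervals (seconds) by item and sort them.

    events: iterable of (item_id, interval_seconds).
    """
    by_item = defaultdict(list)
    for item_id, interval in events:
        by_item[item_id].append(interval)
    return {item: sorted(xs) for item, xs in by_item.items()}


def calibrated_score(item_intervals, item_id, interval):
    """Fraction of the item's past intervals that are <= this interval."""
    xs = item_intervals.get(item_id)
    if not xs:
        return 0.5  # no history for this item: neutral score (assumption)
    return bisect_right(xs, interval) / len(xs)
```

With this scoring, a 20-second interval ranks as long for an item usually viewed for 5 to 20 seconds and as very short for one usually viewed for 10 minutes or more, which is the item-relative reading the abstract argues for.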
[IR-2] UniFormer: Efficient and Unified Model-Centric Scaling for Industrial Recommendation
Link: https://arxiv.org/abs/2606.27058
Authors: Bo Chen, Jinlong Jiao, Tijian Hu, Ruihao Zhang, Yanzhi Liu, Chenghou Jin, Qinglin Jia, Baixuan He, Hechang Pan, Yiwu Liu, Jian Liang, Chaoyi Ma, Ruiming Tang, Han Li, Kun Gai
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Recently, substantial progress has been made in industrial recommendation through component-centric model scaling, where individual components such as behavior modeling, feature interaction, or task modeling are independently scaled to improve model capacity. Although recent methods such as HyFormer and OneTrans further explore cross-module co-scaling by jointly modeling behavior and interaction, their designs are still confined to the feature space and lack a unified model-centric scaling framework over the overall modeling space. In this paper, we propose UniFormer, an efficient and unified model-centric scaling framework for industrial recommender systems. To improve efficiency, UniFormer decomposes the overall modeling space into feature and task spaces, which are modeled by stacked Feature-space Interaction Modules and Task-space Interaction Modules, respectively. Moreover, UniFormer introduces semantic-based tokenization scheme to enable user-item decoupling, thereby achieving request-level inference acceleration. To prevent preference collapse, UniFormer employs multi-sequence cross-attention to separately capture heterogeneous behavior patterns, followed by the self-attention to enhance interaction modeling. Besides, dedicated multi-view FFNs are introduced to support flexible and scalable parameter scaling across different modeling components. Extensive online A/B testing in two production scenarios, Kuaishou and Kuaishou Lite, shows that UniFormer consistently improves user engagement and interaction metrics, achieving gains of +0.101%/+0.260% in App Stay Time and +0.729%/+1.113% in Watch Time, respectively.
[IR-3] TriPAH: Imbalance-Aware Tri-Prompt Affinity Hashing for Cross-Modal Medical Retrieval
Link: https://arxiv.org/abs/2606.27010
Authors: Jiaming Bian, Songming Li, Yurui Song, Yunfei Chen, Yichao Cao, Jun Long
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 10 pages, 3 figures, 4 tables
Abstract:In the era of big medical data, efficient cross-modal retrieval is pivotal for evidence-based diagnosis and large-scale case management. Cross-modal medical hashing retrieval aims to enable efficient image-text search and support downstream tasks such as case-based reasoning and decision support by learning compact, semantically aligned binary codes. However, current methods suffer from semantic fragmentation due to noisy clinical language, long-tailed labels, and brittle quantization that weakens alignment. We propose TriPAH, a Tri-Prompt Affinity Hashing framework. TriPAH synthesizes ontology-grounded, patient-level prompts conditioned on normalized clinical cues to yield low-noise textual representations for initial alignment. A lightweight prompt-token mixer performs hierarchical, multi-granularity alignment and produces quantization-ready features under an asymmetric multi-task objective coupling multi-positive contrastive alignment, imbalance-aware classification, and progressive quantization regularization. A patient-level consistency module further stabilizes codes across complementary views. Extensive experiments on three public datasets demonstrate that TriPAH significantly outperforms state-of-the-art methods.
[IR-4] AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems
Link: https://arxiv.org/abs/2606.26859
Authors: Changxin Lao, Fei Pan, Guozhuang Ma, Han Li, Huihuang Lin, Jijun Shi, Kangzhi Zhao, Kun Gai, Mo Zhou, Qinqin Zhou, Quan Chen, Ruochen Yang, Shifu Bie, Shuang Yang, Shuo Yang, Wenhao Li, Wentao Xie, Xiao Lv, Xuming Wang, Yijun Wang, Yiming Chen, Yusheng Huang, Zhongyuan Wang, Zibo Zhao, Zijie Zhuang, Baoning Xia, Chao Liu, Chaoyi Ma, Chubo He, Dawei Cong, Feng Jiang, Gang Wang, Guilin Xia, Hanwen Xu, Jiahong Xie, Jiahui Qiao, Jian Liang, Jiangfan Yue, Jing Wang, Jinghan Yang, Jinghui Jia, Kan Qin, Lei Wang, Ming Li, Peilin Song, Pengbo Xu, Qiang Luo, Ruiming Tang, Shiyang Liu, Shuxian Jin, Tao Wang, Tao Zhang, Xiang Gao, Xianghan Li, Yingsong Luo, Yiwen Ning, Yongcheng Liu, Yuan Guo, Zhaojie Liu, Zhenkai Cui
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Authors are listed alphabetically by their first name
Abstract:Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge. We present AgentX, a production-deployed multi-agent system that fundamentally restructures this production function. AgentX operates as a self-evolving development engine: it autonomously generates, implements, evaluates, and learns from recommendation experiments at a scale and pace that no manual workflow can sustain. The system orchestrates four tightly coupled stages in a closed loop. A Brainstorm Agent synthesizes evidence from historical experiments, system architecture, data analysis, and external research into ranked, executable proposals. A Developing Agent translates each proposal into production-ready code through repository-grounded generation and multi-dimensional reliability verification. An Evaluation Agent conducts safe online rollout with guardrail-vetoed A/B judgment, converting both successes and failures into structured knowledge assets. A Harness Evolution layer (SGPO) then distills execution trajectories into semantic-gradient updates that continuously sharpen the agents themselves – making the system not merely automated, but self-improving.
[IR-5] A Shared IPTC Topic Space for Cross-Source Topic Modelling
Link: https://arxiv.org/abs/2606.26845
Authors: Din Iskakov, Sebastian Gonçalves, Marco Idiat, Mendeli Vainstein, Aline Villavicencio, Ronaldo Menezes, Rodrigo Wilkens
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Comparing topic attention across different media is hindered by a fundamental modelling problem: topic models fitted separately to each corpus produce corpus-specific topic spaces that cannot be aligned directly. This paper presents a reproducible framework that places corpora in a single shared topic space defined by a taxonomy. Discovered topics are obtained with guided BERTopic, scored against the ninety-four IPTC Media Topics’ taxonomy topics (level-1) through weighted keyword and target centroids, and then collapsed upward to seventeen IPTC parent topics by a maximum-similarity rule. The framework was developed and selected on a controlled New York Times 2011 corpus through a narrowing sequence: a broad model screen, a focused mapping refinement, a strict finalist comparison, a target-construction ablation, and a threshold calibration. In this corpus, the guided family retained substantially stronger mapped coverage than a zero-shot benchmark under stricter assignment thresholds, a parent-enriched target construction improved both coverage and parent consistency, and coverage declined gradually rather than collapsing as the assignment threshold was tightened. The contribution is an externally anchored method for constructing a shared topic space that enables reproducible cross-source topic comparison.
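The upward collapse by a maximum-similarity rule can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the scoring details, so plain cosine similarity stands in for the weighted keyword and target centroids, and the function names, the threshold value, the 2-d vectors, and the labels (`football`, `tennis`, `elections`) are toy placeholders rather than real IPTC codes.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_parent(topic_vec, targets, parent_of, threshold=0.5):
    """Score a topic against every target, keep the best score per parent,
    and return (parent, score), or (None, score) when below the threshold."""
    best = {}
    for target, vec in targets.items():
        parent = parent_of[target]
        best[parent] = max(best.get(parent, -1.0), cosine(topic_vec, vec))
    parent, score = max(best.items(), key=lambda kv: kv[1])
    return (parent, score) if score >= threshold else (None, score)


# Toy taxonomy: three child targets under two parents.
targets = {"football": [1.0, 0.0], "tennis": [0.8, 0.6], "elections": [0.0, 1.0]}
parent_of = {"football": "sport", "tennis": "sport", "elections": "politics"}

parent, score = assign_parent([0.9, 0.1], targets, parent_of)
strict_parent, _ = assign_parent([0.9, 0.1], targets, parent_of, threshold=0.999)
print(parent, round(score, 3), strict_parent)
```

Tightening `threshold` leaves more topics unmapped, which is the coverage trade-off the abstract's threshold calibration examines.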
[IR-6] From Vajrayana Tara to Bengali Baul: A Computational Study of Lexical Transmission Across Buddhist Shakta and Vaishnava Traditions in Bengal
Link: https://arxiv.org/abs/2606.26803
Authors: Joy Bose
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 2 figures, 4 tables. Code and corpus: this https URL Dataset: this https URL
Abstract:We present a computational corpus study of vocabulary relationships across eight tradition layers of Bengali and Sanskrit devotional literature spanning the 8th to 19th centuries, encompassing Buddhist Vajrayana, Shakta Tantra, Vaishnava, and Baul traditions. Using a corpus of 75 texts and TF-IDF character n-gram vectorization with cosine similarity analysis, we address the historically argued but previously unquantified claim that Buddhist Vajrayana vocabulary survived the collapse of the Pala monasteries and was absorbed into the Shakta Tantra tradition of Bengal. The central finding is a specificity result: the Gitagovinda (Vaishnava Sanskrit, 12th century) has zero cosine similarity to Shakta Kali texts, while Bridge Tara texts (Buddhist-Shakta transitional, same century, same language) have cosine similarity 0.54 to Shakta Kali. This 8.5-fold contrast between two Sanskrit traditions from the same century demonstrates that the Buddhist-Shakta vocabulary overlap is not a generic property of Sanskrit devotional literature but is specific to the Buddhist-Shakta transmission chain. Three Brihannilatantra Tara texts show Shakta-to-Buddhist vocabulary ratios of 2.0 to 4.0, constituting measurable evidence of lexical transition within that chain. Ramprasad Sen’s 18th-century Bengali Kali songs preserve Buddhist vocabulary residue including 56 occurrences of Tara alongside 103 occurrences of Kali. The Vaishnava Bengali tradition contributes a parallel chain to modern Baul vocabulary (similarity 0.29), slightly weaker than the Buddhist Sahajiya chain via Charyapada (0.31). These results provide the first quantitative multi-tradition corroboration of historically argued Buddhist-Shakta syncretism in Bengal.
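For readers unfamiliar with the measurement, here is a minimal pure-Python sketch of TF-IDF character n-gram vectors compared by cosine similarity. It is not the author's pipeline: the n-gram length, the smoothed IDF variant, and the three placeholder strings are assumptions for illustration, and none of the strings are taken from the corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Sparse TF-IDF vectors keyed by n-gram.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    total = len(docs)
    return [
        {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


docs = ["tara tarini", "kali tara", "govinda"]  # placeholder strings
vecs = tfidf_vectors(docs)
sim_overlap = cosine(vecs[0], vecs[1])
sim_disjoint = cosine(vecs[1], vecs[2])
print(round(sim_overlap, 3), sim_disjoint)
```

A similarity of exactly zero, as in the second pair, means the two texts share no character n-grams at all under the chosen n.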
[IR-7] ConvMemory v3: A Validity Context Layer for Conversational Memory via Target-Conditioned Relation Verification
Link: https://arxiv.org/abs/2606.26753
Authors: Taiheng Pan
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 22 pages, 3 figures
Abstract:Conversational memory retrieval optimizes relevance, yet a retrieved memory can be relevant and simultaneously outdated: a later turn updates, corrects, or supersedes it. ConvMemory v3 adds a validity context layer that detects and surfaces this update evidence through target-conditioned relation verification, sitting after the v1/v2 retrieval path. The core mechanism is a dual-evidence gate that conditions a relation judgment on the specific target proposition, scoring a (target, source) pair through the product of a MiniLM slot head and a DeBERTa-v3 slot head and gating it by conservative event/operation evidence. On a synthetic multi-hop validity benchmark the gate reaches 90.12% +/- 1.73 accuracy; through a real-data feedback loop that mines failure patterns but trains on synthetic pairs only, the verifier transfers to Memora role binding with zero target-side labels, reaching 98.8% +/- 0.9 group-all-correct. The deployed layer preserves retrieval by default: a context mode attaches structured validity metadata while keeping the candidate set and rank order fixed, and a query-conditioned demote mode is an explicit opt-in for dense current-state workloads, where it raises current-active H@1 from a never-demote baseline of 45.1% to 95.7% +/- 1.2 while protecting non-superseded memories at 99.4% recall. Six machine-verifiable safety contracts pin the layer’s behavior. Multi-hop graph propagation is validated as a mechanism; fully automatic construction of strict prerequisite edges is characterized as a boundary, since strict necessity requires counterfactual world knowledge. This report extends ConvMemory v1 (arXiv:2605.28062) and v2 (arXiv:2606.10842).
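A rough sketch of the two ideas described above: a product-of-heads score gated by event evidence, and a layer that only reorders results in an opt-in demote mode. Everything here is hypothetical scaffolding; the `Memory` class, `gate_score`, `apply_layer`, the 0.5 threshold, and the example probabilities are assumptions, not the ConvMemory API.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    validity: dict = field(default_factory=dict)


def gate_score(p_head_a, p_head_b, has_event_evidence):
    """Product of two head probabilities, zeroed when event evidence is absent."""
    return p_head_a * p_head_b if has_event_evidence else 0.0


def apply_layer(ranked, superseded_scores, mode="context", threshold=0.5):
    """Attach validity metadata to each memory.

    'context' mode keeps the candidate set and rank order fixed.
    'demote' mode (opt-in) moves superseded memories after the others.
    """
    for idx, mem in enumerate(ranked):
        score = superseded_scores.get(idx, 0.0)
        mem.validity = {"superseded_score": score, "superseded": score >= threshold}
    if mode == "demote":
        # sorted() is stable, so non-superseded memories keep their relative order.
        return sorted(ranked, key=lambda m: m.validity["superseded"])
    return ranked


ranked = [Memory("User lives in Paris"), Memory("User moved to Berlin")]
scores = {0: gate_score(0.9, 0.8, has_event_evidence=True)}

context_order = [m.text for m in apply_layer(ranked, scores)]
demote_order = [m.text for m in apply_layer(ranked, scores, mode="demote")]
print(context_order, demote_order)
```

Multiplying the two heads means both must agree before a pair scores highly, and the evidence gate makes the default behaviour conservative.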
[IR-8] Attributed But Not Incremental: Cannibalization-Corrected Attribution for Large-Scale Advertising KDD2026
Link: https://arxiv.org/abs/2606.26690
Authors: Donghui Li, Bowen Yuan, Zili Yang, Qinxin Chen, Lijing Song
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures. Accepted at ADKDD 2026
Abstract:In large-scale paid acquisition and growth advertising systems, production attribution outputs are widely used for daily budget allocation and channel diagnosis. However, paid-attributed conversions such as daily new users (DNU) may systematically overstate true incremental growth when paid channels overlap with organic demand, brand-driven traffic, or other acquisition channels. This attribution-cannibalization mismatch can distort incremental ROI measurement and budget decisions at scale. We propose an experiment-calibrated attribution correction framework that uses incrementality experiments as causal anchors to convert sparse lift measurements into daily correction estimates. To make the corrected signal actionable at production granularity, we further allocate calibrated cannibalization volume across business hierarchies under structural consistency constraints. Offline forward-in-time validation against channel-level incrementality experiment readouts shows that the proposed framework substantially reduces calibration error relative to raw attribution and fine-grained ML baselines. Deployed across multiple global TikTok markets, the system supported budget and traffic strategy adjustments that were followed by an approximately 15-percentage-point reduction in the measured cannibalization rate.
[IR-9] SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context
Link: https://arxiv.org/abs/2606.26654
Authors: Qinkai Zhang, Yanyan Zhao, Xin Lu, Yulin Hu, Pengtao Han, Bing Qin
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability – inferring what users care about from the multimodal traces they naturally leave behind. We introduce SocialPersona, a benchmark for evaluating whether multimodal large language models (MLLMs) can recover revealed preferences from longitudinal social-media timelines and use them in dialogue. Built from longitudinal timelines of 171 everyday, non-promotional social-media users, SocialPersona contains text, images, timestamps, and 2,597 human-verified preference tags across seven interest domains, separating stable interests from recent interests. It supports two tasks: constructing structured user profiles from multimodal context and generating responses aligned with inferred profiles. Experiments with proprietary and open-weight MLLMs show that models can identify broad interest domains, yet their performance drops on fine-grained and recent interests and degrades further when inferred profiles must be used to personalize dialogue. Together with evidence that text and images provide complementary preference signals, these results indicate that robust cross-modal, long-horizon user modeling remains a key challenge, and that SocialPersona can help measure and advance progress toward assistants that infer and act on revealed preferences.
[IR-10] Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs
Link: https://arxiv.org/abs/2606.26485
Authors: Xinyi Yan, Yingyi Zhang, Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:
Abstract:Microblogging platforms generate massive amounts of short, noisy, and dispersed user content, making automatic keyphrase extraction (AKE) an important but challenging task. Prior studies have used eye-tracking signals to improve microblog-based AKE because such signals reflect readers’ attention to salient words. However, eye tracking alone is limited by physiological, acquisition, and feature-decoding constraints. To address this issue, we investigate whether electroencephalogram (EEG) signals can complement eye-tracking signals for AKE. Using the ZuCo cognitive language processing corpus, we select 8 EEG features and 17 eye-tracking features and incorporate them into microblog-based AKE models. To reduce possible distortion of cognitive signals by model structures, we inject these features into the input of the soft-attention layer and the query vectors of the self-attention layer. We then evaluate different combinations of cognitive signals across AKE models. The results show that cognitive signals produced during reading consistently improve AKE performance, regardless of feature combinations and model architectures. EEG features bring the largest gains, while combining EEG and eye-tracking features yields performance between the two individual signal types, suggesting partial complementarity but also possible redundancy or noise. These findings indicate that EEG signals provide useful cognitive evidence for microblog-based AKE and that multimodal cognitive signals deserve further investigation.
[IR-11] Extracting Problem and Method Sentence from Scientific Papers: A Context-enhanced Transformer Using Formulaic Expression Desensitization
Link: https://arxiv.org/abs/2606.26481
Authors: Yingyi Zhang, Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:
Abstract:Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models’ reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments using large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F1 score compared to the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be not suitable for the task of problem and method extraction.
[IR-12] 3D Spatial Pattern Matching
Link: https://arxiv.org/abs/2606.26465
Authors: Nicole R. Schneider, Avik Das, Lukas Arzoumanidis, Abhijeet Ghodgaonkar, Hanan Samet, Youness Dehbi
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Spatial pattern matching is the process of matching query entities and constraints with database entities and relations. It has many applications, including similar region search, housing market search, landmark search, and road network matching. To our knowledge, all existing spatial pattern matching approaches frame the problem in a 2 dimensional space, where entities lie in a cartesian plane and relationships defined between them are contained in 2 dimensions. However, this problem framing has significant limitations when searching for real world entities that have height in addition to position. To address this limitation, we extend spatial pattern matching to 3 dimensions and provide a generalized definition of the problem. We describe a subgraph matching algorithm capable of resolving 3D spatial patterns over distance relations and release two 3D spatial pattern matching datasets, one synthetic and one containing real 3D building data from the city of Hamburg, Germany. We test our subgraph matching algorithm on both datasets and present results as a baseline for future methods to build upon.
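To show why height matters, here is a brute-force backtracking sketch of pattern matching over 3D distance relations. It is not the paper's subgraph matching algorithm; the data layout, the function name `match_pattern`, and the toy entities are assumptions chosen to keep the example short.

```python
import math


def match_pattern(pattern_nodes, pattern_edges, entities):
    """Enumerate assignments of pattern nodes to distinct database entities.

    pattern_nodes: {node: required_label}
    pattern_edges: [(node_a, node_b, min_dist, max_dist)] on 3D Euclidean distance
    entities:      {entity_id: (label, (x, y, z))}
    """
    order = list(pattern_nodes)
    results = []

    def consistent(assign):
        # Check every distance constraint whose two endpoints are both assigned.
        for a, b, lo, hi in pattern_edges:
            if a in assign and b in assign:
                d = math.dist(entities[assign[a]][1], entities[assign[b]][1])
                if not lo <= d <= hi:
                    return False
        return True

    def extend(i, assign):
        if i == len(order):
            results.append(dict(assign))
            return
        node = order[i]
        for eid, (label, _) in entities.items():
            if label == pattern_nodes[node] and eid not in assign.values():
                assign[node] = eid
                if consistent(assign):
                    extend(i + 1, assign)
                del assign[node]  # backtrack

    extend(0, {})
    return results


entities = {
    "t1": ("tower", (0.0, 0.0, 50.0)),
    "t2": ("tower", (100.0, 0.0, 10.0)),
    "c1": ("cafe", (0.0, 0.0, 0.0)),
    "c2": ("cafe", (100.0, 0.0, 0.0)),
}
# Query: a tower whose top is 40 to 60 units from a cafe.
matches = match_pattern({"T": "tower", "C": "cafe"}, [("T", "C", 40.0, 60.0)], entities)
print(matches)
```

Projected onto the plane, both tower and cafe pairs sit at distance zero and are indistinguishable; only the z coordinate separates the matching pair from the non-matching one.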
[IR-13] ProvenAI: Provenance-Native Traces of Evidence in Generated Answers
Link: https://arxiv.org/abs/2606.26449
Authors: Mohammad Faizan, Dalal Alharthi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that decomposes transparency in multi-hop question answering into three independently measurable layers: answer correctness, citation fidelity against benchmark supporting evidence, and per-document influence under leave-one-resource-out intervention. Targeting the HotpotQA distractor benchmark through a seven-stage pipeline covering data normalisation, retrieval indexing, citation-aware answer generation, attribution auditing, ablation-based influence estimation, batch evaluation, and interactive inspection, ProvenAI evaluates 7,405 validation examples drawn from a canonical corpus of 509,300 passages. The system achieves 53.53% answer accuracy alongside a mean citation-fidelity score of 71.55%, and a worked example surfaces what we call the citation-influence gap: a clean citation audit co-occurring with a profile in which one cited source registers only weak influence while seven uncited sources demonstrably shift the output. We formalise the relationship between the implemented surface proxy and a token-level KL-divergence target through a stated faithfulness condition, ground the framework in causal-mediation analysis and database-provenance theory, and discuss how the three measurement layers compose with cryptographic provenance architectures emerging in autonomous scientific discovery. ProvenAI establishes that meaningful transparency in retrieval-grounded QA requires traceable links across retrieved, cited, and behaviourally influential evidence as three distinct, independently measured layers.
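The leave-one-resource-out idea and the resulting citation-influence gap can be illustrated with a toy sketch. The binary "did the answer change" signal is an assumed surface proxy, and `toy_answer`, the documents, and the cited set are all invented for illustration; they are not ProvenAI's implementation or data.

```python
def leave_one_out_influence(answer_fn, question, docs):
    """Return {doc_id: 1.0 if removing that doc changes the answer, else 0.0}."""
    baseline = answer_fn(question, docs)
    influence = {}
    for doc_id in docs:
        ablated = {k: v for k, v in docs.items() if k != doc_id}
        influence[doc_id] = float(answer_fn(question, ablated) != baseline)
    return influence


def toy_answer(question, docs):
    """Stand-in for a generator: returns the first 4-digit token it finds."""
    for doc_id in sorted(docs):
        for token in docs[doc_id].replace(".", " ").split():
            if token.isdigit() and len(token) == 4:
                return token
    return "unknown"


docs = {
    "d1": "The bridge opened in 1932.",
    "d2": "The bridge is painted red.",
    "d3": "Tourists visit the bridge daily.",
}
cited = {"d1", "d2"}  # what the (hypothetical) answer cites

influence = leave_one_out_influence(toy_answer, "When did the bridge open?", docs)
cited_without_influence = {d for d in cited if influence[d] == 0.0}
print(influence, cited_without_influence)
```

Here `d2` is cited yet has no measurable effect on the answer, a miniature version of the gap between citation and influence that the abstract describes.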
[IR-14] GPUSparse: GPU-Accelerated Learned Sparse Retrieval with Parallel Inverted Indices
Link: https://arxiv.org/abs/2606.26441
Authors: Ashutosh Sharma
Subjects: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Learned sparse retrieval models such as SPLADE achieve retrieval quality competitive with dense models while preserving the interpretability and exact-match advantages of sparse representations. However, inference-time scoring still relies on CPU-bound inverted index traversal algorithms (WAND, Block-Max WAND), creating a fundamental bottleneck for real-time serving at scale. We present GPUSparse, a system for GPU-accelerated exact learned sparse retrieval that introduces: (1) a GPU-parallel inverted index with block-aligned, warp-coalesced posting lists; (2) a batched scatter-add scoring algorithm that processes hundreds of queries simultaneously; and (3) fused Triton kernels with an analysis of the tradeoff between work-efficiency and hardware utilization. On MS MARCO passage ranking (8.8M passages) with real SPLADE embeddings, GPUSparse matches CPU exact scoring to three decimals (MRR@10=0.383, equal to Pyserini SPLADE at this precision; Recall@1000=0.999 vs. dense matmul, the residual from floating-point tie-breaking) while providing a 235x speedup over Pyserini CPU at 8.8M documents (1.27ms vs. 298ms per query). Compared to Seismic (the fastest CPU sparse retrieval system), which trades 25% recall for speed (R@1000=0.738 vs. 0.983 exact), GPUSparse achieves exact scoring at 787 QPS throughput (batch 500) on the full 8.8M collection, with 1.3ms per query. Our document-parallel kernel reaches 62.6% of H100 peak HBM bandwidth, revealing a fundamental work-efficiency vs. bandwidth-efficiency tradeoff in GPU sparse retrieval. The reformulation of sparse scoring as scatter-add over an inverted index is shared with SPARe’s iterative mode; our contribution is its fused-kernel realization, which we measure to be 23-270x faster than a faithful SPARe iterative reimplementation.
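The reformulation of sparse scoring as scatter-add over an inverted index can be shown with a sequential CPU sketch. A GPU kernel would run the inner updates in parallel; this version only demonstrates the arithmetic. The function names, string term ids, and toy weights are illustrative assumptions.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to its posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def scatter_add_scores(index, query, num_docs):
    """Exact sparse dot products: walk each query term's posting list and
    add q_weight * d_weight into that document's accumulator slot."""
    scores = [0.0] * num_docs
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight  # the scatter-add step
    return scores


docs = [
    {"gpu": 2.0, "index": 1.0},
    {"sparse": 1.5},
    {"gpu": 0.5, "sparse": 1.0},
]
query = {"gpu": 1.0, "sparse": 2.0}

index = build_inverted_index(docs)
scores = scatter_add_scores(index, query, len(docs))
print(scores)
```

Only posting lists for terms that appear in the query are touched, so the result equals the full sparse dot product without visiting unrelated entries.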
[IR-15] TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization
Link: https://arxiv.org/abs/2606.26439
Authors: Ashutosh Sharma
Subjects: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:
Abstract:Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe bandwidth gap: naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded. We present TileMaxSim, a family of IO-aware Triton kernels that close this gap via (1) multi-query SRAM tiling that streams document embeddings through shared memory while accumulating per-query-token maxima in registers, reading each embedding from HBM exactly once; (2) dimension tiling that partitions the embedding dimension into 128-wide chunks, enabling scoring for d > 128 embeddings that overflow shared memory; and (3) fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring, 6.5x over fused PyTorch, 6.6-8.5x over this http URL, and 469x the scoring throughput of WARP’s CPU engine on the same node. TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency). We further show constant throughput from 100K to 500K documents, data-parallel multi-GPU sharding, robustness across dimensions 64-768, and FP16/BF16/FP32 support. Concurrent work independently develops an IO-aware fused MaxSim kernel; we differ in dimension tiling for d > 128 and fused product-quantization scoring.
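MaxSim sums, over query tokens, the maximum dot product against any document token. The sketch below shows in plain Python why the embedding dimension can be processed in chunks: partial dot products are accumulated tile by tile and the maximum is taken only afterwards. It is a sequential illustration, not the Triton kernel; the function names, the tile width of 2, and the toy embeddings are assumptions.

```python
def maxsim(query, doc):
    """Reference MaxSim: sum over query tokens of the best-matching doc token."""
    return sum(max(sum(a * b for a, b in zip(q, d)) for d in doc) for q in query)


def maxsim_dim_tiled(query, doc, tile=2):
    """Same score, computed by streaming the embedding dimension in tiles."""
    dim = len(query[0])
    total = 0.0
    for q in query:
        # One running partial dot product per document token.
        partial = [0.0] * len(doc)
        for start in range(0, dim, tile):
            q_chunk = q[start:start + tile]
            for j, d in enumerate(doc):
                partial[j] += sum(a * b for a, b in zip(q_chunk, d[start:start + tile]))
        total += max(partial)  # max is valid only once all tiles are accumulated
    return total


query = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
doc = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [2.0, 0.0, 1.0, 0.0]]

reference = maxsim(query, doc)
tiled = maxsim_dim_tiled(query, doc, tile=2)
print(reference, tiled)
```

Because a dot product is a sum over dimensions, the tiled accumulation gives the same scores as the reference while only ever holding one chunk of each embedding at a time.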
[IR-16] Hybrid privacy-aware semantic search: SVD-truncated document geometry and CKKS-encrypted query reranking under a restricted threat model
Link: https://arxiv.org/abs/2606.26373
Authors: Sergey Kurilenko
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Dense embeddings power semantic search and retrieval-augmented generation, but embedding-inversion attacks can reconstruct source text from a vector: when a vector database leaks, the documents behind it leak too. The textbook defences are extremes - encrypting the whole search homomorphically is sound but too slow at million-document scale, while privacy noise degrades ranking long before it protects. We study a middle path exploiting the asymmetry between the static collection and the dynamic query. The collection is protected geometrically: each vector is truncated onto a lower-dimensional SVD subspace and rotated by a secret orthogonal transform known only to the owner. The query is protected cryptographically: it is reranked under CKKS homomorphic encryption, so an honest-but-curious server never sees the query or the scores. CKKS parameters come from a small offline benchmark. We prove a tight lower bound on the reconstruction error of any attacker confined to the protected subspace. On one million documents and five encoders the scheme preserves ranking quality (slightly improving it on strong encoders, as a linear denoiser) at sub-second latency, and an off-the-shelf inversion attack on the protected space collapses to the noise floor. We then test stronger adversaries: a known-plaintext attacker recovers the rotation by orthogonal Procrustes from about as many leaked pairs as the retained dimension; the public product-quantization codes preserve most nearest-neighbour structure; and random-projection, calibrated-noise and BEIR baselines show the truncation is an encoder-dependent accuracy cost, not a free denoiser. We state the limits: query confidentiality is cryptographic, but document protection is an empirical obfuscation layer (SVD truncation plus a secret rotation), not a cryptographic primitive, and we delimit the threat model for each claim. 
[IR-17] Scoring Is Not Enough: Addressing Gaps in Utility-fairness Trade-offs for Ranking
Link: https://arxiv.org/abs/2606.26369
Authors: Shubham Singh, Ian A. Kash, Mesrob I. Ohannessian
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Scoring functions are used to represent the relevance of individual documents. In modern information retrieval or recommendation systems, they are often learned from data and play a pivotal role in ranking sets of documents or items in a way that maximizes utility to a query or user. With the recent interest in algorithmic fairness, the success of scoring has naturally led to methods that learn scores that simultaneously trade off fairness and utility. In this work, we show that in stark contrast with utility-centric objectives, scoring is sub-optimal in achieving all utility-fairness trade-offs. We establish this with a series of counter-examples with a generic fairness formulation. We show that the issue persists whether we have a deterministic scoring function or a randomized one, or whether we measure fairness at the scope of a single query or across multiple queries. On the positive side, we empirically demonstrate that semi-greedy post-processing has the potential to achieve much better trade-offs, often approaching the ideal of exhaustive post-processing in a tractable way.
[IR-18] From Clicks to Intent: Cross-Platform Session Embeddings with LLM-Distilled Taxonomy for Financial Services Recommendations
Link: https://arxiv.org/abs/2606.26277
Authors: Dianjing Fan, Yao Li, Kyaw Hpone Myint, Dwipam Katariya, Alexandre G.R. Day, Pranab Mohanty, Giri Iyengar
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Dianjing Fan and Yao Li equally contributed to this work. 7 pages, 1 figure
Abstract:Sequential user behavior modeling is widely adopted in industrial recommender systems; however, significant gaps remain in financial services, where pre-login web interactions and authenticated in-app experiences differ drastically. Specifically, pre-login web users typically explore new products, whereas logged-in app users focus on account servicing. Due to the challenge of cross-channel entity resolution (e.g., matching anonymous web sessions to authenticated mobile accounts), web-based intent signals remain underutilized for post-authentication personalization. Existing methods for capturing web-based intent are often ad-hoc and narrow, lacking the flexibility to support both quantitative downstream recommendations and qualitative understanding at scale. In this work, we propose a scalable and dual-purpose intent prediction framework for web-based interactions and demonstrate its applicability for personalization. Our approach transforms raw web clickstreams into two outputs: a self-supervised Transformer encodes multi-modal clickstreams into a compact session embedding, while an LLM-based taxonomy generation and distillation pipeline produces interpretable intent labels. Our system demonstrates that self-supervised clickstream representations combined with LLM-distilled taxonomies can jointly serve quantitative tasks and qualitative understanding in production: on the mobile homepage tile ranking task, the session embedding improves macro Recall@1 by 1.88% and reduces Log Loss by 13.38% over production baselines. On the user conversion prediction task, the embedding outperforms the LLM labels by 4.3% on micro F1, while the distillation layer delivers interpretable labels at ultra-low latency with only a 7% performance drop.
[IR-19] Lacuna: A Research Map for Machine Learning
链接: https://arxiv.org/abs/2606.26246
作者: Martin Weiss,Miles Q. Li,Alejandro H. Artiles,Yacine Mkhinini,Chris Pal,Hugo Larochelle,Nasim Rahaman
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 14 pages, 3 figures. Preprint
Abstract:Lacuna is a research map for machine learning that uses LLMs to turn papers and scholarly metadata into markdown summaries, concept elements, research directions, and research proposals. Each item keeps links to the primary source records and papers that support it. We release the map with web, markdown, and MCP interfaces. Across LitSearch, Multi-XScience-CS/ML, and ScholarQA-CS-ML, Lacuna outperforms OpenScholar with the strongest gains on LitSearch retrieval (Recall@10 0.538 vs. 0.424 for OpenScholar v3). We also evaluate Lacuna Deep Research, a multi-stage report agent over the map, on 25 ReportBench-ML survey tasks: Lacuna Deep Research reaches 0.052 citation F1, 0.339 citation precision, 99 expert-reference hits, and 7.82/10 RACE report quality, while GPT-Researcher reaches 0.039 F1, 0.290 precision, 72 hits, and 5.24/10 RACE.
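To make the citation metrics above concrete, here is a minimal sketch of set-based citation precision, recall, and F1. How ReportBench-ML actually matches citations is not stated in the abstract, so exact-match paper identifiers and the function name `citation_scores` are assumptions made for illustration.

```python
def citation_scores(cited_ids, reference_ids):
    """Return (precision, recall, f1) for a report's citations against a reference list."""
    cited = set(cited_ids)
    reference = set(reference_ids)
    hits = len(cited & reference)  # citations that also appear in the reference list
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(reference) if reference else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 4 citations, 2 of which appear in an 8-paper reference list.
precision, recall, f1 = citation_scores(
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2", "r3", "r4", "r5", "r6", "r7", "r8"],
)
```

Since F1 is a harmonic mean, it is pulled toward the smaller of the two values, so a report can have moderate citation precision and still score a low F1 when the reference list is much larger than the set of matched citations.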
[IR-20] Reducing Redundancy in Whole-Slide Image Patching for Scalable Indexing and Retrieval
链接: https://arxiv.org/abs/2606.26157
作者: Jialiang Geng,Ghazal Alabtah,Saghir Alfasly,Wataru Uegami,H.R. Tizhoosh
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of digital pathology has created an urgent need for efficient indexing and retrieval of whole slide images (WSIs). This need is intensified by emerging generative AI workflows, particularly retrieval-augmented generation (RAG), which require dependable similarity search to support high-stakes clinical decision-making. Yet the substantial cost of high-performance storage limits the scalability and accessibility of WSI indexing for many healthcare institutions. Consequently, methods that can reduce storage demands while preserving retrieval accuracy have become a critical research priority. We propose ARReST (Antithetical Redundancy Reduction Strategy), a principled oppositional framework that leverages redundancy across dissimilar tissue classes to markedly decrease the number of patches that must be indexed from each WSI. Instead of eliminating only within-class duplicates, ARReST identifies antithetical patches (those whose representations contribute minimally to cross-class discrimination) and prunes them from the searchable archive. This targeted reduction substantially compresses the index without sacrificing morphological diversity or retrieval fidelity. By minimizing superfluous patch representations, ARReST reduces storage footprint, lowers computational overhead, and accelerates similarity search across large pathology repositories. Extensive experiments on the TCGA repository (The Cancer Genome Atlas with 21 organs) demonstrate that ARReST achieves significant index compression while maintaining competitive retrieval performance. The observed storage savings of 3% to 60% (14% ± 13%) can be reliably achieved without compromising retrieval performance for many organs. The proposed strategy enables scalable, cost-efficient WSI indexing and is well-suited for next-generation retrieval-driven clinical AI systems.
人机交互
[HC-0] AI Healthcare Chatbots as Information Infrastructure: A Large-Scale Study of User-Reported Breakdowns
链接: https://arxiv.org/abs/2606.27302
作者: Muhammad Hassan,Ramazan Yener,Ece Gumusel,Masooda Bashir
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:AI healthcare chatbots are increasingly used to support health information seeking and self-management, yet their performance and impact on users remain understudied. This study examines over 15,000 user reviews from 59 AI healthcare chatbot apps to explore how these systems function in everyday informational and emotional contexts. Topic modeling and interpretive analysis identify three recurring breakdowns: access barriers and service unreliability, user experience and interaction quality, and billing and customer support issues. Privacy and security concerns are associated with the most negative experiences. By framing AI healthcare chatbots as information infrastructures, our findings highlight how failures in access, usability, and trust affect users, offering actionable insights for designers, policymakers, and information professionals aiming to improve digital health systems.
[HC-1] Reading the Same Data Differently: Interpretive Labor Across System Boundaries in Electronic Monitoring
链接: https://arxiv.org/abs/2606.27301
作者: Yibo Meng,Bingyi Liu,Zhiqi Gao,Shuai Ma,Hongyu Zhou
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Electronic monitoring (EM) systems are increasingly used in community corrections to enforce spatial, temporal, and behavioral rules through continuous sensing. While prior work has examined EM as a criminal justice tool or as a mechanism for compliance, less is known about how sensed data become meaningful in everyday practice. This poster examines EM as a dual-sided sensing system in which supervised individuals and authorities reason about the same data stream from different positions. Based on semi-structured interviews with 26 supervised individuals and 12 authorities in China’s community corrections system, we show that supervised individuals infer system logic from outcomes with limited visibility into how data are interpreted, while authorities reconstruct behavior from ambiguous traces using contextual knowledge, professional experience, and institutional procedures. We call this structural divergence interpretive misalignment. It emerges from asymmetric access to data, context, and reasoning processes, and it shapes behavior through probing, strategic adaptation, over-compliance, disengagement, and contestation. We contribute a CSCW account of continuous sensing as distributed interpretive work and identify design opportunities for making data-to-decision processes more legible, contestable, and accountable across system sides.
[HC-2] “Everyone Says Them”: Deception Typologies, Probabilistic Trust, and Grassroots Safety Knowledge Among Gay Dating App Users in China
链接: https://arxiv.org/abs/2606.27284
作者: Yibo Meng,Lyumanshan Ye,Yingfangzhong Sun,Bingyi Liu,Huidi Lu,Xiaolan Ding
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CSCW 26 EA
Abstract:Gay dating applications have become critical platforms for sexual minority men to seek relationships and community, yet they also expose users to deceptive interactions that remain underexplored in HCI and CSCW research. This study examines how gay male users in China experience, identify, and respond to deception on dating applications. Through semi-structured interviews with 22 participants across platforms including Blued, Aloha, Fanka, and Soul, we make three contributions. First, we identify a typology of deceptive practices extending beyond profile misrepresentation to encompass relational, emotional, financial, and commercial forms of deception. Second, we document the layered, probabilistic verification strategies users develop through long-term platform use, showing that trust assessment operates as a multi-signal, provisional process rather than a binary judgment. Third, we demonstrate that risk recognition is a collaborative practice shaped by the circulation of experience, the abstraction of recurrent tactics, and the codification of shared rules within the community.
[HC-3] Beyond Objects
链接: https://arxiv.org/abs/2606.27258
作者: Daniel Jackson
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注:
Abstract:A core principle of object orientation – that the functionality of a system can be partitioned amongst objects that correspond to individuals in the problem domain – has influenced how software has been specified, designed and implemented for more than fifty years. Later developments in software engineering sought to build on this principle. But in fact this partitioning is neither natural nor straightforward, and the problems that these later developments sought to mitigate – the fragmentation and conflation of functionality – were often, in fact, the inevitable consequences of this founding principle. An easier path to addressing these problems therefore starts by going back, abandoning object orientation, and replacing it with an alternative approach that decouples the individuals of the problem domain from the modules that partition functionality.
[HC-4] From Celebrities to Anyone: Characterizing AI Nudification Content, Technology, and Community Dynamics on 4chan
链接: https://arxiv.org/abs/2606.27234
作者: Chi Cui,Yixin Wu,Yang Zhang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 22 pages, 13 figures, 2 tables
Abstract:AI nudification uses generative models to create synthetic non-consensual sexually explicit imagery (SNEACI) of real individuals. Prior work has examined dedicated nudification platforms and model repositories, finding that most targets are female celebrities. However, the anonymous content community, where SNEACI is actively requested, generated, and exchanged, remains unexplored. In this work, we present a large-scale study of AI nudification in the wild, identifying 24,105 SNEACI items. We find a significant shift in target demographics: non-celebrity individuals now account for 55.8% of targets, compared to only 4.7% in prior studies, indicating that AI nudification has expanded from targeting public figures to increasingly harming individuals within users’ own social circles. Meanwhile, open-source models dominate production, with the Stable Diffusion family generating 42.7% of images and Wan generating 66.5% of videos, all driven by thousands of shared fine-tuned models and accessible tutorials. Yet the ecosystem runs on a small cohort of active producers (the most prolific produced 780 items) who drive community engagement, shape target demographics, and disseminate technical knowledge that lowers barriers for new producers. Our work provides an empirical understanding of how AI nudification operates in the wild, revealing the mechanisms that sustain this ecosystem and highlighting the urgent need for interventions in platform governance, technical safeguards, and affected individual protection.
[HC-5] Behind the Mask: A Taxonomic Analysis of Activities in Online Social Networks
链接: https://arxiv.org/abs/2606.27111
作者: Debora F De Souza,Gabriela Beltrao,Berta Chulvi,Sergio Dantonio,Mehmet Gokay Ozerim,Javier Torregrosa,Adrian Giron,Angel Panizo,Pablo Miralles Gonzalez,Helena Liz,Javier Huertas Tato,Sonia Sousa,Alejandro Martin,Monika Maciuliene,David Camacho
类目: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:
Abstract:The broadcast of disinformation in online social networks (OSN) is a growing concern examined across several disciplines, including human-computer interaction (HCI). The pervasive issue has been prompting novel approaches to identify the malicious actors behind the dissemination of deceptive and fabricated content. Analyzing the characteristics and activities of these actors, we designed a taxonomy informed by collaboration with subject matter experts (SMEs) and a review of the academic literature. Our study explores how to distinguish the characteristics, activities, and strategies of malicious actors on OSN and examines how they contribute to the spread of disinformation. We describe the design process and the application of the taxonomy in a case study analyzing anti-migration discourse in social media channels, and reflect on its potential to aid researchers and practitioners in the responsible design of network systems.
[HC-6] Urban Context and Travel Experience Events: An Exploratory Comparison of Two German Cities
链接: https://arxiv.org/abs/2606.27077
作者: Marie Güntert,Esther Bosch
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The presented study investigates events influencing public transportation experience in both urban (Hamburg) and rural (Tuttlingen) areas in Germany, with the aim of identifying events that affect travel experience and, as a result, travel behavior. Using a mobile application, 21 participants in Tuttlingen and 70 participants in Hamburg tracked everyday trips, providing real-time evaluations of travel experiences along with situational data. Multi-level regression analyses were applied to assess the impact of events such as punctuality, capacity offer, information about public transportation, and others on the on-trip experience. Results indicate that a sufficient public transportation capacity offer has the strongest positive effect in Tuttlingen, whereas a lack of punctuality and low personal well-being have the strongest negative effects. In Hamburg, a lack of punctuality and a negative information event have the largest impacts. These identified effects provide a foundation for decision-making and measures to improve local public transportation.
[HC-7] Floor Raiser or Ceiling Limiter? Differential Storytelling Outcomes with a Child-Centric GenAI System Across Individual Differences
链接: https://arxiv.org/abs/2606.27067
作者: Min Fan,Wanqing Ma,Xinyue Cui,Xiaolu Dai,Shengyu Huang
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI (GenAI) holds promise for democratizing creative literacy, yet whether it benefits all children equally remains unclear. Using a child-centric GenAI storytelling system for children aged 7-12, we conducted a mixed-methods within-subjects experiment (N = 40, Grades 2-6) comparing GenAI-assisted and traditional storyboard conditions. Three findings emerged. First, the GenAI-assisted condition was associated with a floor-raising convergence pattern, with the quality gap narrowing by 83.5%, driven by lower-end support and upper-end constraint mechanisms. This convergence was dimension-selective, improving creativity and richness while leaving coherence and narrative structure tied to baseline performance. Second, younger children more often selected semantically distant keywords while older children preferred semantically closer ones, although engagement orientation varied across individuals regardless of age. Third, image regeneration was positively associated with structural quality dimensions, though this association was attenuated after baseline control. We propose mechanism-contingent scaffolding as a design principle for adaptive GenAI storytelling systems serving diverse children.
[HC-8] What Holds Back Brain-Computer Interfaces? Uncovering Challenges and Opportunities in BCI-controlled Games for Cerebral Palsy Rehabilitation
链接: https://arxiv.org/abs/2606.26951
作者: Bastian Ilsø Hougaard,Kirstine Schultz Dalgaard,Kirstine Johanne Stougaard Klebæk,Hendrik Knoche,Mads Jochumsen
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Brain-computer interfaces (BCIs) offer promising avenues for cerebral palsy (CP) rehabilitation at home and in the clinic, using games that promote engagement and sustained training effort. Nonetheless, the design constraints of BCI-based CP rehabilitation remain unclear, especially how individuals with CP experience a sense of control through BCI, and how they experience computer-mediated game assistance. To address this gap, we present preliminary clinical and user perspectives on BCI-based CP rehabilitation, drawing on in-clinic insights from a CP therapist and experiential accounts from ten individuals with CP engaging with BCI game prototypes. Sporadic help in BCI games eased monotony, but also fostered doubts regarding agency. The therapist saw BCI rehabilitation as complementary to traditional training, facilitating the transition from playful exercises to autonomous, self-managed training. We outline key challenges and opportunities to inform and empower further design and research of BCI training for CP.
[HC-9] Continuous Behavioral Synthesis for Adaptive Health Dashboards: An LLM-Mediated Architecture Integrating Explicit Preference, Spatial Reorganization, and Attention Allocation Signals
链接: https://arxiv.org/abs/2606.26937
作者: Tiziano Santilli,Mina Alipour,Mahyar T. Moghaddam
类目: Human-Computer Interaction (cs.HC)
备注: 33 pages, Accepted at EICS 2026, Patras, Greece
Abstract:The engineering of adaptive user interfaces has traditionally relied on either rule-based systems encoding designer intuitions about user needs or machine learning approaches requiring substantial historical data before achieving effective personalization. We present a technical architecture that leverages Large Language Models as behavioral synthesis engines to enable immediate adaptation from sparse, heterogeneous user signals. Our system integrates three distinct behavioral channels, i) explicit micro-feedback on individual interface elements, ii) spatial priority inferred from manual widget reorganization through drag-and-drop interaction, and iii) attentional investment measured through dwell time during hover events, within a structured prompt engineering framework that continuously regenerates dashboard layouts while maintaining explanatory coherence. The architecture addresses the technical challenge of translating low-level interaction patterns into high-level design decisions through a layered prompt construction methodology that separates temporal context determination, behavioral signal extraction, explicit preference enforcement, and user profile synthesis. The approach combines manually specified behavioral interpretations and temporal heuristics with LLM-mediated synthesis, enabling the reconciliation of multiple simultaneous signals that would be difficult to encode through explicit rules alone. We demonstrate the system through an instantiation in the personal health monitoring domain, including an analytical evaluation of adaptation behavior across multiple scenarios and a working implementation managing fourteen distinct health metrics across seven widget visualization modalities. The evaluation compares profile-driven initialization with multi-signal behavioral adaptation and presents the resulting interfaces through representative post-adaptation screenshots.
[HC-10] Game Changers: Designing and Measuring Dynamic Feedback To Help Users Self-Regulate in a VR Pointing Game
链接: https://arxiv.org/abs/2606.26925
作者: Bastian Ilsø Hougaard,Scott Bateman,Iris Brunner,Lars Evald,Hendrik Knoche
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The way games dynamically convey information through feedback is critical to players’ ability to perform, learn, and improve. However, it is poorly understood how performance metrics impact player performance and perception in core game tasks like pointing or steering. With a virtual reality pointing task we systematically explored how three performance metrics driving the feedback affected players when rewarding short completion times, straight movements, or high peak speed across different points in time: continuously, at end-of-action, or at end-of-task. On average the dynamic feedback helped people point straighter and faster, while for some it had a small or opposite effect. The study quantitatively compared dynamic feedback across three forms with the metrics driving the form as the intended locus of quantitative comparison. Our work improves game designers’ basis for crafting dynamic feedback by helping them know when to employ feedback schemes that align with desirable game performance objectives.
[HC-11] Optimizing Human-Machine Interface for Real-Time AI Support in the Operating Room: the CVS Copilot
链接: https://arxiv.org/abs/2606.26886
作者: Lorenzo Arboit,Nicolas Chanel,Aditya Murali,Pietro Mascagni,Nicolas Padoy
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 3 figures
Abstract:Artificial intelligence (AI) systems for automated Critical View of Safety (CVS) assessment in laparoscopic cholecystectomy are nearing clinical translation. Beyond algorithmic performance, clinical safety and effectiveness depend on the quality of the human-machine interface (HMI). This work examines how AI-generated predictions should be presented and controlled intraoperatively. Seventeen surgeons, including residents, attending surgeons, and professors, took part in a mixed-methods, user-centered design study to optimize an intraoperative HMI for AI-assisted safe laparoscopic cholecystectomy. Interviews explored interaction modalities, timing of assistance, visualization strategies, and control mechanisms across surgical roles, and were analyzed using reflexive thematic analysis and human-factors heuristics. Most surgeons (16/17) supported the use of AI for intraoperative decision support while rejecting autonomous decision-making. Attendings preferred minimal AI feedback at decisive moments (13/14), whereas residents favored optional guidance (3/3) with confidence indicators and on-demand anatomical overlays. Across interviews, surgeons consistently prioritized visual, surgeon-controlled, minimally intrusive displays, with the strongest support for a minimal overlay (16/17) and on-demand anatomical segmentation (13/17). Recurrent concerns included persistent overlays, haptic feedback, and numeric confidence displays, although these were not uniformly raised across the cohort. These findings informed the design of CVS Copilot, a surgeon-controlled, role-adaptive HMI that provides AI-based CVS assessment with minimal default visualization and optional overlays.
[HC-12] MedSWFlow: An Open-Source LLM Workflow for Drafting Medical Social Work Case Plans
链接: https://arxiv.org/abs/2606.26884
作者: Yulin Mao,Shiyu Li,Shuping Song,Yuling Zhang,Yajun Song
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages, 8 tables, 2 figures
Abstract:We present MedSWFlow, an open-source, model-agnostic LLM workflow for drafting medical social work case plans. The framework translates professional case-planning tasks into six stages: assessment, problem analysis, goal setting, intervention planning, risk anticipation, and planned effect evaluation. Drawing on established social work and behavioral frameworks, MedSWFlow standardizes case inputs, builds structured case profiles, and generates reviewable assessment forms and service plans through staged prompting. The system is released as an open-source research framework for reproducible case-plan generation across LLM providers. Outputs are intended as practitioner-reviewed drafts rather than final service decisions. Source code: this https URL.
[HC-13] A bit of chaos and madness: The AI Assessment Scale and the work of assessment reform
链接: https://arxiv.org/abs/2606.26729
作者: Mike Perkins(1),Darius Postma(1),Jasper Roe(2),Susan Sisay(3),Craig Holdcroft(3) ((1) British University Vietnam, Vietnam, (2) Durham University, United Kingdom, (3) University of Staffordshire, United Kingdom)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative artificial intelligence (GenAI) has intensified pressure on universities to redesign assessment while maintaining integrity, equity, and validity. Structured frameworks such as the Artificial Intelligence Assessment Scale (AIAS) offer one response, but evidence of how staff experience their implementation remains limited. This qualitative study examines AIAS implementation at a private international university in Vietnam and a public university in the United Kingdom. Data from five focus groups with 30 academic staff were analysed using hybrid thematic analysis, with Critical AI Literacy used as a sensitising concept. Six themes were developed: recognising and integrating AI, facilitating conditions, building capacity, pathways to adoption, ethics in practice, and reframing pedagogy. Staff valued the AIAS as a shared language for legitimising GenAI use, clarifying boundaries, and prompting reflection on assessment design. However, implementation was shaped by governance, tool access, staff confidence, workload, integrity concerns, disciplinary context, and alignment with learning outcomes. The findings show that the AIAS could prompt authentic assessment design and student engagement, but may become a compliance layer when disconnected from learning outcomes, disciplinary context, and staff capacity. This study contributes empirical evidence on the institutional conditions through which GenAI assessment frameworks move from policy adoption to pedagogical enactment.
[HC-14] Modeling Adaptive Visual Search in Semantically Hierarchical Layouts
链接: https://arxiv.org/abs/2606.26725
作者: Saku Sourulahti,Jussi P. P. Jokinen
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper introduces a computational cognitive model to investigate how information grouping impacts visual search, a key consideration in user interface design. The model uses computational rationality to view user behavior as an adaptation to cognitive and task constraints. Our work highlights that humans use hierarchical task representations, exploiting semantic and visual structures to improve search efficiency within the constraints of the visual system. We validate this model with data from two human studies focused on visual search and semantic categorization, demonstrating that semantic grouping improves search performance when it aligns with spatial grouping. Our model replicates task durations and eye movement patterns. By improving understanding of how hierarchical memory structures are utilized in human cognition, the model extends previous visual search models. We showcase our model in the rapid prototyping and evaluation of semantic visual groupings within user interface wireframes, suggesting a pathway toward applications in more complex, real-world interface design.
[HC-15] Knowledge-Based Pull Requests: A Trusted Workflow for Agent-Mediated Knowledge Collaboration
链接: https://arxiv.org/abs/2606.26721
作者: Xinyu Zhang,Weiwei Sun
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:AI coding agents are changing the bottleneck in software collaboration: code is increasingly cheap, while understanding intent, negotiating scope, and governing long-term project responsibility remain costly. This paper proposes Knowledge-Based Pull Requests (KPR), a trusted workflow for agent-mediated software collaboration across trust boundaries, including open source, enterprise, vendor, contractor, and customer-driven settings. In KPR, an external collaborator’s local code, tests, and cleaned agent interaction trace are treated as knowledge sources rather than as the default merge candidate. Agents distill these sources into a human-confirmed knowledge package and render it into reviewer-facing forms such as design memos, risk checklists, test plans, or implementation briefs. A project-owned inner trusted coding agent then regenerates candidate code inside the receiving project’s environment under repository context, engineering conventions, tests, and security policy. KPR therefore separates two decisions that traditional pull requests often collapse: whether the knowledge should enter the project, and whether a particular implementation should be merged. We contribute the KPR workflow, a candidate artifact schema, a cost-accounting view, a collaboration gateway architecture, a minimal controlled simulation pilot over seven merged public pull requests, and an evaluation agenda. The pilot shows that KPR packages can be instantiated from real PR material and stress-tested under description ablation, diff ablation, and synthetic poisoned-patch conditions. We position KPR as an empirically testable workflow: its value depends on whether auditable extraction, transformation, and project-side regeneration reduce the cost of understanding and reworking high-context external changes.
[HC-16] From Content to Strategy: Understanding the Motivations, Processes, and Impacts of AI-Guided Communication
链接: https://arxiv.org/abs/2606.26672
作者: Chang Wan,Angel Hsing-Chi Hwang
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 1 figure
Abstract:Artificial intelligence-mediated communication (AI-MC) is conceptualized as applying AI to augment or generate message content (Hancock et al., 2020). However, advances in generative AI have expanded its use beyond generating content to guiding individuals’ communication strategies, that is, AI-guided communication, yet theoretical and empirical understanding of this emerging use pattern and its consequences remains limited. To address this gap, this study conducted 26 in-depth interviews with individuals who have used AI to develop their communication strategies. Findings suggest participants strongly preferred using AI to analyze challenging scenarios in close relationships, because it fostered self-reflection, eased emotions, prevented conflict escalation, offered multiple perspectives, and provided a safe, nonjudgmental space for self-disclosure. Participants also stated that AI-guided communication enhanced their empathy and communication skills, though some voiced self-doubt and worried about losing their uniqueness. Views on long-term relational impact were mixed, depending on perceived usefulness of AI for resolving short-term interpersonal challenges.
[HC-17] Invisible Impact of Empathy on Behavioral Change: Isolating the Effect of Empathy in Long-term Physical Activity Coaching Chatbot Interactions
链接: https://arxiv.org/abs/2606.26641
作者: Li Siyan,Kai-Hui Liang,Shopnil Shahriar,Yilin Ye,Shiyoh Goetsu,Wei-Wei Du,Masahiro Yoshida,Tsunayuki Ohwa,Xuhai Xu,Zhou Yu
类目: Human-Computer Interaction (cs.HC)
备注: Shortened version of paper accepted into CUI Short paper and WIP
Abstract:Current dialogue systems, powered by large language models, often treat empathy as essential without assessing its true impact, especially in behavior change, where motivation and adherence often depend on subtle user-chatbot dynamics. We examine this assumption by building three WhatsApp physical-activity (PA) coaching chatbots that differ only in empathy level and evaluating them in a six-week within-subject study (N = 13). Participants struggled to distinguish between the empathy conditions, and the non-empathetic version was often rated as more engaging and useful. However, higher-empathy variants were still associated with a larger overall average increase in step counts and faster improvement in intention to follow advice. These results suggest empathy’s role is nuanced: it may be hard for lay users to identify explicitly, but it can still shape motivation and trust that support sustained change. We interpret this pattern through the Elaboration Likelihood Model’s peripheral route. We highlight design implications for building next-generation PA coaching chatbots that balance effectiveness with human-like connection.
[HC-18] Reviving Reflection-in-Action: Instilling Designerly Thinking in AI-Supported Ideation through Multimodal Prompting
链接: https://arxiv.org/abs/2606.26626
作者: Samangi Wadinambiarachchi,Jenny Waycott,Greg Wadley
类目: Human-Computer Interaction (cs.HC)
备注: ACM Creativity and Cognition 2026 conference (CC’26)
Abstract:Current AI-powered creativity support tools (AI-CSTs) primarily use text prompting to generate solution-oriented outputs. However, the potential value of multimodal prompting in designer-AI interaction, specifically the introduction of productive friction to encourage iteration and reflection, has not been fully explored. To address this, we developed SketchifAI, a prototype AI-CST, and evaluated it with design students. In a mixed-methods, within-participants study, we examined how different input modalities (text, sketch, and sketch-plus-tags) affected design students’ perceived ability to express their intent, their perception of creativity support, and their divergent thinking performance. Our preliminary findings suggest that the sketch modality tended to enhance fluency, with inconclusive evidence for differences in variety, originality, or quality compared to text modality. Yet, paradoxically, participants showed a strong preference for text prompting. We discuss how AI tools might be designed to reintroduce reflection-through-sketching, ensuring that designer-AI interaction supports, rather than erodes, essential design skills in students.
[HC-19] HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization
Link: https://arxiv.org/abs/2606.26614
Authors: Kuangshi Ai,Patrick Phuoc Do,Chaoli Wang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:
Abstract:Large language model (LLM) agents enable natural language interaction for scientific visualization (SciVis). Still, prior systems have essentially prioritized autonomy over human analytical control, thereby limiting transparency and human oversight. We present HiLSVA, a human-in-the-loop agentic system that supports mixed-initiative SciVis workflows. HiLSVA integrates a plan-first multi-agent architecture with explicit human oversight, stepwise provenance tracking, and learn-at-test-time adaptation from user feedback. The system supports fluid handoff between humans and agents through both natural language and direct manipulation of visualizations, while sandboxed execution ensures safe, reproducible workflows. In doing so, HiLSVA reframes agentic SciVis as a collaborative process that augments, rather than replaces, human analytical reasoning. We evaluate HiLSVA through representative case studies and a controlled user study with twelve participants of varying expertise across multiple autonomy settings. Results show that mixed-initiative interaction improves task completion, user control, and workflow transparency across different levels of user expertise, while revealing a tradeoff between execution efficiency and human oversight. These findings highlight the importance of human-centered design in agentic SciVis and guide the development of future collaborative visualization systems. We encourage readers to explore our demo video, case studies, and source code at this https URL.
[HC-20] An exploratory behavioral and electroencephalographic study of artificial intelligence-assisted learning modes in high school students
Link: https://arxiv.org/abs/2606.26579
Authors: Kashika Khurana,Ally Liew
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 16 pages, 11 figures
Abstract:As artificial intelligence (AI) is rapidly integrating into education, concerns have emerged regarding its potential implications on cognitive engagement and problem-solving behavior. However, existing research largely treats AI exposure as a binary condition (AI vs. no-AI), with limited differentiation between interaction modalities and post-exposure effects. This study investigates whether distinct AI interaction modes (Tutor, Collaborator, Solver) influence frontal EEG spectral activity. Electroencephalography (EEG) data and quantified behavioral metrics were recorded from 48 study participants (24 males, 24 females; ages 14-18) across two counterbalanced quizzes in a within-subject design. Statistical analyses included Friedman tests, repeated-measures ANOVA, paired t-tests, and effect size calculations. Behavioral changes were mathematically analyzed in an observation matrix of three characteristics (Initiation, Processing, and Stress) measured on an ordinal scale. Each mode showed significant differences in all three behavioral measures. Descriptive EEG patterns in AI interaction mode were observed, and the possibility of short-term carryover effects of AI was explored. Although the EEG data did not reach statistical significance, the patterns observed across the three AI interaction modes warrant further investigation. This study provides preliminary behavioral evidence and investigative electrophysiological observations, exploring possible AI-interaction-mode-based differences in neural activity and behavior, while establishing a replicable framework for future human-AI interaction studies.
[HC-21] Pingquanqi (Equalizer): A Cross-Domain Sociotechnical Framework for Human-Agent Interaction Governance
Link: https://arxiv.org/abs/2606.26573
Authors: Yu Wang
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 31 pages
Abstract:LLM agents are transitioning from experimental tools to permanent infrastructure – a computational layer as enduring as the electrical grid. Like any infrastructure, they carry a cost chain from physical capital through enterprise investment to user consumption, ending at the user’s most irreplaceable resource: lifetime. When unoptimized, this chain leaks, consuming user lifetime without adequate compensation. This paper proposes Pingquanqi (Equalizer), a cross-domain sociotechnical framework for Human-Agent Interaction Governance (HAIGF). Its product form is an Agent framework-level embedded design specification, analogous to WCAG for web accessibility, whose goal is not to be purchased but adopted as a standard. Pingquanqi consists of four integrated components deployable as native middleware: (1) a user-state discrimination model enabling proactive knowledge leveling, (2) a Bayesian progressive stop-loss rule capping per-session interaction cost, (3) controlled friction mechanisms breaking self-reinforcing dependency loops, and (4) Lsteal, a transparency metric rendering token-to-lifetime cost conversion visible. A fifth mechanism, reflective summarization (F5), enables guided cognitive recollection. The framework is grounded in cross-cultural philosophy: Mao’s epistemology of practice (On Practice, 1937) provides the basis for cross-session knowledge accumulation; Wang Yangming’s unity of knowledge and action (zhi xing he yi, c. 1509) illuminates Lsteal’s root – knowing without acting is incomplete; and Hegel’s unity of theory and practice demonstrates cross-traditional convergence. This paper argues Pingquanqi’s primary economic beneficiary is the enterprise deploying Agent services – through reduced wasted computation, improved user satisfaction, and sustained subscription revenue – with individual user benefit as the natural downstream consequence.
[HC-22] Co-Designing Community-Centered AI Education for Adults: A Midwestern Case Study
Link: https://arxiv.org/abs/2606.26565
Authors: Yao Lyu,Leonymae Aumentado,Holden Winton,Jared Lee Katzman,Sparkle Berry,Zachary Rowe,Kimberly Sanders,Tawanna R. Dillahunt
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to CSCW 2026 Poster
Abstract:Artificial Intelligence (AI) education is increasingly important, yet adults outside higher education receive less attention. We report a case study of an AI education session with 54 adults (48 in-person and 6 virtual) in a predominantly African American community on the east side of a major Midwestern city. We ask: “What does AI education for adults outside formal educational systems look like in practice?” and “What does this AI education session reveal about AI literacy at the community level?” Through a co-designed session developed with community partners, we found that concerns about AI persisted but shifted to specific, locally grounded questions about AI design and deployment. We also discuss AI literacy from a community capacity perspective and argue for AI literacy frameworks grounded in local community contexts that strengthen community capacity.
[HC-23] Budget-Aware Keyboardless Interaction
Link: https://arxiv.org/abs/2606.26508
Authors: Quang-Thang Nguyen,Gia-Phuc Song-Dong,Minh-Triet Tran,Trung-Nghia Le
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2024
Abstract:Interacting with computers typically relies on traditional input devices such as keyboards, mice, and monitors, which can be cumbersome for users seeking greater mobility. Virtual keyboards have been explored to address these limitations, but they often involve complex setups or expensive equipment. This paper proposes a novel virtual keyboard system that leverages only a standard camera and a paper with a printed keyboard layout. Unlike previous methods requiring complex calibration or special lighting conditions, our approach works in standard environments using modern computer vision technologies. Combining modern segmentation and detection models with traditional image processing algorithms, we efficiently identify the keyboard region. Touch detection is performed using an algorithm analyzing the color of the user’s fingernail. Experiments demonstrated promising keyboard and keystroke detection results for our proposed solution in practical applications. Participants in our user study also found the proposed system interesting.
[HC-24] DanceDuo: Bridging Human Movement and AI Choreography
Link: https://arxiv.org/abs/2606.26507
Authors: Gia-Cat Bui-Le,Tuong-Vy Truong-Thuy,Hai-Dang Nguyen,Trung-Nghia Le
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2024
Abstract:In recent years, advancements in deep learning and generative models have revolutionized music-driven dance generation. This paper introduces a novel platform, namely DanceDuo, leveraging diffusion models to generate AI-choreographed dance sequences synchronized with a variety of music genres, to encourage dancing practice. The system allows users to interact with AI by selecting music tracks, humanoid models, and importing personal dance videos for comparison, fostering a rich and engaging user experience. DanceDuo not only offers dance generation but also integrates human pose estimation models to provide users with insightful comparisons of their own performances with AI-generated sequences. We conducted a comprehensive user study, revealing that users found the interface intuitive, with particular praise for the dance comparison feature. Our DanceDuo contributes significantly to the integration of AI in dance choreography, offering novel avenues for both recreational and professional applications.
[HC-25] TinyCNNDeep: Lightweight Attention-Based CNN for EEG Classification of Eye States and Sleep Deprivation
Link: https://arxiv.org/abs/2606.26506
Authors: Thien Nhan Vo,Yen Nhi Tran Thi,Ngan Nguyen Xuan Phuong,Xuan-The Tran
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Sleep deprivation impairs vigilance and cognitive function, yet jointly identifying the sleep condition (normal vs deprived) and the eye state (open vs closed) from electroencephalography (EEG) remains underexplored. We address this four-class problem with TinyCNNDeep, a lightweight convolutional neural network that combines residual learning with a Squeeze-and-Excitation (SE) attention module. We convert short multi-channel EEG segments from five physiologically relevant channels (Fp1, Fp2, O1, Oz, O2) into 224x224 grayscale images through per-channel Z-score normalization, min-max scaling, and center padding, enabling 2D convolutions to jointly model inter-channel and temporal structure. On a 35-subject dataset recorded under normal-sleep and sleep-deprivation sessions, TinyCNNDeep attains a subject-wise mean accuracy of 83.69%, outperforming the strongest baseline (Random Forest with combined time-frequency features, 47.66%) by 36.03 percentage points, while three established EEG architectures (EEGNet, ShallowConvNet, DeepConvNet) operate near chance. Per-subject analysis quantifies inter-subject variability, and confusion-matrix inspection shows that residual misclassifications concentrate between eyes-closed states across sleep conditions. These results indicate that an image-based EEG representation paired with residual feature extraction and channel attention provides an accurate and computationally efficient framework for multiclass sleep-related EEG classification under a minimal electrode configuration.
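The preprocessing described in the abstract above (per-channel Z-score normalization, min-max scaling, center padding to 224x224) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the channels-as-rows layout, the use of a single global min-max over the whole segment, and zero-valued padding are all assumptions that the abstract does not specify.

```python
from statistics import fmean, pstdev


def zscore(channel):
    """Z-score one channel; a constant channel maps to all zeros."""
    mu = fmean(channel)
    sigma = pstdev(channel)
    if sigma == 0:
        return [0.0 for _ in channel]
    return [(v - mu) / sigma for v in channel]


def segment_to_image(segment, size=224):
    """Turn a (channels x samples) EEG segment into a size x size grid in [0, 1].

    Hypothetical helper: one channel per row, global min-max scaling,
    zero padding around a centered block (all assumed, see lead-in).
    """
    rows = [zscore(ch) for ch in segment]
    n_rows, n_cols = len(rows), len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all channels must have the same number of samples")
    if n_rows > size or n_cols > size:
        raise ValueError("segment does not fit in the target image")

    # Min-max scale the normalized values to [0, 1].
    lo = min(min(r) for r in rows)
    hi = max(max(r) for r in rows)
    span = (hi - lo) or 1.0
    scaled = [[(v - lo) / span for v in r] for r in rows]

    # Center the block on a zero canvas.
    top = (size - n_rows) // 2
    left = (size - n_cols) // 2
    image = [[0.0] * size for _ in range(size)]
    for i, r in enumerate(scaled):
        image[top + i][left:left + n_cols] = r
    return image
```

Calling `segment_to_image` on five channels of 100 samples each yields a 224x224 grid in which only a 5x100 block near the center carries signal; a real pipeline would more likely use NumPy arrays and might resize rather than pad.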
[HC-26] Same Scrutiny More Time: Eye Tracking Insights into Reviewing LLM-Labelled Code
Link: https://arxiv.org/abs/2606.26505
Authors: Ranim Khojah,Francisco Gomes de Oliveira Neto,Mazen Mohamad,Julian Frattini,Philipp Leitner
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)
Abstract:Modern software development increasingly involves the use of large language models (LLMs) to generate code. Despite their rapid advancement, LLMs remain prone to errors and hallucinations, emphasizing the importance of careful code inspection. However, in practice, developers’ trust in LLM-generated code and their willingness to review it thoroughly may differ from these recommendations. How developers actually behave when reviewing LLM-generated code remains largely unexplored. In this study, we conduct a Wizard-of-Oz experiment to examine how software engineers behave when code is explicitly labeled as LLM-generated during a code review task. We collect both behavioral data and participant feedback through eye-tracking and exit interviews. Combining Bayesian data analysis with qualitative analysis, we found that while the thoroughness of code review did not change for participants, they spent more time fixating on LLM-labelled code, indicating that the label itself influences attention. Practitioners also adapted their review strategy for LLM-labelled code by assessing the code based on specific criteria (e.g., logical correctness), or using the prompt to guide their review. These findings inform LLM-based tool design on labelling while incorporating the prompt as a software artifact. Our study reveals a gap between reviewers’ intentions and actual reviewing behaviour, highlighting the need for software companies to revisit their AI policies (particularly regarding LLM-assisted development) to better support developers in reviewing LLM-generated code.
[HC-27] Assistive Visual Cues for Visual Neglect Patients
Link: https://arxiv.org/abs/2606.26407
Authors: Per Bjerre,Andreas Køllund Pedersen,Hendrik Knoche
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Previous research on exogenous and endogenous cues has shown how they direct attention and improve interaction speed and error rate in applications. However, most studies focus on people with normal sight. People suffering from visual neglect have difficulties attending to parts of the visual field. One treatment method calls for the use of strong visual cues to remind patients of their neglected area and help guide their attention to it. Therefore, we examine the effects of endogenous and exogenous cues on visual neglect patients. Our results showed that visual neglect patients perform better with endogenous cues when targets are within their neglected area. In some cases, combining exogenous and endogenous cues improves performance further. However, the performance varies greatly between patients. Using one neglect patient as an example, we saw that the best endogenous cue had an average acquisition time of 3.5 seconds compared to 6.5 seconds for the best exogenous cue. Combining exogenous and endogenous cues further improved acquisition time to 2.8 seconds.
[HC-28] Charting the Growth of Social-Physical HRI (spHRI): A Systematic Review Pipeline Augmented by Small Language Models
Link: https://arxiv.org/abs/2606.26382
Authors: Mayumi Mohan,Ju-Hung Chen,Alexis E. Block
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: 5 pages, 3 figures, 2 tables, Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
Abstract:Social-physical human-robot interaction (spHRI) has grown rapidly across robotics, human-computer interaction, human-robot interaction, and haptics. Yet, fragmented terminology and inconsistent methodologies make systematic synthesis difficult. To support scalable review practices, we evaluated the extent to which small language models (SLMs; 1.5B parameters) can assist with title and abstract screening for a large spHRI systematic review. While no SLMs matched human reviewers’ performance, the models operated locally and screened papers orders of magnitude faster. The combined SLM ensemble identified 39 papers reviewers missed, representing 10.29% of the final relevant dataset. These results demonstrate that SLMs can augment, rather than replace, expert reviewers and make large-scale literature reviews accessible and sustainable.
[HC-29] Having Dog Ears “for Real”: Effects of Active and Passive Haptics on Embodying Non-Human Body Parts in VR
Link: https://arxiv.org/abs/2606.26364
Authors: Omar A. Khan,Bao Han Trinh,Lee Lisle,Tiffany D. Do
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2026
Abstract:Embodying non-human body parts in VR is a prevalent practice among certain subcultures and is a personally important creative outlet to many individuals. However, the discrepant morphology between real and virtual bodies can decrease Sense of Embodiment (SoE). Haptic feedback can compensate by increasing SoE felt towards non-human body parts, but there is a literature gap in comparing the effects of different haptic modalities, and their combinations, on SoE. Through an online survey sent out to social VR communities (n = 63), we determined that animal ears are a commonly embodied and ecologically valid non-human body part to study. We then ran a 2x2 within-subjects user study (n = 28) with two independent variables: active haptics, delivered through vibrotactile gloves, and passive haptics, delivered through a physical headband, for when participants reach up to touch virtual dog ears appended to their avatar in VR. Our findings show that (1) passive haptics produced the strongest overall embodiment outcomes, (2) combining modalities reduced the benefits of passive haptics, and (3) SoE towards non-human body parts positively correlates with SoE towards the entire avatar. We discuss implications of our findings in various domains, and on embodiment literature.
[HC-30] The Governance Inversion Hypothesis: Why More AI Regulation May Produce Less Organisational Control
Link: https://arxiv.org/abs/2606.26117
Authors: Victor Frimpong
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:This paper introduces the Governance Inversion Hypothesis (GIH) to explain a growing paradox in artificial intelligence (AI) governance: under conditions of increasing regulatory expansion and technological complexity, organisations may become more formally governed while simultaneously experiencing a decline in operational control over AI systems. Existing AI governance frameworks generally assume that stronger regulation improves accountability, oversight, and organisational control. This paper challenges that assumption by arguing that governance formalisation itself may contribute to the erosion of control in AI-intensive environments. Drawing on institutional theory, organisational governance research, accountability scholarship, and emerging AI governance literature, the paper develops a conceptual framework explaining how regulatory expansion may weaken operational authority through four interconnected mechanisms: authority fragmentation, symbolic governance expansion, externalisation of control, and authority paralysis. As governance systems become increasingly layered and procedurally dense, organisations may struggle to maintain coherent authority, technical visibility, escalation capability, and meaningful intervention power over opaque and externally mediated AI infrastructures. The paper extends institutional decoupling theory by introducing governance inversion as a structural condition in which governance expansion may actively undermine operational coherence rather than strengthen it. It concludes that the central risk in AI governance may not be the absence of governance structures, but the emergence of institutions that appear increasingly governed while progressively losing the capacity to govern effectively.
[HC-31] voxmap-studio: An open-source speaker diarization annotation tool with built-in cost instrumentation
Link: https://arxiv.org/abs/2606.26842
Authors: Fumiaki Yamaguchi
Categories: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments: 3 pages, 2 figures
Abstract:Labeling speaker diarization data is costly, yet annotation tools rarely measure that cost. We present voxmap-studio, an open-source, React-based diarization annotation tool integrated with the pyannote-based diarization ecosystem. Its canvas is initialized by a fast stride-accelerated diarization engine so that the annotator corrects a hypothesis rather than drawing every speaker turn by hand, and the tool records annotation cost - typed edit-operation counts and time - as a first-class output, enabling quantitative comparison of how much different forms of assistance actually help. Export is gated on per-segment human confirmation and guarded by injected “phantom” attention checks, which prevent unverified automatic output from being released as ground truth. In a preliminary study on nine AMI audio files, unassisted manual annotation was the costliest and least accurate, and automatic initialization shifted the work from creating turns to correcting them; highlighting uncertain segments gave the lowest cost in our small sample. The tool and its instrumentation are open source.
Computer Vision
[CV-0] Ask Solve Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards
Link: https://arxiv.org/abs/2606.27376
Authors: Ritesh Thawkar,Shravan Venkatraman,Omkar Thawakar,Abdelrahman Shaker,Fahad Khan,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Most unified large multimodal models (LMMs) that support both visual understanding and image generation still rely on curated post-training supervision, such as human annotations, preference labels, or external reward models. We ask whether a unified LMM can improve both abilities autonomously using only unlabeled images. We propose a self-evolving training framework with three internal roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. Training uses only self-derived consistency signals, without human annotations, preference labels, or task-trained external reward/judge models. To stabilize learning, we introduce Solver Token Entropy (STE), a continuous difficulty signal based on token-level prediction uncertainty that remains useful even when sample-level consistency becomes unreliable. For image generation, we design a multi-scale internal evaluation scheme that combines question-answer fidelity scoring with cycle-consistent captioning. This creates a solver-mediated coupling, where better visual understanding enables more reliable generation assessment and stronger internal training signals. The framework preserves the same role decomposition, reward logic, and training schedule across diffusion-based BLIP3o, rectified-flow BAGEL, and autoregressive VARGPT-v1.1 architectures, requiring only each backbone’s native prompting and generation interface. Across eight understanding metrics, our method consistently improves over the corresponding base models. On BAGEL, it achieves a +3.5% absolute gain on MMMU and improves GenEval image generation performance from 82% to 85%. Code and models are publicly released.
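The abstract above describes Solver Token Entropy only as a continuous difficulty signal based on token-level prediction uncertainty; it does not give a formula. One plausible reading, sketched below purely as an assumption, is the mean Shannon entropy of the Solver's next-token distributions over a generated answer. The function names are hypothetical and the paper's actual definition may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solver_token_entropy(token_distributions):
    """Mean per-token entropy over one generated answer.

    Assumed stand-in for STE: higher values mean the solver was less
    certain, which could be read as a harder question.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)
```

Under this reading, an answer decoded from one-hot distributions scores zero, while an answer decoded from uniform distributions over a vocabulary of size V scores log V, so the signal stays graded even when sampled answers happen to agree.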
[CV-1] World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays
Link: https://arxiv.org/abs/2606.27374
Authors: Manish Kumar Govind,Dominick Reilly,Smit Patel,Hieu Le,Srijan Das
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Going beyond predicting robot actions, World Action Models (WAMs) can also generate future visual observations. We build on this generative capability to propose Recurrent Generative Replay (REGEN), a continual imitation learning framework that synthesizes pseudo-replay trajectories, enabling a robot policy to rehearse previously learned tasks without storing their original human demonstrations. During continual adaptation, REGEN recursively queries the WAM to synthesize pseudo-replay trajectories conditioned only on prior task instructions and current-task observations. Experiments in both simulation and real-world manipulation settings show that REGEN reduces catastrophic forgetting by up to 50% relative to sequential fine-tuning, while approaching the performance of privileged experience replay methods that require access to real replay data. Finally, we analyze the factors limiting generated replay, identifying long-horizon visual degradation and action-observation inconsistency as the primary bottlenecks. Our results establish WAMs as a promising foundation for continual robot learning without stored demonstrations.
[CV-2] Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models ECCV2026
Link: https://arxiv.org/abs/2606.27373
Authors: Shravan Venkatraman,Ritesh Thawkar,Omkar Thawakar,Rao Muhammad Anwer,Hisham Cholakkal,Salman Khan,Fahad Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026
Abstract:Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model’s visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 Chair-I points, and generalizes across four model families and scales. Our code and models are available at this https URL
[CV-3] DnA: Denoising Attention for Visual Tasks
Link: https://arxiv.org/abs/2606.27372
Authors: Ron Campos,Subhajit Maity,Xin Li,Srijan Das,Aritra Dutta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The softmax activation in multihead attention (MHA) is the de facto standard for attention-based models in visual perception tasks. However, standard softmax can produce noisy attention patterns that dilute relevant features and degrade its performance. In this paper, we propose Denoising Attention or DnA, in which, first, a positive query identifies which image features belong to the correct class, and a negative query identifies closely associated but irrelevant image features. DnA then projects these interactions into two distinct subspaces with larger principal angles, promoting subspace separation and improved discriminability. Using a ViT-B backbone, our proposed DnA achieves an absolute gain of 0.8% on ImageNet-1K compared to the baseline. We further show improvements across multiple visual understanding tasks, including video understanding with video transformers (1.8%) and video LLMs (0.5%). Our extensive empirical analyses justify the design choices involving two interacting subspaces and the denoising effect of DnA.
[CV-4] Don’t Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance ECCV2026
Link: https://arxiv.org/abs/2606.27371
Authors: Pradhaan S Bhat,Rishubh Parihar,Abhijnya Bhat,R. Venkatesh Babu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project page: this https URL
Abstract:State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning. Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead. In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions. Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.
[CV-5] PhysiFormer: Learning to Simulate Mechanics in World Space
Link: https://arxiv.org/abs/2606.27364
Authors: Yiming Chen,Yushi Lan,Andrea Vedaldi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as object material type, rigid or elastic, the model samples future vertex trajectories. While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, enabling diverse plausible futures from initial conditions, making this framework potentially useful for applications with unobserved uncertainty. The model features attention factorised over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without needing explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design. Visualisations, code, and models are available at this https URL.
[CV-6] Error-Conditioned Neural Solvers
Link: https://arxiv.org/abs/2606.27354
Authors: Haina Jiang,Liam Wang,Peng-Chen Chen,Min Seop Kwak,Seungryong Kim,Brian Bell,Jeong Joon Park
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:
Abstract:Neural surrogate models offer fast approximate mappings from PDE parameters to solutions, but they typically treat solving as a purely statistical task: once trained, they struggle to correct their own constraint violations and extrapolate beyond the training distribution. Recent hybrid methods promote physical correctness by targeting the PDE residual via gradient descent or Gauss-Newton steps, but inherit the compute cost and instability of the underlying classical optimizers. We show, theoretically and empirically, that numerically minimizing the PDE residual can be an unreliable proxy for reconstruction accuracy in ill-conditioned systems, explaining why these methods often do not make accurate predictions despite achieving low residuals. We propose error-conditioned Neural Solvers (ENS), built on a different principle: rather than an optimization target, the PDE residual field is passed as a direct input to the network at each iteration, enabling it to read the spatial structure of its own errors and learn an update policy to iteratively correct its predictions. Across four PDE families, ENS attains the highest prediction accuracy in the large majority of settings, with gains reaching 10× on turbulent Kolmogorov flow, while avoiding the expensive compute cost of hybrid methods. ENS’s learned correction policy generalizes under distribution shift, including zero-shot parameter changes and cross-equation transfer, where its relative advantage is largest in the ill-conditioned regimes where residual minimization is least reliable. Project website: this https URL.
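The claim that a small residual can coexist with a large solution error in ill-conditioned systems can be illustrated with a toy linear system. The sketch below is an analogy chosen for this digest, not an example from the paper: the 2x2 matrix, the candidate solution, and the helper names are all assumptions, and a PDE discretization would be far larger.

```python
def matvec(a, x):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def norm(v):
    """Euclidean norm of a vector."""
    return sum(c * c for c in v) ** 0.5


# Nearly singular (ill-conditioned) system A x = b with known solution.
A = [[1.0, 1.0], [1.0, 1.0001]]
x_true = [1.0, 1.0]
b = matvec(A, x_true)

# A candidate that is far from x_true yet almost satisfies the equations.
x_guess = [2.0, 0.0]

residual = norm([bi - yi for bi, yi in zip(b, matvec(A, x_guess))])
error = norm([t - g for t, g in zip(x_true, x_guess)])

print(f"residual norm: {residual:.2e}")
print(f"solution error: {error:.3f}")
```

Here the residual norm is on the order of 1e-4 while the solution error is about 1.41, so an optimizer that only watches the residual would consider the wrong answer nearly converged.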
[CV-7] RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation
Link: https://arxiv.org/abs/2606.27345
Authors: Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao, Shan Ying, Kai Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes – a description of the camera’s sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plücker reciprocal product, which is bilinear in the two rays – the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plücker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms – all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.
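The algebra behind this analogy can be checked numerically. The sketch below is not the paper's module: it assumes unprojected 6D geometry vectors, made-up 6D content vectors, and hypothetical helper names (`plucker`, `reciprocal`, `flip`). It shows that swapping the direction and moment halves on the key side turns a plain dot product into the Plücker reciprocal product, and that additive injection yields four score terms.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def plucker(origin, direction):
    """6D Plucker coordinates (d, m) with moment m = origin x direction."""
    return direction + cross(origin, direction)

def reciprocal(r1, r2):
    """Reciprocal product d1.m2 + d2.m1; zero when the two rays are coplanar."""
    d1, m1 = r1[:3], r1[3:]
    d2, m2 = r2[:3], r2[3:]
    return dot(d1, m2) + dot(d2, m1)

def flip(r):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return r[3:] + r[:3]

ray_a = plucker([0.0, 0.0, 0.0], [1.0, 0.0, 0.0])   # x-axis
ray_b = plucker([1.0, -1.0, 0.0], [0.0, 1.0, 0.0])  # meets ray_a at (1, 0, 0)
ray_c = plucker([0.0, 0.0, 1.0], [0.0, 1.0, 0.0])   # skew to ray_a

# Query keeps (d, m); key uses the flipped (m, d), so a dot product
# reproduces the reciprocal product.
geometry_score = dot(ray_a, flip(ray_c))

# Additive injection: (q_c + g_q).(k_c + g_k) splits into four terms.
q_content = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0]   # made-up content vectors
k_content = [0.0, 1.0, 4.0, 2.0, -2.0, 1.0]
g_q, g_k = ray_a, flip(ray_c)
score = dot(add(q_content, g_q), add(k_content, g_k))
terms = [dot(q_content, k_content),  # content term
         dot(g_q, g_k),              # geometry term
         dot(q_content, g_k),        # content-geometry cross-term
         dot(g_q, k_content)]        # geometry-content cross-term
```

Here `reciprocal(ray_a, ray_b)` vanishes because the two rays intersect, while the skew pair gives a nonzero value that matches `geometry_score`.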
[CV-8] SAM2Matting: Generalized Image and Video Matting ECCV2026
Link: https://arxiv.org/abs/2606.27339
Authors: Ruiqi Shen, Guangquan Jie, Chang Liu, Henghui Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. Extended version. Project Page: this https URL
Abstract:Despite impressive advances in image matting, video matting remains challenging due to the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt this with expensive and narrowly-scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the paradigm with SAM2Matting, a tracker-to-matting framework that advances VOS trackers to high-fidelity video matting. Specifically, it decouples the task by enhancing a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, enabling the uncompromised tracker to handle temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and demonstrates robust generalization across both human-centric and in-the-wild scenarios.
[CV-9] RoPEMover: Depth-Aware Object Relocation via Positional Embeddings
Link: https://arxiv.org/abs/2606.27332
Authors: Ipek Oztas, Duygu Ceylan, Aybars Bugra Aksoy, Aysegul Dundar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Moving an object in a single image requires geometry-consistent spatial rearrangement, including handling occlusions, revealing previously unseen regions, and maintaining coherent shadows and reflections. Existing approaches are not well suited to this setting and often fail to preserve such scene-level consistency. We address this problem by introducing a geometry-aware object motion method that operates directly on the positional representations of diffusion transformers. Our key insight is that rotary positional embeddings (RoPE) define a structured spatial field that can be explicitly manipulated to induce controlled motion. We extend 2D RoPE into a depth-aware formulation that encodes 3D spatial structure, enabling consistent object displacement and scene-aware updates. Our model is trained using synthetic data combined with a small set of real images via parameter-efficient fine-tuning. Despite minimal real supervision, it preserves object identity under large spatial displacements, generates plausible content in newly revealed regions, and consistently updates scene-dependent effects such as shadows and illumination. Experimental results on standard object motion benchmarks demonstrate state-of-the-art performance across all evaluation metrics.
[CV-10] Hallucination in World Models is Predictable and Preventable
Link: https://arxiv.org/abs/2606.27326
Authors: Nicklas Hansen, Xiaolong Wang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Interactive paper, live demo, code, dataset, and models: this https URL
Abstract:Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while drifting from the ground-truth dynamics. We hypothesize that hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect it and guide mitigation. To test this, we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it. We identify three distinct hallucination modes: perceptual, action-marginalized, and scene-diverging – each anchored to a different stage of the pipeline, and develop three signals that accurately predict where the model will fail. To close coverage gaps at training time, we develop a coverage-aware sampling technique; to close them online, our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories. Overall, our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation. An interactive web version of our paper is available at this https URL.
[CV-11] Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model
Link: https://arxiv.org/abs/2606.27325
Authors: Zizhao Yuan, Zhengtu Liang, Taowen Wang, Qiwei Liang, Yichi Wang, Yunheng Wang, Yuetong Fang, Lusong Li, Zecui Zeng, Renjing Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When uniformly aggregated, optimization exhibits an imbalance across action components, which hinders the modeling of fine-grained effects and affects action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.
[CV-12] OctoSense: Self-Supervised Learning for Multimodal Robot Perception
Link: https://arxiv.org/abs/2606.27317
Authors: Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Randall Balestriero, Pratik Chaudhari
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a “late-fusion” masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: this https URL.
[CV-13] ViQ: Text-Aligned Visual Quantized Representations at Any Resolution ECCV2026
Link: https://arxiv.org/abs/2606.27313
Authors: Xumin Yu, Zuyan Liu, Zhenyu Yang, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework, which is designed to balance semantics and details in discrete representations while supporting inputs at native resolutions, thereby enabling it to serve as a unified and general discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. With text-aligned pre-training, we enhance the visual encoder with semantically rich supervision from the pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy to progressively compact the feature space, along with a position-aware head-wise quantization mechanism that enables flexible processing of arbitrary resolutions. Extensive experiments on multimodal tasks demonstrate that ViQ achieves competitive performance compared to state-of-the-art multimodal vision encoders with continuous and high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations substantially improves efficiency, yielding 20%-70% acceleration depending on the base LLM and training recipe.
[CV-14] See Sniff: Learning Visuo-Olfactory Representations ECCV2026
Link: https://arxiv.org/abs/2606.27307
Authors: Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. Project Page: this https URL
Abstract:While modern multimodal models integrate vision with language, audio, or touch, olfaction remains largely unexplored due to the lack of paired visuo-olfactory data. We introduce SmellNet-V, a scalable visuo-olfactory dataset built on the insight that odor identity is largely invariant to visual transformations within a semantic category. This allows us to synthetically pair smell-only samples with semantically aligned in-the-wild web images, converting a unimodal olfactory dataset into a cross-modal benchmark without costly co-collection. Building on this dataset, we propose See Sniff, a self-supervised framework that learns joint visuo-olfactory representations via dense local alignment and naturally produces smell saliency maps for spatial grounding of odor sources. We further introduce a pixel-level smell localization task and a benchmark for evaluation. Our method surpasses smell-only baselines by 7% in smell classification from smell alone and generalizes to cross-modal retrieval and smell localization, establishing visuo-olfactory learning as a new direction in multimodal perception.
[CV-15] Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN
Link: https://arxiv.org/abs/2606.27305
Authors: Archer Moore, Mingming Gong, Liam Hodgkinson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density (σ) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.
[CV-16] Exact and Deterministic Patch Descriptor Retrieval via Hierarchical Normalization
Link: https://arxiv.org/abs/2606.27280
Authors: Koichi Sato
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures
Abstract:We present a patch descriptor retrieval method that returns the exact nearest neighbour – provably identical to exhaustive full-vector search – while evaluating only a small fraction of the database, and does so deterministically: the same (database, query) pair always produces the same result, independent of run order, thread count, or hardware. This contrasts with approximate nearest-neighbour (ANN) approaches such as HNSW and IVF-PQ, which trade exactness for speed and may return different results across runs. The enabling mechanism is Hierarchical Normalization (HN): a normalisation scheme that splits the pre-normalisation feature vector into a K-dim major component (norm sqrt(1-alpha)) and a (128-K)-dim minor component (norm sqrt(alpha)). Since the minor inner product is bounded by alpha (Cauchy-Schwarz on the prescribed norms), the major similarity plus alpha is an admissible upper bound on the full similarity: the search scans the K-dim major component for all entries, then applies full 128-dim evaluation only to entries that cannot be pruned – a provably exact branch-and-bound scan. We train HN-modified HardNet on the notredame split of the UBC patch dataset and evaluate on trevi and halfdome. With a cache-optimised Structure-of-Arrays layout and K=8, alpha=1/32, the search achieves 13.7x (trevi) / 12.7x (halfdome) speed-up over brute-force 128-dim search, with only 0.4% of entries requiring full evaluation. At K=16, alpha=1/8, FPR@95 rises from 0.0062 to 0.0064 on trevi at 7.2x speed-up, with 98.8% of entries bypassing full evaluation.
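The pruning bound described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the sizes `K`, `DIM`, `ALPHA`, the random data, the single-pass scan order, and the small `EPS` slack for floating-point rounding are all choices made for this example, and `hn_normalize` and `hn_search` are hypothetical names.

```python
import math
import random

K, DIM, ALPHA = 4, 16, 1.0 / 8.0   # illustrative sizes, not the paper's settings
EPS = 1e-12                         # slack so rounding can never cause a wrong prune

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hn_normalize(v):
    """Scale the major part to norm sqrt(1-ALPHA) and the minor part to sqrt(ALPHA)."""
    major, minor = v[:K], v[K:]
    s_major = math.sqrt(1.0 - ALPHA) / math.sqrt(dot(major, major))
    s_minor = math.sqrt(ALPHA) / math.sqrt(dot(minor, minor))
    return [x * s_major for x in major] + [x * s_minor for x in minor]

def brute_force(db, q):
    """Reference: full-vector scan, first index wins ties."""
    best_i, best_s = -1, -math.inf
    for i, v in enumerate(db):
        s = dot(v, q)
        if s > best_s:
            best_i, best_s = i, s
    return best_i

def hn_search(db, q):
    """Scan the K-dim major part; run the full dot product only when the
    upper bound (major similarity + ALPHA) could still beat the current best."""
    best_i, best_s, full_evals = -1, -math.inf, 0
    for i, v in enumerate(db):
        upper = dot(v[:K], q[:K]) + ALPHA   # Cauchy-Schwarz bound on the minor part
        if upper + EPS <= best_s:
            continue                         # provably cannot improve the best
        full_evals += 1
        s = dot(v, q)
        if s > best_s:
            best_i, best_s = i, s
    return best_i, full_evals

random.seed(0)
db = [hn_normalize([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(500)]
queries = [hn_normalize([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(20)]

for q in queries:
    idx, full_evals = hn_search(db, q)
    print(idx == brute_force(db, q), full_evals, "of", len(db), "fully evaluated")
```

Because an entry is skipped only when its upper bound cannot exceed the best similarity found so far, the returned index matches the brute-force scan, and the fixed scan order keeps the result independent of anything but the database and the query.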
[CV-17] EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting
Link: https://arxiv.org/abs/2606.27277
Authors: Junwei Luo, Shuai Yuan, Zhenya Yang, Yansheng Li, Zhe Liu, Hengshuang Zhao
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 5 figures, 11 tables
Abstract:Earth Observation (EO) forecasting aims to predict future Earth surface dynamics from satellite observations under changing meteorological conditions. In this paper, we view this task as a partially observed, weather-driven world modeling problem, in which weather acts as a conditioning signal, while forecasting remains uncertain due to sparse observations and unobserved land-surface states. However, existing methods do not fully capture this setting: deterministic models collapse uncertainty into a single future prediction, while diffusion-based methods typically treat weather variables as undifferentiated conditioning signals, and existing benchmarks focus mainly on reconstruction accuracy rather than whether forecasts respond correctly to changed weather. We introduce EO-WM, a video diffusion transformer for multispectral EO forecasting. EO-WM incorporates a physically informed conditioning framework that represents meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. Specifically, it separates baseline and anomaly through distinct conditioning pathways, and accumulates anomalous forcing over time to capture sustained heat and drought stress. To evaluate weather-response behavior beyond standard metrics, we introduce two diagnostic benchmarks: an Extreme Summer Benchmark for severity-aware prediction of vegetation degradation under extreme weather, and a Seasonal Matched-Pair Benchmark for testing response fidelity under changed weather forcing. Experiments show that EO-WM reduces the error in predicted Normalized Difference Vegetation Index (NDVI) decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80%, while remaining competitive on standard pixel-level metrics. The benchmarks and model will be made open-source at this https URL.
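One way to picture the baseline / anomaly / cumulative-stress split is the toy decomposition below. It is a sketch under assumptions: the abstract does not give the exact formulas, so the subtraction-based anomaly, the running sum of positive anomalies as a heat-stress signal, the function name `decompose`, and the temperature values are all illustrative.

```python
from itertools import accumulate

def decompose(observed, baseline):
    """Split a weather series into anomaly and accumulated heat stress.

    anomaly[t]     = observed[t] - baseline[t]
    heat_stress[t] = running sum of the positive anomalies up to time t
    (assumed definitions for illustration only)
    """
    anomaly = [o - b for o, b in zip(observed, baseline)]
    heat_stress = list(accumulate(max(a, 0.0) for a in anomaly))
    return anomaly, heat_stress

# Made-up daily temperatures (degrees C) and climatological baseline.
baseline = [20.0, 21.0, 22.0, 23.0]
observed = [22.0, 20.0, 26.0, 25.0]

anomaly, heat_stress = decompose(observed, baseline)
print(anomaly)      # [2.0, -1.0, 4.0, 2.0]
print(heat_stress)  # [2.0, 2.0, 6.0, 8.0]
```

The accumulated signal keeps growing across consecutive hot days even when a single day's anomaly is modest, which is the kind of sustained stress a per-step anomaly alone would not convey.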
[CV-18] CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
Link: https://arxiv.org/abs/2606.27264
Authors: Hashmat Shadab Malik, Anees Ur Rehman Hashmi, Numan Saeed, Muzammal Naseer, Salman Khan, Christoph Lippert
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain out of reach, as neither the structured supervision needed to train them nor the protocol needed to verify their reasoning yet exists. We introduce CORTEX (Clinically Organized Reasoning and sTructured EXplanation), a structured reasoning benchmark for 3D chest CT. For each question, CORTEX restores the missing reasoning as a four-stage diagnostic trace mirroring a radiologist’s workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. We generate these traces using frontier large language models with broad medical and general-domain knowledge, then filter and verify them with a stage-level evaluation protocol combining automated rubric scoring with expert radiologist review. Crucially, both the reasoning structure and evaluation rubrics are designed in close collaboration with clinicians. Built on CT-RATE, a large, publicly available chest CT dataset without reasoning annotations, CORTEX comprises 76,177 validated reasoning traces across open-ended VQA, closed-ended VQA, and report generation, providing both the structured supervision and the stage-level evaluation protocol needed to build and evaluate trustworthy reasoning models for 3D chest CT. Our dataset and evaluation code will be made publicly available upon acceptance.
[CV-19] SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting
Link: https://arxiv.org/abs/2606.27223
Authors: Jiyong Kim, Shuang Song, Ronjgun Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 15 figures
Abstract:Gaussian Splatting has been recently explored for satellite 3D reconstruction, demonstrating flexibility and efficiency in representing radiometrically diverse satellite scenes. However, the limited top viewpoint of satellite imagery results in insufficient supervision on building facades, leaving surface holes and degraded visual fidelity. Generative refinement, which leverages pretrained generative priors to iteratively refine and update the rendered images used as supervision targets, has recently been investigated to improve the visual fidelity of Gaussian-rendered images. However, since these models refine each view independently, the resulting images can generate hallucinations and break photo-consistency, leading to geometric degradation. To address these limitations, we propose SatSplatDiff, which aims to minimize geometric degradation prevalent in generative refinement. Building on photogrammetric DSM initialization and 2DGS-based shadow casting established in our prior work SatSplat, we first introduce monocular depth supervision and multi-scale geometric refinement to establish a geometrically accurate and well-regularized surface representation. We then apply shadow-guided generative refinement, where geometrically calculated shadow maps guide the Gaussians to maintain consistency with the underlying geometry, improving visual fidelity while reducing geometric degradation. Extensive evaluations on the IARPA2016 and DFC2019 datasets demonstrate state-of-the-art performance, reducing geometric MAE by up to 18% and improving visual fidelity (FID-CLIP) by 28-45% over existing baselines. Our method delivers up to 5x resolution enhancement with minimal hallucination and sensor-consistent appearance, demonstrating seamless cross-tile consistency and strong scalability for large-scale reconstruction. Source code is available at this https URL
[CV-20] LISA: Likelihood Score Alignment for Visual-condition Controllable Generation
Link: https://arxiv.org/abs/2606.27192
Authors: Yanghao Wang, Hongxu Chen, Jiazhen Liu, Zhenqi He, Rui Liu, Zhen Wang, Long Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features into a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the side branch and its training efficiency remain underexplored. In this paper, we first revisit this mainstream paradigm through the lens of score-based generative modeling: 1) The main network preserves visual perceptual quality by providing a prior unconditional score. 2) The side network steers conditional control by implicitly contributing a likelihood score. Guided by this perspective, we propose LIkelihood Score Alignment (LISA), an effective regularization method that explicitly aligns the intermediate feature of the side network with an approximated likelihood score. Specifically, we first hook features from a designated layer of the side network and project them into the score latent space by a lightweight decoder. Then, we construct an approximated likelihood score target and calculate the distance between the decoder’s output and this target as an additional regularization loss. Finally, we jointly optimize the side network and decoder with both standard diffusion loss and our regularization loss. Experiments across various image/video tasks, architectures, and diffusion/flow models demonstrate that LISA not only consistently accelerates training convergence and improves final synthesis results, but also encourages the side network’s features to be more disentangled for conditional modeling, with negligible additional training cost and zero extra inference cost.
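The score-based reading in points 1) and 2) follows from Bayes' rule, since the marginal of the condition does not depend on the noisy sample. The second line is only a schematic of the regularized objective: the symbols $D_\phi$ (lightweight decoder), $h$ (hooked side-network feature), $\hat{s}$ (approximated likelihood-score target), the weight $\lambda$, and the squared L2 distance are assumed notation, as the abstract does not specify them.

```latex
% Bayes' rule in score form: conditional score = prior score + likelihood score
\nabla_{x_t} \log p_t(x_t \mid c)
  = \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{main network (prior)}}
  + \underbrace{\nabla_{x_t} \log p_t(c \mid x_t)}_{\text{side network (likelihood)}}

% Schematic joint objective (assumed notation and distance)
\mathcal{L} = \mathcal{L}_{\text{diffusion}}
  + \lambda \,\bigl\lVert D_\phi(h) - \hat{s} \bigr\rVert_2^2
```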
[CV-21] Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks ICML2026
Link: https://arxiv.org/abs/2606.27147
Authors: Yunqi Xue, Zhijiang Li, Philip Torr, Jindong Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages including references, 8 figures, accepted for publication at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Unlike diffusion-based models that operate in continuous latent spaces, autoregressive unified multimodal models produce images by sequentially predicting discretized visual tokens. These tokens are derived from a codebook that maps embeddings to quantized visual patterns. The language-like architecture enables unified multimodal models to effectively capture text conditional information for generation, making them promising for text-to-image tasks. This also raises an interesting question: how safe are the images generated in such an autoregressive way? In this work, we propose iterative self-improving codebooks for safe autoregressive generation. We leverage the understanding and judgment capabilities of the unified multimodal model itself to identify unsafe generated images without human annotation. Subsequently, the inherent representations in the codebook are fixed to eliminate harmful mappings. Our method comprises two steps: first, we use the unified model to identify unsafe generations and construct corresponding harmful and safe image-text pairs. These pairs are used to construct the Harmful Space and guide updates to the codebook, thereby eliminating harmful outputs. Second, we perform adaptive fine-tuning on the codebook within the harmless space using safe image-text pairs to ensure the quality of generated images. These two steps are repeated until no further improvement is observed, producing a safety-enhanced model codebook. Without additional external feedback, the safety of models is improved iteratively.
[CV-22] FlameVQA: A Physically-Grounded UAV Wildfire VQA Benchmark with Radiometric Thermal Supervision
Link: https://arxiv.org/abs/2606.27128
Authors: Mobin Habibpour, John Spodnik, Niloufar Alipour Talemi, Fatemeh Afghah
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Wildfire monitoring from UAVs requires reliable reasoning over complex aerial scenes, where smoke, scale variation, and occlusions often limit RGB-only interpretation. We introduce FlameVQA, a multiple-choice visual question answering benchmark for UAV-based wildfire intelligence built on FLAME 3, leveraging paired RGB imagery and radiometric thermal TIFFs for temperature-grounded, safety-critical reasoning. FlameVQA includes 34 multiple-choice questions per image spanning six operational capability groups, covering tasks such as detection, localization, distribution/coverage estimation, cross-modal reasoning, and flight planning. To ensure label reliability, we combine MLLM-assisted annotation with deterministic thermal rules and cross-question consistency checks, followed by human auditing. We also evaluate representative MLLMs on FlameVQA to provide baselines for future work. Results show strong performance when explicit cross-modal cues are available, but notable failures on presence detection under heavy smoke and on coverage estimation. These findings suggest that current MLLMs require domain-specific adaptation to better support disaster and wildfire monitoring. The dataset and benchmark code are open-source at this http URL
[CV-23] Proposal-Conditioned Latent Diffusion for Closed-Loop Traffic Scenario Generation ITSC
Link: https://arxiv.org/abs/2606.27123
Authors: Shubham Vaijanath Phoolari, Aleyna Kara, Christoph Lauer, Steven Peters
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the IEEE International Conference on Intelligent Transportation Systems (ITSC), 2026
Abstract:Closed-loop traffic simulation remains challenging because it must generate interactive multi-agent behaviors that are scene-consistent and controllable throughout rollout. Prior diffusion-based approaches achieve strong realism, but their computational cost can hinder deployment in time-constrained replanning loops for autonomous vehicle planning and simulation. We present a diffusion-based scenario generation framework conditioned on instance-centric scene context and multimodal proposal priors, with optional test-time guidance for shaping safety-critical behaviors. A compact action-latent representation and proposal-based initialization improve sampling efficiency and reduce per-step runtime without retraining. Experiments on the Waymo Open Motion Dataset demonstrate a favorable balance among realism, safety, and controllability across diverse interactive scenarios, while showing that test-time guidance enables systematic trade-offs among competing objectives.
[CV-24] TMP: Tree-structured Mixed-policy Pruning for Large-scale Image Generation and Editing
Link: https://arxiv.org/abs/2606.27089
Authors: Peizhen Zhang, Yang Li, Xunsong Li, Songtao Liu, Zewen Liu, Qiangqiang Hu, Guotong Guo, Jupeng Ding, Yifu Sun, coopersli, Jian Zhang, Zhao Zhong, Liefeng Bo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 3 tables, tech report
Abstract:Modern image generation models are rapidly growing in size to achieve high-fidelity image synthesis. However, they are becoming unaffordable: their enormous parameter counts and computation budgets lead to massive resource requirements and a large GPU memory footprint. In this paper, we propose TMP, the first Tree-structured Mixed-policy Pruning framework that generalizes across prevalent image tasks (T2I and TI2I) and architectures (Mixture-of-Experts (MoE) and Diffusion Transformer (DiT)). It can be applied to step-distilled models as a final compression stage. We perform experiments on the current open-source SOTA HunyuanImage-3.0 instruct and the popular efficient model Z-Image turbo. The proposed pruning framework compresses HunyuanImage 3.0 from 80B to 20B parameters (a 75% reduction) while sacrificing limited generation quality. We also enable inference of the pruned 20B version of HunyuanImage 3.0 on a single 24GB 4090 GPU through engineering optimizations. The inference script and model weights have been integrated into the existing HunyuanImage3.0 open-source GitHub and Hugging Face repositories. In addition, we demonstrate the efficacy of TMP by compressing Z-Image turbo from 6B to 4B (33% reduction) with negligible degradation.
[CV-25] SubdivAR: Autoregressive Next-Scale Prediction for Neural Mesh Subdivision
链接: https://arxiv.org/abs/2606.27088
作者: Huipeng Guo,Zikai Song,Hang Long,Jielei Zhang,Wenbing Li,Junkai Lin,Tianhao Zhao,Jinshen Zhang,Tianle Guo,Wei Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mesh subdivision is a fundamental operation for converting coarse, editable meshes into high-resolution surfaces, with broad applications in digital asset creation. Classical rule-based schemes rely on fixed local refinement rules and often produce over-smoothed surfaces. Recent neural subdivision methods improve detail synthesis, but remain constrained by local modeling and exhibit limited generalizability. We present SubdivAR, a neural mesh subdivision framework based on our proposed Mesh Autoregressive Representation (MAR). MAR arranges meshes at different subdivision levels into an ordered scale sequence, reformulating subdivision as autoregressive next-scale prediction. To support this formulation, we introduce a Hybrid Topology-Aware Transformer that combines global semantic attention with topology-constrained local feature aggregation. SubdivAR adopts a next-scale coordinate prediction paradigm, regressing vertex offsets at each refinement stage to preserve subdivision topology while recovering fine-grained geometric details. To enable reliable learning, we construct FII-40K, a curated dataset of nearly 40,000 high-quality meshes with multi-level subdivision supervision. Experiments show that SubdivAR outperforms state-of-the-art baselines, reducing Hausdorff Distance and Chamfer Distance by 18.8% and 14.2%, respectively, and demonstrates strong robustness on complex open-surface geometries.
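For readers unfamiliar with the two reported metrics, here is a minimal sketch of Chamfer and Hausdorff distance on small point sets. It assumes the unsquared, symmetric variants computed by brute force. The abstract does not say which variant or sampling scheme the authors use, so the function names and conventions here are illustrative only.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _nearest(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Sum of the mean nearest-neighbour distances in both directions."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Worst-case nearest-neighbour distance over both directions."""
    a_to_b = max(_nearest(p, b) for p in a)
    b_to_a = max(_nearest(q, a) for q in b)
    return max(a_to_b, b_to_a)


# Two toy vertex sets that differ in one point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

cd = chamfer_distance(pred, ref)
hd = hausdorff_distance(pred, ref)
print(cd, hd)
```

Chamfer distance averages errors over all points, while Hausdorff distance reports the single worst deviation, which is why papers often quote both.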
[CV-26] Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT
链接: https://arxiv.org/abs/2606.27084
作者: Siqi Chen,Han Gong,Keyi Hou,Jingxuan Yang,Sheethal Bhat,Andreas Maier
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 24 pages, 17 figures
Abstract:Reliable organ localization in abdominal CT can provide spatial priors for downstream trauma analysis. We propose CT-3GDINO, a lightweight 3D detector that adapts a Grounding-DINO-style query-based architecture to fixed organ localization using frozen pseudo-text class tokens instead of a real text encoder. The model combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for liver, spleen, left kidney, right kidney, and bowel. We train and evaluate on 193 matched RSNA/RATIC CT volumes with segmentation-derived boxes. The best multi-scale model, trained from scratch, achieves 0.5830 overall top-1 class-wise mAP over 3D IoU thresholds from 0.1 to 0.7, outperforming fixed- and trainable-backbone classification-pretrained variants with 0.5570 and 0.4657 mAP. Performance is strong for coarse localization, with 0.9649 AP at IoU 0.1, but remains limited for strict box alignment, with 0.1552 AP at IoU 0.7. These results establish CT-3GDINO as an open-source baseline for pseudo-text-conditioned 3D organ localization and motivate future work on localization-aware pretraining, richer multimodal conditioning, and injury-focused detection.
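The gap between AP at IoU 0.1 and AP at IoU 0.7 is easier to picture with a concrete 3D IoU computation. The sketch below assumes axis-aligned boxes in corner format `(x1, y1, z1, x2, y2, z2)` with normalized coordinates. The box format and the function name `iou_3d` are assumptions for illustration, not details from the paper.

```python
from typing import Sequence


def iou_3d(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        intersection *= hi - lo

    def volume(box: Sequence[float]) -> float:
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


# A prediction shifted by half the box width along x.
gt = (0.0, 0.0, 0.0, 0.5, 0.5, 0.5)
pred = (0.25, 0.0, 0.0, 0.75, 0.5, 0.5)

score = iou_3d(pred, gt)
hits = {threshold: score >= threshold for threshold in (0.1, 0.7)}
print(round(score, 3), hits)
```

Here the shifted box reaches an IoU of about 0.333, so it counts as a hit at threshold 0.1 but a miss at 0.7. This is the coarse-versus-strict distinction the abstract describes.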
[CV-27] PanoImager: Geometry-Guided Novel View Synthesis and Reconstruction from Sparse Panoramic Views IROS2026
链接: https://arxiv.org/abs/2606.27071
作者: Zhisong Xu,Takeshi Oishi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2026
Abstract:Panoramic sensing offers wide field-of-view coverage, yet 3D reconstruction from sparse panoramas remains challenging under rotation-dominant, weak-parallax motion. In such regimes, SfM/SLAM initialization is often ill-conditioned and unreliable. We present PanoImager, an SfM-free framework that combines feed-forward pose/depth priors, geometry-conditioned diffusion view completion, and depth-guided 3DGS optimization. Given only a few panoramic images, PanoImager decomposes them into local perspective views, synthesizes auxiliary observations to enrich sparse evidence, and stabilizes Gaussian optimization for improved cross-view consistency. Experiments on multiple benchmarks show improved stability under extreme sparsity, suggesting PanoImager as an offline/background component for map refinement when SfM/SLAM fails to initialize.
[CV-28] On-board Remote-Sensing Foundation Models for Unsupervised Change Detection of Disaster Events
链接: https://arxiv.org/abs/2606.27018
作者: S. Ramírez-Gallego
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote Sensing Foundation Models (RSFMs) have emerged as a powerful alternative to supervised models for Earth Observation, allowing satellites to autonomously trigger high-resolution captures or adjust tasking parameters upon detecting an anomaly, thereby maximizing the utility of the mission’s limited power and computational resources. RSFMs are versatile, unified encoders that optimize onboard storage for multiple orbital applications while ensuring high-fidelity feature extraction. In particular, unsupervised change detection with RSFMs offers a well-informed and transformative path for disaster monitoring without expensive labels. In this paper, we present a novel unsupervised detection method based on ResNet (RSFM) + FPN which identifies a wide spectrum of anomalies by detecting subtle semantic shifts in the latent space between successive orbital passes. By relying on an untrained FPN architecture and its intrinsic priors, the system achieves efficient image-level generation and higher resolution mapping with minimal effort (training-free) compared to previous proposals (patch-based, trained). And by replacing tailored models with RSFMs, we can achieve comparable results through an approach that eliminates the need for bespoke training and extensive development effort and adds customization, while ensuring high-performance generalization across diverse terrains and sensors.
[CV-29] Event-Aware Instructed Assistant for Referring Video Segmentation
链接: https://arxiv.org/abs/2606.26994
作者: Jinyu Liu,Henghui Ding,Shuting He,Yu-Gang Jiang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE Transactions on Image Processing
Abstract:Existing referring video segmentation methods often treat a video as a single event consisting of multiple images, overlooking the fact that a video typically contains multiple distinct events. Under such a mechanism, the model needs to directly understand all the complex content in the video and text, which can easily lead to confusion and hallucinations. To address this issue, we propose to decompose a video into a set of simple events using learnable Event Queries, and to understand complex video content in an event-by-event, easy-to-understand manner. This is based on the observation that natural language expressions often divide a video into distinct, text-related segments, each representing a separate event within a compound event. We introduce EVIS, an Event-Aware Video Instructed Segmentation Assistant, which utilizes text-guided Event Queries to partition a video into simple events, extracting event-aware visual-text features to achieve a hierarchical understanding of the video. Additionally, we propose Object-Pixel-Hybrid Learning, which enables MLLMs to track targets in long-term videos by integrating fine-grained pixel features with prior object queries. Extensive experimental results on 5 public benchmarks demonstrate EVIS’s strong performance in addressing the referring video segmentation task.
[CV-30] Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation ICML2026
链接: https://arxiv.org/abs/2606.26984
作者: Jinyu Liu,Xincheng Shuai,Henghui Ding,Yu-Gang Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Unified multimodal models capable of both understanding and generation have achieved remarkable strides. However, despite their unified designs, existing evaluations typically assess understanding and generation capabilities in isolation, overlooking the synergy between comprehension and generation. To bridge this gap, we introduce Unison, a comprehensive benchmark comprising 2,169 high-quality unified task samples, designed to evaluate joint understanding and generation in unified multimodal models. Unison offers three key strengths: 1) Comprehensive Dimensions: Unison encompasses internal consistency, understanding-guided generation, generation-guided understanding, and mutual enhancement to enable holistic evaluation. 2) Diagnostic Evaluation: it provides both unified and decoupled tracks for understanding and generation, allowing fine-grained attribution of failure modes and quantitative analysis of the gains from unified modeling. 3) Human Alignment: we also introduce Unison-Judge, an evaluation model well aligned with human judgments to ensure reliable assessment. Based on systematic evaluations of state-of-the-art models on Unison, we uncover critical limitations in current unified multimodal systems and highlight promising directions for future research. Codes, Unison and Unison-Judge are publicly available at this https URL.
[CV-31] Geometric Gradient Rectification for Safe Open-Set Semi-Supervised Learning ECCV2026
链接: https://arxiv.org/abs/2606.26973
作者: Jiahe Chen,Qian Shao,Qiyuan Chen,Jiaying He,Jintai Chen,Jian Wu,Hongxia Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ECCV 2026
Abstract:Open-set semi-supervised learning aims to leverage unlabeled data that may contain out-of-distribution outliers while maintaining performance on in-distribution classes. Existing methods mainly follow two paradigms: filtering suspicious samples or incorporating unlabeled objectives with soft weighting. We argue that both face a common trade-off: aggressive filtering can discard informative but hard ID samples, whereas utilization can introduce auxiliary gradients that conflict with supervised learning when pseudo labels are wrong. We therefore shift the focus from sample selection to gradient-level control. We propose Geometric Gradient Rectification (GGR), a plug-in framework that uses the supervised gradient as an anchor and projects conflicting auxiliary gradients onto an admissible region in gradient space. This makes the applied auxiliary update first-order non-opposing within the rectified coordinate block while preserving orthogonal components that may still carry useful representation signals. We further extend GGR with subspace-aware rectification to stabilize the anchor under noisy mini-batch gradients. Experiments on CIFAR and ImageNet benchmarks show that GGR improves representative OSSL baselines in most settings and yields gains in both closed-set generalization and open-set robustness. Code will be available at this https URL.
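The idea of projecting a conflicting auxiliary gradient so that it no longer opposes the supervised one can be sketched with a simple half-space projection, as used in PCGrad-style gradient surgery. This is a stand-in under stated assumptions, not the authors' GGR. The abstract does not define the admissible region, the rectified coordinate block, or the subspace-aware variant, and the function name `rectify` is hypothetical.

```python
from typing import List, Sequence


def rectify(aux: Sequence[float], sup: Sequence[float]) -> List[float]:
    """Project aux onto the half-space {g : <g, sup> >= 0}.

    If aux already agrees with sup to first order, it is returned unchanged.
    Otherwise only the component of aux along sup is removed, so the part
    orthogonal to sup is preserved.
    """
    dot = sum(a * s for a, s in zip(aux, sup))
    norm_sq = sum(s * s for s in sup)
    if dot >= 0 or norm_sq == 0:
        return list(aux)
    scale = dot / norm_sq
    return [a - scale * s for a, s in zip(aux, sup)]


supervised = [1.0, 0.0]   # anchor gradient
auxiliary = [-1.0, 2.0]   # conflicts with the anchor along the first axis

rectified = rectify(auxiliary, supervised)
print(rectified)
```

After rectification the auxiliary update has zero inner product with the supervised gradient, so it no longer opposes it, while the orthogonal component (the second coordinate) is untouched.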
[CV-32] Computer Vision for MOBA Analytics: A Dataset and Baseline for Visibility Analysis in Dota 2
链接: https://arxiv.org/abs/2606.26970
作者: Ricardo da Rocha Carvalho,Eloísa Oliveira,Luiz Bernardo Martins Kummer,Emerson Cabrera Paraiso,Rayson Laroca
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 2026 Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames)
Abstract:Introduction: Most Multiplayer Online Battle Arena (MOBA) analytics studies rely on structured data, which does not directly capture what each team could actually see during a match. Objective: This work introduces Dota2-Vis, a video-based dataset, and a baseline pipeline for visibility analysis in professional Dota 2 matches. Methodology: The dataset comprises all 144 matches from The International 2025, recorded from both team perspectives, totaling 288 Full HD videos, together with 2,477 manually annotated minimap images. We evaluate multiple variants of a modern object detector for player-icon detection and use the best-performing model to estimate opponent-visible player presence over time. Results: YOLO11l (large) achieved the best overall performance, reliably identifying player icons even in dense and visually cluttered minimap scenes. The resulting visibility curves reveal player, hero, role, and team-level patterns that complement conventional MOBA analytics, highlighting behavioral differences that are difficult to obtain from structured data alone. The dataset and code are publicly available at this https URL.
[CV-33] Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
链接: https://arxiv.org/abs/2606.26964
作者: Jiaming Bian,Bingliang Li,Yuehao Wu,Pichao Wang,Zhi Wang,Hailan Ma,Huadong Mo,Zhenhong Sun
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 17 figures
Abstract:As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe. We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves. We formulate this capability as Narrative-Grounded World Visual Attention, where the camera acts as an embodied observer that determines what to observe, how to compose the observation, and how to shift attention over time under narrative intent and physical 3D constraints. To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution. It first builds a Semantic Observation Contract to convert directorial intent into executable visual constraints, then performs Monte Carlo Viewpoint Search to find narrative-compliant and geometrically feasible viewpoints, and finally applies Semantic Trajectory Grounding to connect selected viewpoints into continuous, collision-aware, and temporally coherent camera motion. We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments. Experiments show that our framework improves subject perception, intent consistency, and trajectory quality over representative baselines, demonstrating the importance of organizing visual attention before generating camera motion.
[CV-34] Scaling Multi-Reference Image Generation with Dynamic Reward Optimization ECCV2026
链接: https://arxiv.org/abs/2606.26947
作者: Wenwang Huang,Yusen Fu,Junjie Wang,Mengfei Huang,Yulin Li,Gan Liu,Jing Cai,Yancheng He,Zhuotao Tian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ECCV2026
Abstract:While personalized image generation has achieved remarkable progress, multi-reference image generation (MRIG) remains a challenging task. Most existing benchmarks fail to adequately evaluate complex MRIG scenarios, hindering further progress in this area. To better assess model performance on complex MRIG tasks, we introduce OmniRef-Bench, a benchmark that covers complex combinations of reference image types and a large number of reference images. Evaluations on OmniRef-Bench show that mainstream open-source models struggle in complex MRIG scenarios, and their performance deteriorates significantly as the number of mixed-type reference images increases. To address this issue, we propose DyRef, a two-stage training framework. In the first stage, supervised fine-tuning equips the model with the basic capability to handle complex MRIG tasks. In the second stage, we introduce Difficulty-aware Advantage Reweighting (DAR) and Discriminative Reward Scaling (DRS). DAR dynamically adjusts the optimization objective to improve performance when handling a large number of mixed-type reference images. DRS enlarges intra-group reward differences for more effective policy optimization. Experiments demonstrate that DyRef significantly improves the performance of open-source models on OmniRef-Bench and single-image editing benchmarks, demonstrating the effectiveness and generalization capability of our approach.
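For context on "intra-group reward differences": GRPO-style methods typically score each sample relative to the other samples generated for the same prompt. The sketch below shows only that baseline group normalization. The abstract does not specify how DAR or DRS modify it, so neither is implemented here, and the function name and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampling group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four images sampled from the same prompt (toy values).
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples above the group mean receive positive advantages and those below receive negative ones. When rewards within a group are nearly identical, the learning signal is dominated by noise, which is the situation a discriminative reward design aims to avoid.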
[CV-35] TraMP-LLaMA: Generative Interpretability with Decoupled Instruction Tuning for Facial Expression Quality Assessment
链接: https://arxiv.org/abs/2606.26942
作者: Shuchao Duan,Alan Whone,Hossein Rahmani,Jun Liu,Majid Mirmehdi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing facial expression quality assessment (FEQA) methods typically produce only a severity score, without explicitly communicating the observable facial motion evidence that supports the prediction. This limits interpretability and makes it difficult to inspect the basis of model outputs in Parkinson’s disease assessment. To address this gap, we propose TraMP-LLaMA, a unified multimodal framework that jointly predicts severity scores and generates structured textual reports from facial motion cues. The framework integrates RGB appearance and landmark trajectory cues, and adopts a decoupled instruction-tuning strategy to reduce task interference between severity prediction and language generation. To support this task, we further extend the PFED5 dataset with expert-guided textual motion descriptions and construct PFED5-plus. Experiments on PFED5-plus show that TraMP-LLaMA outperforms competitive video-language baselines in report generation and achieves the best severity prediction performance among the compared methods under joint multi-expression training, improving Spearman’s rank correlation by at least 4.39 percent over all competing methods. The text annotations and code are available at this https URL.
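Severity prediction here is evaluated with Spearman's rank correlation, which is the Pearson correlation of the two rank vectors. The following is a minimal pure-Python sketch with average ranks for ties. The function names and toy scores are illustrative and are not taken from the paper or its code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


predicted = [0.1, 0.4, 0.35, 0.8]  # toy model scores
expert = [0, 1, 2, 3]              # toy severity labels

rho = spearman(predicted, expert)
print(round(rho, 3))
```

Because only the ordering matters, the metric rewards a model that ranks patients by severity correctly even if its raw scores are on a different scale from the expert labels.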
[CV-36] Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE ECCV2026
链接: https://arxiv.org/abs/2606.26938
作者: Haoyou Deng,Keyu Yan,Chaojie Mao,Xiang Wang,Yu Liu,Changxin Gao,Nong Sang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling diffusion models in visual generation. Recent advancements have focused on adaptively allocating computational resources across diverse tokens to improve efficiency and performance. However, we identify a routing assignment problem in existing diffusion MoE frameworks: the router fails to accurately allocate more computational resources to salient tokens. Our analysis attributes this failure to the router’s reliance on noise-corrupted latent features throughout the denoising process. Such stochastic noise obscures the critical structural and textural information, thereby preventing the router from effectively distinguishing salient tokens. To address this, we propose SharpMoE, a post-training framework with a saliency-harnessing accurate routing mechanism, which utilizes clean latent features as a noise-free guidance signal for routing. By bypassing the noise-distorted inputs, SharpMoE provides the router with clear saliency guidance, enabling the identification of salient tokens even in high-noise stages. Furthermore, we introduce a trajectory routing loss to constrain the compute allocation throughout the multi-step denoising trajectory, ensuring precise resource allocation along the generation rollout. Extensive experiments demonstrate that SharpMoE serves as a versatile, plug-and-play solution that further enhances the pretrained, converged MoE models, achieving state-of-the-art performance in visual generation.
[CV-37] PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation
链接: https://arxiv.org/abs/2606.26930
作者: Xiaomin Li,Qian Liang,Yinan Li,Ying Zhang,Chen Li,Jing Lyu,Huchuan Lu,Xu Jia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly advanced text-to-image post-training. However, current methods often favor superficial aesthetics, such as over-saturated colors, leaving critical flaws like AI artifacts and biological implausibilities unresolved. We attribute these limitations to two primary factors: (1) the absence of real images during post-training confines GRPO sampling to the original distribution, so the model fails to break its inherent generative boundaries; (2) the optimization process lacks specific rewards targeting fine-grained artifacts such as overly oily skin. To address this, we propose PortraitGen, a novel framework tailored for photorealistic portrait generation. First, we break inherent generative boundaries by directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents. Second, to explicitly steer the model toward photorealism, we introduce a complementary dual-reward mechanism: OmniReward for general quality and AI-Portrait for human-centric fidelity. Furthermore, we curate PortraitBench, a comprehensive portrait-centric benchmark. Extensive experiments demonstrate that PortraitGen significantly outperforms existing baselines, effectively suppressing AI artifacts and achieving unprecedented photorealism.
[CV-38] PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation ECCV2026
链接: https://arxiv.org/abs/2606.26916
作者: Kexu Cheng,Zicheng Liu,Mingju Gao,Chunhe Song,Hao Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Developing physically aware video generation models remains a significant challenge due to the difficulty in capturing diverse physical phenomena, such as thermal dynamics, mechanics, and optics. In this work, we introduce PhysRAG, a novel pipeline that enhances physical awareness in video generation through Retrieval-Augmented Generation (RAG). To address the issue of limited high-quality data, we design a two-stage data filtering pipeline based on the WISA-80K dataset, resulting in a curated set of 7K high-quality videos for training. Furthermore, we construct a physical video database and develop a mechanism to inject physical knowledge into a video diffusion model using learnable queries. Our method achieves state-of-the-art performance in both visual quality and physical rule compliance, surpassing existing models in benchmarks such as PhyGenBench and VBench. We conduct extensive ablation studies to validate the effectiveness of our key components, including the data filtering pipeline, RAG mechanism, and method for physical information extraction. To facilitate future research, our code, data, and models are prepared for release at this https URL.
[CV-39] Neural Texture Compression using Hypernetworks
链接: https://arxiv.org/abs/2606.26913
作者: Belcour Laurent
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 12 figures, conference
Abstract:Recent work on neural texture compression has demonstrated that it is possible to learn small, per-material texture representations (composed of latent textures and a small Multi-Layer Perceptron decoder) that can be decoded in real-time during shading to reproduce the input to a physically based shading model. However, existing methods require performing gradient-descent optimization per material for a given MLP and latent configuration. In this work, we train a single hypernetwork that outputs both the latent features and the MLP’s weights and biases. Though the solution space is high-dimensional, this approach produces results comparable in quality to the current reference neural texture compressors. We further extend this approach to infer multiple decoders at once or even produce decoders that learn super-resolution.
[CV-40] Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation
链接: https://arxiv.org/abs/2606.26907
作者: Zekai Zhang,Jiahao Li,Jie Zhang,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiaoyue Chen,Xiao Xu,Yan Shu,Yanran Zhang,Yixian Xu,Yuxiang Chen,Zhendong Wang,Zihao Liu,Zikai Zhou,Huishuai Zhang,Dongyan Zhao,Chenfei Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While text-to-image (T2I) models have achieved remarkable progress, they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the sufficient generation context for T2I models. To bridge this gap, we propose Qwen-Image-Agent, a unified agentic framework that integrates plan, reason, search, memory and feedback in a context-centric manner. Qwen-Image-Agent treats user input as partial context and progressively constructs the generation context through Context-Aware Planning and Context Grounding. Specifically, Context-Aware Planning identifies missing context and plans how it should be acquired and used, while Context Grounding gathers this context from reason, search, memory, and feedback. To evaluate agentic image generation, we further introduce Image Agent Bench (IA-Bench), a benchmark covering four core image agent capabilities: Plan, Reason, Search, and Memory. Experiments on IA-Bench, Mindbench and WISE-Verified show that Qwen-Image-Agent outperforms strong baselines and achieves state-of-the-art performance.
[CV-41] Confidence-Aware Tool Orchestration for Robust Video Understanding
链接: https://arxiv.org/abs/2606.26904
作者: Yangfan He,Yujin Choi,Jaehong Yoon
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
[CV-42] Tractography-Driven Synthetic Data Generation for Fiber Bundle Segmentation in Tracer Histology MICCAI2026
链接: https://arxiv.org/abs/2606.26898
作者: Kyriaki-Margarita Bintsi,Sparsh Makharia,Yaël Balbastre,Joselyn Romero Avila,Julia F. Lehman,Suzanne N. Haber,Anastasia Yendiki
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2026
Abstract:Diffusion MRI (dMRI) tractography enables non-invasive reconstruction of white-matter pathways, but its accuracy is fundamentally limited by indirect, low-resolution measurements of axonal organization. Tracer injection studies in non-human primates provide a gold standard for validating dMRI tractography. This, however, requires time-consuming manual annotation of fiber bundles in histology sections. We propose a synthetic-data augmented framework for automated fiber bundle segmentation in macaque tracer histology. Our approach uses ex vivo dMRI tractography as a generative prior to synthesize 2D image patches for training. This provides us with sufficiently realistic foreground texture, which we compose with backgrounds from blockface photos and diversify via domain randomization. A 2D U-Net is trained on mixed real and synthetic patches. Experiments on held-out brains demonstrate improved generalization across brains and fiber bundle densities compared to training with real data only. Training with synthetic data only leads to poor performance, underscoring the need for real supervision. Overall, our approach achieves performance comparable to the state-of-the-art while requiring 3x less manually annotated data.
[CV-43] Modeling Local, Global, and Cross-Modal Context in Multimodal 3D MRI
链接: https://arxiv.org/abs/2606.26894
作者: Minh Duc Do,Tillmann Rheude,Noel Kronenberg,Roland Eils,Benjamin Wild
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Brain MRI poses a fundamental challenge for machine learning: models must learn from high-dimensional 3D data spanning multiple co-registered modalities, despite the limited sample sizes typical of neuroimaging studies relative to the diversity in anatomy, pathology, and acquisition conditions. While multimodal imaging provides complementary information critical for clinical interpretation, effectively integrating these signals remains difficult. We propose Multimodal Intra- and Cross-Context Vision Transformer (MICViT), a 3D vision transformer that explicitly models both modality-specific representations and cross-modal interactions across local and global contexts. Concretely, MICViT combines four attention mechanisms: modality-specific local and global attention for intra-modal feature learning, and cross-modal local and global attention to capture interactions between modalities. We evaluate MICViT on brain age prediction across three heterogeneous datasets (UK Biobank, n=41,404; SOOP, n=1,062; Cam-CAN, n=613) using multiple MRI modalities (e.g. T1, FLAIR, DWI, SWI). MICViT consistently outperforms state-of-the-art CNN and transformer baselines in 3D settings. Notably, it benefits more strongly from multimodal inputs, yielding larger performance gains as additional modalities are incorporated. These results demonstrate that explicitly modeling intra- and cross-modal interactions is key to unlocking the full potential of multimodal brain MRI, highlighting a promising direction for representation learning in neuroimaging.
[CV-44] Bridging Vision and Language Concepts through Optimal Transport Semantic Flow
链接: https://arxiv.org/abs/2606.26891
作者: Chenyang Zhang,Anqi Dong,Guangming Zhu,Nuoye Xiong,Siyuan Wang,Lin Mei,Liang Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Concept Bottleneck Models (CBMs) promise transparent reasoning by predicting through human-interpretable concepts, yet their effectiveness fundamentally depends on how well visual and textual representations are aligned or matched. Existing vision-language CBMs often rely on pre-aligned encoders or global cosine similarity, which obscures fine-grained concept localization and fails to reflect true semantic geometry. In this work, we rethink concept alignment as a dynamic cross-modal transport process instead of static projection and propose the Optimal Transport Flow Concept Bottleneck Model (OTF-CBM). It first learns a data-driven semantic cost via Inverse Optimal Transport to measure cross-modal distances, and then performs unbalanced optimal-transport-based flow matching to model semantic transitions between visual patches and textual concepts. With velocity-based concept activation, OTF-CBM captures interpretable geometric relations without ODE integration. Experiments further show that OTF-CBM achieves superior classification accuracy and concept faithfulness, offering a new geometric and dynamical perspective for interpretable cross-modal reasoning.
[CV-45] RIS-Assisted Proactive Handover for Reliable mmWave Wireless Networks
链接: https://arxiv.org/abs/2606.26885
作者: Alaa Adnan,Mohammad Al-Quraan,Ahmed Zoha,M. Majid Butt,Sami Muhaidat,Muhammad Ali Imran,Marco Di Renzo,Lina Mohjazi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Millimeter-wave (mmWave) networks are highly susceptible to line-of-sight (LoS) blockages. Vision-aided wireless communications (VAWC) enable proactive handovers (PHO) to mitigate such blockages; however, PHO becomes challenging when no nearby base station (BS) is available. In such cases, reconfigurable intelligent surfaces (RIS) can be used to restore connectivity. To ensure timely PHO, the RIS configuration time must be taken into account, as the large number of RIS elements can limit responsiveness in time-sensitive scenarios. This work proposes a novel RIS-assisted PHO approach that optimizes the number of allocated RIS elements to balance signal processing complexity and link quality under handover timing constraints, making the RIS-assisted link more energy-efficient. An optimization problem based on particle swarm optimization (PSO) is formulated to determine the optimal end-to-end RIS link setup that runs offline to bypass latency constraints. Results show that reducing the number of RIS elements by 12% leads to a 10% decrease in dissipated energy without compromising the signal-to-noise ratio (SNR). Moreover, the RIS-assisted link achieves a 15–30 dB improvement in blocked regions while maintaining accurate PHO timing.
[CV-46] SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
链接: https://arxiv.org/abs/2606.26872
作者: Yankai Yang,Yancheng Long,Wei Chen,Xingyu Lu,Hongyang Wei,Bin Wen,Fan Yang,Tingting Gao,Han Li,Shuo Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent online reinforcement learning has substantially improved image editing quality. However, existing Flow-GRPO-style methods usually rely on a single whole-image reward, which makes fine-grained editing optimization difficult. We observe that a key obstacle in image editing is this spatial uniformity assumption: a whole-image reward cannot distinguish how different spatial regions contribute to image quality. To address this issue, we propose SpatialFlow-GRPO, a training framework that introduces spatially fine-grained reward feedback. The framework converts region-aware rewards into semantic-region-level optimization signals and aligns region advantages with the corresponding latent positions during policy updates. We also train a region-aware reward model, SFReward, construct SFReward-14K with region-annotated editing samples, and introduce MultiEditBench to evaluate multi-region editing ability. On OmniGen2 and FLUX.2-klein-4B, SpatialFlow-GRPO outperforms Flow-GRPO on GEdit-Bench, ImgEdit-Bench, and MultiEditBench. The results show that SpatialFlow-GRPO converts local feedback into spatially aligned update signals and improves editing quality.
[CV-47] Rolling Shutter Relative Pose Estimation Made Practical
链接: https://arxiv.org/abs/2606.26863
作者: Daniel Barath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rolling shutter (RS) cameras equip virtually all consumer devices, yet RS-aware relative pose estimation has remained impractical: the state-of-the-art solver requires a minimum of 20 point correspondences, making RANSAC-based robust estimation prohibitively expensive due to the exponential dependence of the iteration count on the sample size. We make RS relative pose estimation practical by introducing affine correspondences (ACs) into the RS two-view geometry. We derive novel RS-corrected affine constraints that account for the coupling between point perturbations and the row-dependent essential matrix, providing two equations per correspondence beyond the standard epipolar constraint. Building on these constraints, we develop a linearized algebraic solver that estimates pose and RS motion from only 7 ACs. The solver exploits the physical smallness of RS parameters to linearize the constraints, eliminates the 12 RS unknowns via null-space projection, and solves the remaining degree-20 system via action matrices in 1.2 ms. On the TUM RS benchmark, our method achieves the best pose and RS parameter accuracy among all tested methods and, uniquely among RS solvers, provides accurate translational velocity estimates – which are poorly conditioned from point correspondences alone due to a $\vec{v}$-$\vec{t}$ coupling. On the global-shutter EuRoC MAV dataset, the solver achieves comparable accuracy to the standard 5-point algorithm, demonstrating that it generalizes well to the GS setting. Code is at this https URL.
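The abstract's point about the exponential dependence of the RANSAC iteration count on the sample size can be illustrated with the textbook formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the minimal sample size, and p the desired confidence. The sketch below is not taken from the paper; the function name `ransac_iterations`, the 50% inlier ratio, and the 99% confidence are assumptions chosen only for illustration, and it assumes point and affine correspondences have the same inlier ratio.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Draws needed so that, with probability `confidence`, at least one
    minimal sample consists of inliers only (standard RANSAC bound)."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    # log1p keeps precision when p_good_sample is tiny (large sample sizes)
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good_sample))


# Hypothetical comparison: 5-point, 7-sample, and 20-sample minimal solvers
for s in (5, 7, 20):
    print(f"sample size {s:2d}: {ransac_iterations(0.5, s)} iterations")
```

Under these assumed settings the 20-sample case needs several orders of magnitude more iterations than the 7-sample case, which is the cost gap the abstract refers to.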
[CV-48] Appearance-Preserving Refinement of Generated 3D Assets for Monochromatic Fabrication
链接: https://arxiv.org/abs/2606.26850
作者: Chentao Shen,Chen Jia,Mingjie Huang,Zhuang Zhang,Haisen Zhao,Xiangru Huang
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: For preprint
Abstract:Recent advances in 3D mesh generation have enabled the creation of visually realistic assets. However, much of their visual fidelity is encoded in textures rather than geometry. When such assets are fabricated using monochromatic materials, texture information is largely lost, causing visually important details to disappear even when the original geometry is faithfully preserved. A key challenge is that the geometric perturbations required to recover texture-dependent appearance cues often introduce sharp local features and high-frequency surface structures, which may increase stress concentration and fabrication risk. In this paper, we present GenMF, an appearance-oriented geometry refinement framework for monochromatic fabrication. GenMF transforms texture-dependent visual cues into geometry-induced shading effects and formulates geometry refinement as a balance between appearance preservation and fabrication-oriented robustness. To discourage structurally and narrow the gap between simulation and physical manufacturing, we further introduce a differentiable stress-aware regularization based on a learned thermal-stress predictor. Experimental results demonstrate that GenMF significantly improves appearance preservation under monochromatic rendering while reducing stress concentration under a consistent thermo-mechanical simulation setting. Physical 3D printing examples further show that the refined geometries preserve more recognizable visual details while remaining suitable for fabrication. These results suggest that appearance-aware geometry refinement provides an effective bridge between generated 3D assets and fabrication-ready monochromatic objects.
[CV-49] Liquid Fusion of Heterogeneous Representations Towards General Salient Object Detection
链接: https://arxiv.org/abs/2606.26849
作者: Ke Chen,Ling Zhou,Guangqi Jiang,Gengshen Wu,Yi Liu,Shoukun Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 5 figures
Abstract:General Salient Object Detection (SOD) aims to identify and segment visually interesting objects from uni-modality or multi-modality scenes, and has recently been advanced by cutting-edge State Space Models (SSMs). However, a critical limitation of current approaches is their neglect of the inherent spectral biases exhibited by different neural network paradigms. A dataset-level spectral analysis of Convolutional Neural Networks (CNNs) and SSMs shows that their semantic representations are inherently complementary, owing to their complementary frequency preferences. Inspired by this, we harmonize heterogeneous representations from SSMs and CNNs to bridge their spectral biases for general salient object detection. To this end, inspired by the dynamic information propagation of Liquid Neural Networks (LNNs), we introduce a liquid fusion that dynamically integrates features from two backbones, VMamba and ConvNeXt; the resulting model is referred to as Liquid Fusion Network (LFNet). Concretely, by treating the continuous VMamba features and ConvNeXt features as evolving states and exogenous stimuli, respectively, LFNet employs a dynamic gating mechanism for content-aware feature aggregation. Crucially, this state-stimulus paradigm scales to multi-modal cues, giving flexibility in general SOD. In addition, a Saliency-Guided Upsampling (SGU) operator propagates the features to the shallow layer, leveraging a spectral-spatial co-design to suppress upsampling artifacts while preserving semantics. Extensive experiments across five diverse tasks (RGB, RGB-D, RGB-T, VSOD, and VDT) demonstrate that LFNet achieves state-of-the-art performance, offering a superior trade-off between detection accuracy and model efficiency. Code has been released at this https URL.
[CV-50] Ordinal Neural Collapse as a Representation Prior for Visual Navigation
链接: https://arxiv.org/abs/2606.26839
作者: E-In Son,Jung-Taak Kim,Seung-Woo Seo
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 14 figures. Supplementary material included
Abstract:Learning robust navigation policies directly from visual observations remains a fundamental challenge in vision-based robotic navigation. In end-to-end imitation learning approaches, the visual encoder and action decoder are jointly optimized using a single action loss, which provides only an indirect supervisory signal to the encoder. This indirect supervision frequently results in the encoder learning ambiguous, action-agnostic representations. The problem is further complicated by substantial variations in scene structure and appearance across diverse environments, as well as the prevalence of visual distractors inherent to real-world navigation settings. Such action-agnostic features cause the navigation policy to produce inconsistent actions at ambiguous decision points, leading to navigation failure. To overcome these limitations, we propose ORION (Ordinal Neural Collapse for Visual Navigation), a method that explicitly organizes the encoder’s representation space according to the ordinal structure of navigation actions. In the context of goal-directed navigation, ego-centric control categories from Far Left to Far Right exhibit a natural ordinal relationship in which neighboring classes share similar visual contexts, while semantically opposing classes differ substantially in appearance. We encourage class representations to be arranged sequentially along a single discriminative axis, while suppressing off-axis variance within each class. The pretrained encoder is then integrated into a diffusion-based navigation framework, and the full pipeline is fine-tuned end-to-end. Extensive experiments in both simulation and real-world settings show that ORION consistently outperforms end-to-end and neural collapse baselines in navigation success rate and goal progress, with notable gains in visually challenging scenarios such as complex multi-way intersections.
[CV-51] Identifying the Unknown: Prompt-Free Open Vocabulary Anomaly Recognition for Robot-Object Interaction
链接: https://arxiv.org/abs/2606.26829
作者: Philipp Allgeuer,Jan-Gerrit Habekost,Stefan Wermter
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Artificial Neural Networks 2026
Abstract:Robots operating in real-world environments must in general be able to recognize previously unseen objects. As robotic systems move toward open-world autonomy, there is a growing, yet largely unmet, need for open vocabulary object detectors that are prompt-free and efficient enough for continuous deployment. We present AnomNOVIC, a two-stage known-workspace framework that combines a masked autoencoder (MAE) trained for anomaly detection, with NOVIC, a powerful real-time prompt-free open vocabulary image classifier. The MAE produces generic object-agnostic bounding boxes, allowing NOVIC to classify salient image regions without requiring a predefined candidate class list. We evaluate AnomNOVIC against strong open vocabulary baselines in a tabletop robot-object environment featuring the NICOL humanoid robot, reaching 47.1% AP / 57.5% AP50 for prompt-free recognition, and 59.0% AP / 72.5% AP50 if class candidates are provided. Across additional datasets, including an in-the-wild test set with 48 unique objects, AnomNOVIC reaches up to 82.6% prompt-free detection and classification accuracy. These results significantly surpass all tested open vocabulary baselines, including YOLO-World-v2, OWLv2, and YOLOE.
[CV-52] Learning Adversarial Augmentation Policies for Robust Garlic Seedling Detection
链接: https://arxiv.org/abs/2606.26828
作者: Soeun Lee,Chanho Kim,Yeji Kang,YoungKi Hong,Byeongkeun Kang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages
Abstract:Accurate seedling detection during early growth stages is essential for timely replanting and effective crop management in precision agriculture. However, existing studies are mostly evaluated under relatively stable imaging conditions, such as UAV imagery or greenhouse environments, leaving robust detection under severe and spatially heterogeneous illumination in ground-based outdoor monitoring insufficiently explored. In addition, many illumination-robust detection methods rely on additional enhancement or feature-extraction modules, which increase inference-time overhead and are not tailored to seedling detection and downstream missing seedling localization. To address these gaps, we construct a new garlic seedling dataset captured using a ground-based monitoring platform under real outdoor field conditions with highly variable illumination. We further propose an illumination-robust seedling detection framework based on adversarial augmentation policy learning. The proposed method jointly optimizes a stochastic augmentation policy agent and an object detector, enabling the detector to learn robust representations under challenging visual conditions. A structural penalty is introduced to prevent unrealistic distortions while encouraging challenging augmentations during training. Extensive experiments show that the proposed approach achieves an AP50 of 91.6%, improving the baseline by 0.9 percentage points and outperforming the previous best-performing method by 0.2 percentage points. For downstream missing seedling localization, it achieves 75.0% precision and a 67.0% F1-score, improving the baseline by 4.8 and 2.0 percentage points, respectively. These results demonstrate the effectiveness of the proposed framework for practical ground-based agricultural monitoring under complex outdoor lighting conditions without additional inference-time computational overhead.
[CV-53] Multi-modality Image Fusion under Adverse Weather: Mask-Guided Feature Restoration and Interaction ECCV2026
链接: https://arxiv.org/abs/2606.26812
作者: Xilai Li,Xiaosong Li,Haishu Tan,Tao Ye,Huafeng Li,Hongbin Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:Multi-modality image fusion (MMIF) enhances scene representation by exploiting complementary cues from different modalities. Adverse weather, however, causes significant image degradation, disrupting feature representation and requiring simultaneous feature restoration and cross-modal complementarity. Existing methods often struggle with effective representation learning under such conditions, limiting their practical performance. To address these challenges, we propose a mask-guided MMIF method that integrates feature restoration and interaction. We first introduce “Pseudo Ground Truth” to simplify training, promoting faster and more effective feature learning. Then, we design a mask generation mechanism based on the mapping relationship between the fused result and the source images, quantifying the relative contribution of each modality during the fusion process. By incorporating the proposed mask-guided cross-modal cross-attention mechanism, the network is encouraged to selectively attend to informative features during modality interaction, mitigating the risk of overfitting to the static distribution of the “Pseudo Ground Truth”. Additionally, we propose a mask-guided learning strategy and a task-coupled degradation-aware learning strategy to balance feature restoration and interaction. Extensive experiments on synthetic and real-world datasets demonstrate that our method surpasses state-of-the-art approaches in visual quality, quantitative metrics, and downstream tasks. The source code is available at this https URL.
[CV-54] Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
链接: https://arxiv.org/abs/2606.26801
作者: Yuan Xu,Yixiang Chen,Kai Wang,Jiabing Yang,Peiyan Li,Qisen Ma,Yan Huang,Liang Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event target should be. This causes failures to concentrate around challenging gripper-event transitions. To address this, we propose StaKe, a plug-in auxiliary supervision framework that automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage, and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are modeled as lightweight auxiliary heads that enrich the learned representations during training, while leaving the base VLA policy architecture and inference loop unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks show that StaKe consistently improves success rates (relative gains of 14% and 56%, respectively), with larger improvements on longer-horizon tasks that involve more gripper-event transitions. Ablation studies validate each design choice, and qualitative analysis confirms that the learned representations faithfully track manipulation stages. These results indicate that structured supervision is an effective and general strategy for enhancing VLA fine-tuning in long-horizon manipulation. Project website: this https URL
[CV-55] NaviCache: Test-Time Self-Calibration Caching for Video Generation ICML2026
链接: https://arxiv.org/abs/2606.26795
作者: Zheqi Lv,Zhibo Zhu,Jinke Wang,Qi Tian,Shengyu Zhang,Zhengyu Chen,Chengxi Zang,Zhou Zhao,Fei Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Published at ICML 2026: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract:Video Diffusion Models (VDMs) are constrained by immense computational costs. While offline calibration-based acceleration suffers from calibration data dependency, prohibitive calibration duration, and susceptibility to distribution shifts, offline calibration-free methods eliminate these hurdles. However, since they rely on instantaneous zero-order approximations where the mapping between input and output differences varies in real-time, they are susceptible to observational noise and ignore the intrinsic momentum within the diffusion trajectory. In this paper, we propose NaviCache, a plug-and-play test-time self-calibration method re-conceptualizing feature evolution as an Inertial Navigation System (INS) problem. NaviCache bridges the fundamental domain gap and the non-stationary nature of diffusion by modeling the relative coupling between input and output variations. We introduce a dual-state estimation architecture that adaptively tracks the feature change ratio and its latent drift, initialized via a specialized Initial Alignment phase. By integrating a time-dependent noise schedule with an uncertainty-aware Measurement Update mechanism, NaviCache provides a theoretically grounded mechanism for error-bounded computation skipping. Extensive experiments on the HunyuanVideo, Wan, and Open-Sora series demonstrate that NaviCache exhibits more accurate error judgment for computation skipping and achieves outstanding comprehensive performance.
[CV-56] ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP ECCV2026
链接: https://arxiv.org/abs/2606.26794
作者: Sicheng Zhang,Muzammal Naseer,Binzhu Xie,Naufal Suryanto,Shi Qiu,Jamal Bentahar,Naveed Akhtar,Mubarak Shah
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ECCV2026
Abstract:CLIP and its variants are widely adopted visual backbones in multimodal systems, but their pretraining remains dominated by descriptive image-text alignment. As downstream applications increasingly demand visually grounded commonsense inference and compositional reasoning, it remains unclear whether CLIP-style encoders can support such reasoning without architectural changes. To address this, we present ReasonCLIP-58M, a continual pretraining framework that integrates large-scale reasoning supervision into CLIP-style models through our two-stage strategy, which progressively integrates reasoning signals while preserving descriptive alignment, followed by category-structured reasoning supervision. To support this framework, we construct two complementary datasets and a benchmark: ReasonLite-42M, with open-form, visually verifiable reasoning captions; ReasonPro-16M, with category-specific reasoning supervision; and RCLIP-Bench for diagnostic evaluation of visually grounded reasoning. We train a family of ReasonCLIP that improves visually grounded commonsense and compositional reasoning while also enhancing zero-shot retrieval performance. As a drop-in visual encoder for multimodal large language models such as LLaVA-NeXT, ReasonCLIP delivers consistent gains without additional inference cost, demonstrating that structured reasoning supervision enhances the expressive capacity of CLIP-style visual representations. All datasets, models, and training code are available at this https URL.
[CV-57] Event-based Gaze Control System for Accurate Real-time Spin Estimation in Professional Ball Games
链接: https://arxiv.org/abs/2606.26780
作者: Yunpu Hu,Fabian Schilling,Valentina Cavinato,Asude Aydin,Agis Politis,Ricardo Tapiador Morales,Kirk Y.W. Scheper,Peter Dürr,Naoya Takahashi
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Spin plays a crucial role in many ball sports due to its effect on the trajectory of the ball. Vision-based estimation of the ball’s spin during a game with conventional cameras is challenging due to the ball’s small size, high speed, and fast rotation. To address these challenges, we propose an event-based active vision system that can track unmodified balls and measure their spin in real-time. The system consists of an event camera for its high temporal resolution and minimal motion blur, high-speed pan/tilt galvanometer mirrors to keep the ball in the field of view, and a low-latency focus-tunable telephoto lens to increase the spatial resolution on the ball and keep it in focus. To track the ball, we use a hybrid approach that combines 2D event-based detection for centering and 3D positions from a ball localization system for re-initialization. For high-accuracy spin estimation, we propose an offline method that performs contrast maximization on the sphere (s-CMax). This method achieves state-of-the-art accuracy on static balls across multiple sports (table tennis, baseball, tennis, and golf), with mean magnitude and axis errors of 2.1% and 4.0 degrees, respectively. We then develop a low-latency online method for table tennis as a case study in real-time applications. This method uses an uncertainty-aware convolutional neural network trained on pseudo-ground-truth spin labels from the offline approach, combined with a GPU-accelerated batch implementation of contrast maximization for refinement. We demonstrate reliable tracking and spin estimation with a three-view setup during professional table tennis matches, with high accuracy (8.8% magnitude and 6.4 degrees axis mismatch), 3 ms latency, and 750 Hz throughput.
[CV-58] LearniBridge: Learnable Calibration of Feature Caching for Diffusion Models Acceleration ICML2026
链接: https://arxiv.org/abs/2606.26778
作者: Xuyue Huang,Zhe Chen,Wang Shen,Xiao-Ping Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICML 2026
Abstract:Diffusion Transformers (DiTs) have driven substantial progress in image and video generation but suffer from prohibitive computational costs. Feature caching accelerates inference by reusing intermediate representations. Existing methods rely on historical features for implementation simplicity, yet suffer from severe error accumulation at high acceleration ratios. To address this limitation, we investigate the nature of the requisite feature correction. We demonstrate that the optimal calibration update is characterized by a shared low-rank subspace across diverse prompts. Guided by this structural insight, we propose LearniBridge, a learnable calibration mechanism for feature caching that bridges multiple timesteps through lightweight LoRA updates. This mechanism enables effective calibration requiring only 3-5 training samples. Extensive experiments on image and video generation show that LearniBridge achieves up to 5.87×, 5.75×, and 4.10× acceleration on FLUX, HunyuanVideo, and WAN2.1, respectively. On WAN2.1, it improves VBench by 1.28% over the previous SOTA at 4.10× acceleration. Our code is available at this https URL.
[CV-59] ResilPhase: Plug-and-Play Phase Mapping and Noise-Resilient Macro-Trajectory Extrapolation for Diffusion Acceleration ECCV2026
链接: https://arxiv.org/abs/2606.26769
作者: Qicheng Zhao,Yu Li,Qi Sun,Zheyu Yan
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:The adoption of powerful diffusion models is hindered by their significant inference latency. Recent "cache-then-forecast" schemes alleviate this issue by accelerating DiTs using derivative-based polynomials, but they suffer from severe quality degradation at high acceleration ratios. Our analysis reveals the root cause: the discrete extrapolation is performed on representations that are misaligned with the continuous diffusion trajectory and are numerically unstable. Thus, accelerated DiTs suffer from accumulated spatial errors, noisy derivative amplification, and high-order instability. We therefore reformulate accelerated inference as stable macro-trajectory extrapolation in ordinary differential equation (ODE) space. Instead of predicting intermediate features, we align forecasting with the model's Global Drift (GD), i.e., the end-to-end state evolution, thereby eliminating feature inconsistency and memory overhead. However, even this smooth macro-trajectory remains vulnerable to the derivative fallacy: its higher-order temporal derivatives are intrinsically noisy. Thus, we introduce a derivative-free barycentric Lagrange extrapolator to effectively bypass derivative instability and approximation error. We further propose a bounded Phase Mapping that regularizes the extrapolation domain, suppressing oscillatory error growth. These elements collectively constitute ResilPhase, a noise-resilient acceleration framework. Experiments on FLUX.1-dev and HunyuanVideo demonstrate state-of-the-art fidelity under aggressive acceleration ratios.
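For readers unfamiliar with the term, the classical barycentric Lagrange formula evaluates an interpolating polynomial from stored samples alone, without estimating any derivatives. The sketch below shows only that textbook formula, used here for extrapolation past the last node; it is not the ResilPhase implementation, it omits the paper's Phase Mapping entirely, and the names `barycentric_weights` and `barycentric_eval` are hypothetical.

```python
import numpy as np


def barycentric_weights(nodes):
    """Weights w_j = 1 / prod_{k != j} (x_j - x_k) for distinct nodes."""
    nodes = np.asarray(nodes, dtype=float)
    diffs = nodes[:, None] - nodes[None, :]
    np.fill_diagonal(diffs, 1.0)  # skip the k == j factor in each product
    return 1.0 / diffs.prod(axis=1)


def barycentric_eval(nodes, values, weights, t):
    """Evaluate the interpolating polynomial at t (inside or outside the nodes).

    `values` has shape (n, ...): one cached state per node.
    """
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    delta = t - nodes
    hit = np.isclose(delta, 0.0)
    if hit.any():
        return values[int(np.argmax(hit))]  # t coincides with a node
    coeff = weights / delta
    return np.tensordot(coeff, values, axes=(0, 0)) / coeff.sum()


# Illustrative use: three cached "states" at t = 0, 1, 2, extrapolated to t = 3
nodes = [0.0, 1.0, 2.0]
states = np.array([[0.0, 1.0], [1.0, 3.0], [4.0, 5.0]])  # columns: t**2 and 2*t + 1
w = barycentric_weights(nodes)
print(barycentric_eval(nodes, states, w, 3.0))
```

Because both example columns are polynomials of degree at most two, three nodes reproduce them exactly; real diffusion trajectories are not polynomial, so extrapolation error grows with distance from the nodes.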
[CV-60] Anatomy-Guided Residual Motion Diffusion for Controllable 4D Cardiac MRI Synthesis
链接: https://arxiv.org/abs/2606.26764
作者: Yiheng Cao,Gustavo Andrade-Miranda,Jiatian Zhang,Lingxiao Zhao,Xin Gao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing robust artificial intelligence models for 4D (3D + time) medical imaging is constrained by limited annotated data, inter-device domain shifts, and privacy restrictions. To address this, we propose a 4D controllable generative framework for anatomically consistent data augmentation. A semi-supervised variational autoencoder learns a compact latent representation of anatomical volumes while jointly predicting aligned segmentation masks in a unified framework. Anatomical structure is then disentangled from temporal dynamics through a cascaded latent diffusion model (LDM). A static LDM generates subject-specific anatomy conditioned on clinical priors (diagnosis and volume measures), and a subsequent motion LDM estimates residual latent motions, ensuring strict temporal coherence across the 4D sequence. The proposed approach was evaluated on cine cardiac MRI as a representative 4D imaging application. Experiments across multiple datasets demonstrate high controllability of static anatomy (Pearson r 0.8) and strong temporal coherence (FVD = 288.08). In cross-vendor generalization experiments, augmenting training sets with synthetic 4D sequences significantly improves downstream segmentation performance. Using nnU-Net, the proposed augmentation strategy improves the average Dice score by 1.4% and reduces the Hausdorff Distance by 3.0 mm compared to training on real data alone; for the left ventricle, Dice improves by 2.8% with a 5.4 mm reduction in boundary error. Overall, this framework provides a scalable and controllable solution for 4D medical image synthesis, supporting the development of more robust models with limited annotations and cross-vendor variability. Code available on this https URL.
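The Dice score reported above is the standard overlap metric 2|A ∩ B| / (|A| + |B|) between a predicted and a reference mask. A minimal sketch of that definition follows; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made here, not details from the paper.

```python
import numpy as np


def dice_score(pred, target) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / denom)


# Toy 2x3 masks: one voxel overlaps, prediction has 2 voxels, reference has 1
pred = [[1, 1, 0], [0, 0, 0]]
ref = [[1, 0, 0], [0, 0, 0]]
print(dice_score(pred, ref))
```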
[CV-61] Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data ECCV2026
链接: https://arxiv.org/abs/2606.26763
作者: Honghang Chen,Xiujun Zhang,Xiaoli Sun,Mingqing Xiao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV2026 Accept
Abstract:Implicit neural representation (INR) has emerged as a powerful prior for multi-dimensional data (e.g., multispectral images and videos). However, most INR methods employing periodic activation functions (e.g., Sine) predominantly rely on function composition. This mechanism introduces optimization instability as network depth increases, thereby limiting their performance. Meanwhile, these methods fail to incorporate proper physical priors to effectively alleviate spectrum bias. To address these issues, inspired by the commonalities between deep periodic networks and generalized Fourier series, we propose a novel Calibrated Harmonic Overlaid Implicit Neural Representation (CHOIR). Specifically, we utilize Coordinated Harmonic Superposition (CHS) to replace the conventional function composition used in most INRs, thereby ensuring optimization stability when scaling network depth. Furthermore, we introduce a Perceptual Spectrum Calibration (PSC) to mitigate spectrum bias. This calibration embeds the ubiquitous power-law spectrum prior of natural images and adjusts the globally fixed spectrum towards a physically plausible log-uniform distribution. Extensive experiments on various multidimensional data recovery problems demonstrate that our method achieves superior performance over state-of-the-art approaches. Code is available at this https URL.
[CV-62] ProtoKV: Streaming Video Understanding under Delayed Query with Summary-State Memory ICML2026
链接: https://arxiv.org/abs/2606.26762
作者: Le Tu Ngoc Minh(KAIST),Jinyeong Lim(KAIST),Dongsu Han(KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 4 figures, Accepted to ICML 2026
Abstract:Streaming video understanding (SVU) must answer queries that arrive asynchronously while visual tokens stream continuously under strict GPU-memory and query-time latency budgets. A key challenge is delayed query: decisive cues may appear briefly, yet many subsequent updates occur before the query arrives, increasing the risk that those cues are evicted or diluted under bounded memory. We propose ProtoKV, a constant-footprint SVU memory that represents far history as a fixed-capacity summary state rather than retaining token instances. ProtoKV keeps an exact near-window KV cache and aggregates older content into a semantic-spatial prototype bank with residual statistics. At query time, each prototype is exposed through a bounded pseudo-token interface that is drop-in compatible with standard attention. Under matched budgets and comparable query-time cost, ProtoKV improves accuracy by up to 12.5 points over token-retention baselines on SVU benchmarks in the long-delay regime, with gains that grow as query delay increases.
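The idea of a constant-footprint memory that aggregates old content into a fixed number of prototypes can be pictured with a simple running-mean bank. The sketch below is a generic illustration under assumptions made here (nearest-centroid assignment by Euclidean distance, incremental mean update, the class name `PrototypeBank`); it does not reproduce ProtoKV's semantic-spatial prototypes, residual statistics, or pseudo-token interface.

```python
import numpy as np


class PrototypeBank:
    """Fixed-capacity summary of a token stream (hypothetical illustration)."""

    def __init__(self, capacity: int, dim: int):
        self.centroids = np.zeros((capacity, dim))
        self.counts = np.zeros(capacity, dtype=int)
        self.used = 0  # number of prototype slots filled so far

    def add(self, token) -> None:
        token = np.asarray(token, dtype=float)
        if self.used < len(self.counts):
            # Free slot available: the token becomes a new prototype
            self.centroids[self.used] = token
            self.counts[self.used] = 1
            self.used += 1
            return
        # Bank is full: merge the token into its nearest prototype (running mean)
        dists = np.linalg.norm(self.centroids - token, axis=1)
        j = int(np.argmin(dists))
        self.counts[j] += 1
        self.centroids[j] += (token - self.centroids[j]) / self.counts[j]


bank = PrototypeBank(capacity=2, dim=2)
for tok in ([0.0, 0.0], [10.0, 10.0], [2.0, 2.0]):
    bank.add(tok)
print(bank.centroids, bank.counts)
```

Memory stays at `capacity × dim` values no matter how many tokens stream in, which is the property the abstract calls a constant footprint.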
[CV-63] Capacity-Controlled Multi-View Stylization of 3D Gaussian Splatting ECCV2026
Link: https://arxiv.org/abs/2606.26754
Authors: Zhihao Wen, Yixin Yang, Bojian Wu, Yang Zhou, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:While 3D Gaussian Splatting (3DGS) provides an efficient and explicit representation for novel view synthesis, enforcing stylistic coherence across viewpoints remains challenging. Existing 3D stylization methods typically apply 2D feature-matching losses independently per rendered view, which leads to unstable style allocation, many-to-one feature reuse, and limited cross-view consistency. We propose a capacity-controlled framework for multi-view stylization of 3DGS, grounded in optimal transport. Specifically, we reformulate local style matching as a semi-balanced optimal transport problem. By introducing explicit column-capacity constraints with tunable strength, our formulation mitigates many-to-one matching and enables controllable allocation of style features. This transport-based objective provides a principled mechanism for balancing feature coverage and stylistic diversity while maintaining stable correspondences across viewpoints. To further enhance cross-view coherence, we incorporate a novel cross-view matching guidance to constrain correspondences between scene content and style patterns. In addition, we introduce several geometric regularizations to enhance the vanilla 3DGS, thereby enabling optimized Gaussian primitives to represent finer-grained textures during stylization. Extensive experiments demonstrate that our approach significantly improves multi-view stylistic consistency and produces stable, expressive 3D stylizations while preserving the core semantic structure of the scene.
[CV-64] Depth-Semantic Alignment and Affinity-Guided Fusion for Structured Radar Point Cloud Generation
Link: https://arxiv.org/abs/2606.26743
Authors: Amjad Hussain, Xin Qiu, Fuyuan Ai, Yuchen Tan, Zecheng Li, Chunyi Song, Wenjie Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Point clouds are an important carrier of three-dimensional spatial information, and their quality directly affects the performance of downstream perception tasks such as object detection and tracking. However, millimeter-wave radar point clouds are typically sparse, noisy, and structurally incomplete. To address these limitations, this paper proposes a multimodal point cloud generation method based on vision-radar fusion. The proposed method leverages image semantic information to impose structural constraints and achieve spatial alignment for radar point clouds, while incorporating a sparse completion strategy to enhance point density and recover missing structures. The generated point clouds are further evaluated in object detection and tracking tasks. Experimental results demonstrate that the proposed method effectively improves point cloud quality and enhances the detection accuracy and robustness of perception models in complex environments, providing a practical solution for multisensor point cloud generation and intelligent perception systems.
[CV-65] PressMimic: Pressure-Guided Motion Capture and Control for Humanoid Robot Imitation
Link: https://arxiv.org/abs/2606.26741
Authors: Yi Lu, Shenghao Ren, Tianyu Xiong, Zhaoxiang Li, Jiaqi Li, He Zhang, Tao Yu, Qiu Shen, Xun Cao
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Humanoid motion imitation requires not only accurate perception of human kinematics but also faithful reproduction of physical interactions with the environment. However, existing pipelines rely primarily on vision-based motion capture and kinematic imitation, largely ignoring contact dynamics, leading to artifacts such as foot sliding, floor penetration, and unstable behaviors. In this work, we revisit humanoid motion imitation from the perspective of physical grounding and leverage pressure as a unified modality across perception and control. We present PressMimic, a framework that integrates pressure into the full pipeline from motion capture to humanoid control. In the perception stage, we introduce FRAPPE++, a multimodal model that fuses RGB and pressure to jointly estimate 3D pose and global motion, where pressure provides explicit contact and support constraints to resolve ambiguity in vision-based estimation. In the control stage, we propose a pressure-supervised policy (PSP) that incorporates pressure-derived signals into reinforcement learning, enabling physically consistent contact patterns during execution. We further construct MotionPRO, a large-scale dataset with synchronized RGB, pressure, and motion capture data. Experiments show that pressure improves motion estimation accuracy, trajectory consistency, and execution stability. These results demonstrate that pressure serves as an effective physical grounding signal, bridging perception and control for physically consistent humanoid motion imitation.
[CV-66] LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing ECCV2026
Link: https://arxiv.org/abs/2606.26740
Authors: Xinyu Wang, Chongbo Zhao, Fangneng Zhan, Yue Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026, Project page: this https URL
Abstract:Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Meanwhile, recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to the strict preservation requirement and region-specific control. In this work, we present a novel streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. Our key design is a three-stage distillation pipeline that progressively transfers editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor, enabling stable long-horizon edits without sacrificing visual fidelity. To further support real-time deployment, we introduce an AR-oriented mask cache that reuses region-related computation across frames, substantially reducing redundant processing and accelerating inference. Finally, we establish a dedicated benchmark for streaming video editing. Extensive evaluations demonstrate that our method achieves state-of-the-art visual quality among streaming baselines while drastically boosting inference speed to 12.66 FPS, making it suitable for interactive and augmented reality applications.
[CV-67] Do Image Editing Models Understand Lighting?
Link: https://arxiv.org/abs/2606.26738
Authors: Tim Küchler, Johann-Friedrich Feiden, Matthias Nießner, Carsten Rother
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While recent advancements in generative image editing models have achieved stunning visual fidelity, it remains an open question whether these systems possess an intrinsic knowledge of real-world lighting. Existing benchmarks typically evaluate high-level plausibility of perceptual light transport on curated internet imagery, using VLMs or human judgement, or they rely on synthetically generated datasets. In this work, we introduce the 3D-anchored Light Probe (3DLP) benchmark, for which we have captured a new high-fidelity HDR dataset of real-world lighting changes. The dataset consists of 1K image pairs of diverse indoor scenery in which light probes are physically turned on and off. To allow for a granular performance analysis, we annotated specific image regions such as cast shadows or metallic surfaces. With this data, we evaluate a range of state-of-the-art image editing models by measuring how well their light probe edits align with reality. The evaluation uses two new scores to compensate for AI-generated photographic effects, such as adjusted white balance. Our results show that the overall performance of models differs considerably, with differences slightly less pronounced for specular highlights. The best image editing models are remarkably consistent with real-world physics; however, they still leave room for improvement. We observe that image regions that receive less light from the light probe are more prone to errors for all models. Furthermore, building on their success in evaluating macroscopic lighting plausibility, we test VLMs on our task but find that they are unsuitable for pixel-level light transport analysis. We will make the benchmark, together with the real-world dataset, publicly available to encourage future research on this topic.
[CV-68] Robust Onion: Peeling Open Vocab Object Detectors Under Noise ECCV
Link: https://arxiv.org/abs/2606.26734
Authors: Priyank Pathak, Mukilan Karuppasamy, Aaditya Baranwal, Shruti Vyas, Yogesh S Rawat
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at The 19th European Conference on Computer Vision (ECCV)
Abstract:The impact of real-world noise on Open Vocabulary Object Detectors (OV-ODs) remains poorly understood due to their architectural complexity. We present Robust Onion, a comprehensive empirical study that uses controlled synthetic visual degradations to peel OV-ODs layer by layer, revealing how, why, and where robustness degrades and systematically analyzing feature collapse. Our findings reveal that models with similar vision backbones exhibit comparable robustness, driven by similar feature collapse at similar layers, while factors such as pretraining strategy, architectural nuances, and caption supervision contribute little. Robustness is primarily governed by the image domain rather than annotations, explaining the similar robustness impact on COCO and LVIS, and why datasets like ODinW-13 can give an impression of inflated robustness due to large, isolated objects. Finally, we validate our insights by improving robustness on real-world BDD100K, WiderFace, and VisDRONE via our lightweight plug-and-play NN TK0 approach, using 96x fewer trainable parameters than end-to-end training. We also explain the robustness observations reported in prior work.
[CV-69] Full spectrum Unlearnable Examples via Spectral Equalization ICML
Link: https://arxiv.org/abs/2606.26719
Authors: Jiale Cai, Gezheng Xu, Zhihao Li, Ruiyi Fang, Ruizhi Pu, Di Wu, Qicheng Lao, Charles Ling, Boyu Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: to be published in ICML
Abstract:Unlearnable examples (UEs) protect training data by injecting imperceptible perturbations so that models fail to extract exploitable representations. In this paper, we reveal that existing UEs exhibit a critical failure once low-pass filtering is applied, indicating that the effective perturbation signals for unlearnability concentrate predominantly in high frequencies. Hence, we argue that reliable UEs should remain effective across the full spectrum. To this end, we propose Full-spectrum Unlearnable examples via Spectral Equalization (FUSE), which aims to generate spectrum-agnostic perturbations by equalizing the contributions from different bands and enforcing cross-band consistency. Specifically, FUSE adopts a Random Spectral Masking (RSM) strategy during generator training, which randomly removes a contiguous frequency band, forcing the remaining bands to maintain unlearnability. In addition, FUSE further integrates Cross-Band Guidance (CBG), which enforces mutual consistency between high- and low-frequency components, thereby further enhancing low-frequency unlearnability and regulating high-frequency perturbations to preserve the semantic fidelity of images. Extensive experiments across multiple datasets, architectures, and spectral filtering demonstrate the strong protection achieved by FUSE.
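The Random Spectral Masking step (removing one randomly chosen contiguous frequency band) can be illustrated on a 1D signal. This is a minimal sketch under stated assumptions: the function names are hypothetical, a plain DFT stands in for whatever transform the paper uses, and the band is defined by frequency magnitude so that the masked signal stays real. The abstract does not specify how the band is sampled.

```python
import cmath
import random


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a small illustration)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def random_band_mask(signal, band_width, rng):
    """Zero out a random contiguous band of frequencies lo <= f < hi.

    Assumes band_width <= len(signal) // 2 + 1.
    """
    n = len(signal)
    max_freq = n // 2
    lo = rng.randint(0, max_freq - band_width + 1)
    hi = lo + band_width
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k share the same frequency magnitude; masking both
        # keeps the spectrum conjugate-symmetric, so the output is real.
        freq = min(k, n - k)
        if lo <= freq < hi:
            spectrum[k] = 0
    return [value.real for value in idft(spectrum)], (lo, hi)


perturbation = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
masked, band = random_band_mask(perturbation, band_width=2, rng=random.Random(0))
```

In a training loop of the kind the abstract describes, the masked perturbation would be the one evaluated, so the surviving bands must carry the unlearnability signal on their own.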
[CV-70] A Latent ODE Approach to Spatiotemporal Modeling of Cine Cardiac MRI
Link: https://arxiv.org/abs/2606.26718
Authors: David Brüggemann, Ekaterina Krymova, Firat Özdemir, Jochen von Spiczak, Sebastian Kozerke, Samia Mora, Robert Manka, Mathieu Salzmann, Olga V. Demler
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cardiac magnetic resonance imaging (CMR) captures rich spatiotemporal information about ventricular structure and motion, but conventional risk models use only a few image-derived indices from selected cardiac phases. We present a latent dynamical model that encodes bi-ventricular anatomy and full-cycle cine motion as a continuous latent trajectory, using heart-rate-aware neural ordinary differential equation (ODE) dynamics and a graph-based mesh autoencoder to reconstruct anatomically consistent 3D+t ventricular motion. A covariate-conditioned prior defines the expected end-diastolic latent state, and a Cox proportional hazards model tests whether deviations from this prior predict incident heart failure. We studied 72,386 UK Biobank participants without baseline cardiovascular disease, including 367 incident heart failure events. In a held-out evaluation subset, adding the latent score to refitted pooled cohort equations improved the stratified C-index from 0.704 to 0.785, compared with 0.764 for seven established cardiac markers. Compared with non-graph and non-ODE approaches, the proposed model gave the best trade-off between reconstruction fidelity, generative realism, and downstream prognostic performance. These results suggest that continuous full-cycle modeling of ventricular motion provides informative cardiac phenotypes beyond conventional CMR summaries, while external validation in more representative patient cohorts is required before clinical risk-prediction use.
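For readers unfamiliar with the C-index figures quoted above, the following sketch computes Harrell's concordance index for right-censored data. It is the plain, unstratified version, so it is not the stratified variant the abstract reports, and the function name and toy data are illustrative assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk score.

    times  -- follow-up time per subject
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable


# Toy data: the third subject is censored and contributes only as the later member of a pair.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(times, events, risks))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives scale to the reported improvement from 0.704 to 0.785.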
[CV-71] Extracting Neural Materials from Multi-view Images
Link: https://arxiv.org/abs/2606.26715
Authors: Kim Youwang, Jon Hasselgren, Peter Kocsis, Andrea Weidlich, Tae-Hyun Oh, Jacob Munkberg
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project website: this https URL
Abstract:Neural materials can represent complex specular reflections and scattering effects in a compact, universal basis. However, acquiring and authoring such materials remains challenging. We present NeuMatEx, a differentiable inverse rendering method for extracting spatially varying neural materials from images. The nonlinear structure of neural material latent spaces makes optimization with naive inverse rendering infeasible. To address this, we train a Large Material Reconstruction Model (LMRM) that directly predicts initial base color, neural material latents, and aleatoric uncertainty guides from images. This material prior provides a good initialization and better constrains our subsequent optimization using inverse path tracing. The predicted uncertainty further helps by anchoring high-confidence regions more tightly to the LMRM prediction, preventing lighting and complex specular effects from being baked into materials. Experiments on synthetic and real assets show that NeuMatEx extracts complex materials with better visual quality and material decomposition than PBR-based methods.
[CV-72] Mask to Concept: Auto-Promptable SAM3 via Efficient Test-Time Concept Embedding Search for Few-Shot Annotation MICCAI2026
Link: https://arxiv.org/abs/2606.26711
Authors: Quan Zhou, Shaoqing Zhai, Qiang Hu, Jia Chen, Qiang Li, Zhiwei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2026
Abstract:Transforming foundation segmentation models from human-prompted tools into auto-promptable annotators is critical for scalable medical data annotation. Current methods commonly depend on external feature matchers or auxiliary networks to automate geometric prompting, which introduces architectural overhead and limits performance scalability. Although SAM3 natively supports concept segmentation via reusable text prompts, its direct use in medical imaging is hindered by a lack of fine-grained clinical knowledge and the ambiguity of human-written descriptions. In this work, we propose Mask to Concept (M2C), an efficient framework that adapts SAM3 for medical few-shot annotation without external modules, parameter retraining, or manual text engineering. Using only a few labeled images, M2C enables SAM3 to automatically search for transferable visual concepts entirely within its frozen architecture: it initializes a learnable concept embedding, uses it to prompt segmentation, and updates the embedding with gradients that minimize the concept segmentation error. We further introduce a Hybrid Uncertainty Estimation (HUE) module that calculates the prediction entropy and maps concept predictions back to the box prompts, measuring concept-geometry prompting inconsistency. Highly uncertain samples are actively flagged for human correction, and the corrected masks are then fed back to M2C to continuously search for more precise concept embeddings, forming a self-enhancing annotation loop with minimal expert effort. Experiments on medical segmentation benchmarks show that our method achieves SOTA few-shot segmentation performance and outstanding annotation efficiency, offering a practical and efficient pathway toward scalable medical image labeling. Code is available at this https URL.
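The entropy half of the uncertainty estimate can be sketched in a few lines. This is an illustration only: the function names, the per-image averaging, and the threshold are assumptions, and the concept-to-box inconsistency term that HUE also uses is not modeled here.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli foreground probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mean_mask_entropy(prob_mask):
    """Average per-pixel entropy of a 2D foreground-probability mask."""
    values = [binary_entropy(p) for row in prob_mask for p in row]
    return sum(values) / len(values)


def flag_uncertain(samples, threshold):
    """Return the names of samples whose mean entropy exceeds the threshold."""
    return [name for name, mask in samples.items()
            if mean_mask_entropy(mask) > threshold]


# Hypothetical predictions: one confident mask and one ambiguous mask.
samples = {
    "scan_a": [[0.99, 0.01], [0.98, 0.02]],
    "scan_b": [[0.55, 0.45], [0.50, 0.60]],
}
print(flag_uncertain(samples, threshold=0.5))  # ['scan_b']
```

Flagged samples would then go to a human annotator, and the corrected masks would feed the next round of embedding search, as the abstract describes.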
[CV-73] Intracranial Aneurysm Classification and Segmentation via Tri-Axial ROI and Multi-Task Learning
Link: https://arxiv.org/abs/2606.26706
Authors: Pengcheng Shi, Kaiyuan Yang, Houjing Huang, Jiawei Chen, Yan Lu, Jiaqi Liu, Murong Xu, Bjoern Menze, Xinglin Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Intracranial aneurysms are often asymptomatic until rupture, which carries high mortality. Rupture risk assessment and treatment planning depend on both aneurysm morphology and anatomical location, yet existing automated methods remain limited to binary detection without fine-grained anatomical classification or multi-class segmentation. We present a multi-task framework that simultaneously performs multi-label classification, multi-class aneurysm segmentation, and multi-class vessel segmentation across 13 anatomical locations and four imaging modalities (CTA, MRA, T2, T1-post). Our two-stage approach combines a fast 2D tri-axial Region of Interest (ROI) extraction method with a 3D multi-task nnU-Net backbone. A dual-decoder design mitigates the extreme volume imbalance between aneurysm and vessel classes, while cross-attention pooling and modality-specific auxiliary heads improve feature learning across heterogeneous inputs. Our two-fold ensemble achieved 2nd place in the RSNA 2025 Intracranial Aneurysm Detection challenge. Code, model weights, and a 3D Slicer plugin are publicly available.
[CV-74] PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
Link: https://arxiv.org/abs/2606.26694
Authors: Bin Hu, Yanwen Ma, Jiehui Huang, Ziliang Zhang, Haoning Wu, Ruicheng Zhang, Yaokun Li, Zijun Wang, Yuechen Zhang, Chun-Mei Tseng, Hanhui Li, Shengju Qian, Jun Zhou, Kaipeng Zhang, Xiaodan Liang, Jiaya Jia, Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Recent game world models can synthesize visually plausible, action-conditioned rollouts. However, their interaction behaviors often remain limited to exploratory or wandering trajectories, and physical dynamics are typically learned as implicit correlations from data rather than as controllable variables. This limitation hinders their applicability to authored game environments, where physical rules are deliberately designed and require explicit manipulation. We introduce PhysEditWorld, a multimodal dataset with physical parameters, with a primary focus on gravity in this initial version. At its core, PhysEditWorld is built upon a replay paradigm implemented with a UE5 replay-and-rendering pipeline. Each scenario records a normalized action trace and replays the same initial state, character controller, action sequence, and camera policy under multiple gravity configurations, enabling controlled and attributable physical variation. PhysEditWorld contains 12 cinematic UE5 scenes, over 100 hours of gameplay interactions, and more than 60 million rendered rollout frames. Each sample provides synchronized multimodal signals, including RGB, depth, normals, audio, action traces, camera trajectory, engine states, semantic annotations, and explicit gravity labels. We further conduct initial utility studies on both generative video models and world understanding models, demonstrating that PhysEditWorld enables improved gravity-faithful dynamics modeling, enhances consistency under physical edits, and provides a scalable foundation for controllable world modeling research.
[CV-75] DeCoFlow: Structural Decomposition of Normalizing Flows for Continual Anomaly Detection
Link: https://arxiv.org/abs/2606.26687
Authors: Hun Im, Jungi Lee, Subeen Cha, Pilsung Kang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In industrial environments, new product categories arrive sequentially, requiring continual anomaly detection without access to past data. Normalizing Flows (NFs) provide exact density estimation but suffer from catastrophic forgetting as parameter updates across tasks distort the density manifold. While parameter isolation can prevent interference, it must preserve the strict invertibility and Jacobian validity of NFs. To satisfy these requirements, we exploit the inherent property that affine coupling layers maintain transformation validity regardless of subnet parameterization. Based on this, we propose DeCoFlow, which decomposes subnets into a frozen universal base and task-specific low-rank adapters to isolate updates. We further introduce Task-Specific Alignment, Auxiliary Coupling Layers, and Tail-Aware Loss to compensate for frozen-base rigidity. DeCoFlow achieves state-of-the-art image-level AUROCs of 98.40% on MVTec-AD and 93.00% on VisA, while maintaining parameter-level zero forgetting (0.00% FM under correct routing) with only 2.27M parameters per task.
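The property the abstract relies on, that an affine coupling layer stays invertible whatever its subnet computes, can be checked directly. The sketch below is a minimal illustration and not the DeCoFlow code: the linear subnet, the `tanh` bound on the log-scale, and all parameter values are assumptions. The subnet is a frozen base matrix plus a low-rank update `adapter_up @ adapter_down`, mirroring the decomposition described above.

```python
import math


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def subnet(x1, base, adapter_down, adapter_up):
    """Frozen base map plus a task-specific low-rank update (up @ down)."""
    base_out = matvec(base, x1)
    low_rank_out = matvec(adapter_up, matvec(adapter_down, x1))
    out = [b + a for b, a in zip(base_out, low_rank_out)]
    half = len(out) // 2
    log_scale = [math.tanh(v) for v in out[:half]]  # bounded log-scale (assumed)
    shift = out[half:]
    return log_scale, shift


def coupling_forward(x1, x2, params):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); also returns log|det J|."""
    log_scale, shift = subnet(x1, *params)
    y2 = [v * math.exp(s) + t for v, s, t in zip(x2, log_scale, shift)]
    return x1, y2, sum(log_scale)


def coupling_inverse(y1, y2, params):
    """Exact inverse: the subnet is only ever evaluated on the untouched half."""
    log_scale, shift = subnet(y1, *params)
    x2 = [(v - t) * math.exp(-s) for v, s, t in zip(y2, log_scale, shift)]
    return y1, x2


# Hypothetical parameters: a 4x2 frozen base and a rank-1 adapter for one task.
base = [[0.5, -0.2], [0.1, 0.3], [1.0, 0.0], [-0.4, 0.7]]
task_adapter = ([[0.3, -0.6]], [[0.2], [-0.1], [0.5], [0.4]])
params = (base, *task_adapter)

y1, y2, log_det = coupling_forward([0.8, -1.2], [2.0, -0.5], params)
x1_rec, x2_rec = coupling_inverse(y1, y2, params)
```

Because the inverse recomputes the same scale and shift from the unchanged half `y1`, swapping in a different adapter changes the density model but never breaks invertibility, and the Jacobian log-determinant stays a simple sum.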
[CV-76] Disco-LoRA: Disentangled Composition of Content Style and Motion for Multi-concept Video Customization
Link: https://arxiv.org/abs/2606.26668
Authors: Xuancheng Xu, Gengyun Jia, Bing-Kun Bao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Video customization based on Text-to-Video (T2V) models aims to learn specific features from reference data to generate controllable videos. While significant strides have been made in image stylization and video motion customization, simultaneously controlling multiple concepts, such as content, style, and motion, remains a major challenge. In this work, we systematically define the task of multi-concept video customization, which requires the joint control of content, style, and motion. To facilitate research in this area, we construct a comprehensive benchmark and propose Disco-LoRA, a unified framework designed to tackle this problem by disentangling and flexibly recombining different concepts in two stages: (1) We decompose the objective into two sub-tasks: Content-Style and Content-Motion. Each sub-task is addressed using our Iterative Dual-LoRA Disentanglement Framework, which effectively disentangles distinct concepts within the data. (2) We identify layer-wise weight trends as crucial for LoRA identity, while weight magnitudes dictate composability. To harmonize these scales, we propose a Z-score-based statistical regularization that aligns weight distributions, preserving layer-wise trends while minimizing interference between different LoRAs. Extensive experiments show that Disco-LoRA excels in multi-concept video customization, effectively preserving appearance, style, and motion for controllable text-to-video generation.
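One way to read the Z-score idea in point (2) is that standardizing per-layer weight statistics removes scale differences between LoRAs while keeping the layer-wise trend. The sketch below illustrates only that reading. The per-layer norms and the function name are hypothetical, and the abstract does not say which statistics the authors standardize or how the result enters training.

```python
import statistics


def z_score(values):
    """Standardize a list to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical per-layer weight norms of two LoRAs trained at very different scales.
style_norms = [0.1, 0.2, 0.4]
motion_norms = [10.0, 20.0, 40.0]

style_z = z_score(style_norms)
motion_z = z_score(motion_norms)
```

After standardization both profiles sit on a common scale, and the ordering of layers within each LoRA is unchanged, which matches the stated goal of preserving trends while reducing magnitude mismatch.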
[CV-77] LayersReg: A Layer-by-Layer Progressive Regressor for Reliable Intraoperative 3D/2D Registration
Link: https://arxiv.org/abs/2606.26647
Authors: Xiyuan Wang, Zhenchao Wang, Xinran Chen, Junkai Liu, Chuan Chen, Feng Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D/2D registration serves as a cornerstone technique in surgical navigation. Traditional iterative optimization algorithms suffer from low efficiency and high failure rates in intraoperative settings. Deep learning-based methods reformulate registration from iterative optimization to a regression problem that maps image appearance features to spatial pose, typically achieving improved real-time performance and accuracy. However, such learnable methods are confined to memory-driven retrieval of specific pose features rather than understanding the task of image alignment itself, which limits their generalization in complex scenarios. We propose LayersReg, a pioneering regression paradigm that endows the model with 3D anatomical awareness and searches for the correct pose in a progressive, layer-by-layer manner. Inspired by the iterative pose-searching optimization criterion of classical registration, LayersReg searches for correlations between the moving and fixed images in feature space, capturing the trend of pixel flow and thereby converging iteratively toward the correct spatial pose transformation. We further design a coupling of node-wise regression with the progressive registration framework to enhance the model’s perception of spatial pose changes. Experimental results demonstrate that under large offsets and multimodality conditions, LayersReg achieves high accuracy on both X-ray/CT registration (0.68°, 1.41 mm) and slice localization (0.73°, 1.55 mm) tasks, outperforming existing state-of-the-art methods while meeting the intraoperative demands for precision and real-time capability.
[CV-78] FracEvent: Event-Camera Simulation via Fractional-Relaxation Pixel Dynamics
Link: https://arxiv.org/abs/2606.26636
Authors: Langyi Chen, Chuanzhi Xu, Haoxian Zhou, Pengfei Ye, Ziyu Luo, Haodong Chen, Qiang Qu, Xiaoming Chen, Weidong Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Event cameras asynchronously report brightness changes with microsecond-level temporal resolution, but real event data remain difficult to collect at scale because specialized sensors, careful synchronization, and task-specific annotations are required. Event-camera simulation is therefore important to event-based vision tasks. Most practical simulators build on contrast-threshold event generation, some with additional filtering, stochastic noise, or hand-tuned sensor parameters. While effective, such formulations often simplify the temporal structure produced by the lifecycle of each pixel, which can distort event timing and weaken downstream transfer. We introduce FracEvent, an event simulator that models this pixel-level lifecycle with fractional-relaxation voltage dynamics. Given a log-intensity trajectory, FracEvent drives a compact stack of relaxation modes, combines their responses into a voltage state, emits ON/OFF events by localizing threshold crossings on the continuous voltage trajectory, and updates the reference while retaining the underlying memory modes. This retained state links residual voltage response to later event timing. We evaluate FracEvent through event-stream comparison and downstream transfer on image reconstruction and optical flow estimation. Across multiple datasets, FracEvent improves the temporal structure of generated events and achieves stronger downstream-transfer results than competing simulator baselines, showing its practical value for event-camera simulation.
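For context, the contrast-threshold model that most simulators build on can be written in a few lines. This sketch shows that baseline for a single pixel, not FracEvent's fractional-relaxation dynamics. The function name and the toy intensity trace are assumptions.

```python
import math


def contrast_threshold_events(intensities, timestamps, threshold):
    """Emit (timestamp, polarity) events for one pixel.

    An ON (+1) or OFF (-1) event fires each time the log intensity moves a
    full `threshold` away from the stored reference level.
    """
    events = []
    reference = math.log(intensities[0])
    for t, intensity in zip(timestamps[1:], intensities[1:]):
        log_i = math.log(intensity)
        # One event per full threshold step between the reference and the new value.
        while abs(log_i - reference) >= threshold:
            polarity = 1 if log_i > reference else -1
            reference += polarity * threshold
            events.append((t, polarity))
    return events


# Toy trace: brightness rises, holds, then falls below its starting level.
trace = [1.0, math.exp(0.25), math.exp(0.25), math.exp(-0.05)]
events = contrast_threshold_events(trace, [0, 1, 2, 3], threshold=0.1)
print(events)  # [(1, 1), (1, 1), (3, -1), (3, -1)]
```

Note that this baseline stamps every event with the frame time and keeps no state beyond the reference level, which is the simplification of per-pixel temporal structure that the abstract criticizes.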
[CV-79] Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions
Link: https://arxiv.org/abs/2606.26634
Authors: Garam Kim, Juyoun Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 16 figures
Abstract:Effective multi-task learning for surgical scene understanding is fundamentally hindered by annotation granularity mismatch: temporal workflow tasks such as phase recognition, step recognition, and anticipation benefit from dense frame-level supervision, whereas pixel-level spatial tasks including instrument segmentation and action recognition are only sparsely annotated on selected keyframes due to prohibitive labeling costs. This supervision imbalance undermines shared representation learning and limits joint optimization across heterogeneous surgical tasks. To address this, we propose Flow-guided Annotation for Robust Operating Scenes (FAROS), a flow-guided label interpolation framework that combines zero-shot segmentation-based mask propagation with optical flow estimation to overcome the limitations of appearance-based propagation under challenging surgical conditions such as occlusion, smoke, and motion blur, generating temporally consistent dense pseudo labels from sparse keyframe annotations. The densified instrument masks and action labels are integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision. The label interpolation quality of FAROS is first validated on the DAVIS 2017 benchmark under a sparse ground-truth protocol, confirming robust propagation beyond the surgical domain. Extensive experiments on GraSP, MISAW, and AutoLaparo benchmarks further demonstrate that FAROS significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks.
[CV-80] Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning
Link: https://arxiv.org/abs/2606.26631
Authors: Mengzhao Wang, Yanli Ji, Wangmeng Zuo, Peng Ye, Chongjun Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Interleaved multimodal reasoning improves visual grounding by revisiting visual evidence during multi-step generation, yet existing methods typically rely on token replay, repeatedly forwarding selected visual tokens. A natural shortcut is to reuse the historical visual key-value (KV) cache directly. However, we identify a critical failure mode of this strategy: cached visual keys are already bound to their original positional context. Such stale positional binding distorts attention under later decoding contexts and can trigger severe autoregressive decoding collapse. This failure suggests that effective cache reuse requires reconstructing visual evidence under positions compatible with the current decoding state, rather than directly copying position-bound historical cache entries. To this end, we propose Position Rebinding Cache Reuse (PRCR), a cache-level framework for replay-free visual revisiting. PRCR stores raw visual KV cache together with their original spatial coordinates, then reassigns position-compatible coordinates to select entries and rebinds their keys before injecting the reconstructed cache into the active decoder cache. This design reuses historical visual evidence while preserving textual positional continuity and relative visual structure. Experiments across multiple multimodal reasoning benchmarks show that PRCR achieves replay-level or better performance, improving average accuracy by 5 percent and reducing visual-revisiting computation by up to tens of thousands of times.
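Why a cached key is tied to its original position, and how it can be moved, is easiest to see with rotary position embeddings. The sketch below assumes RoPE-style keys, which the abstract does not state. It stores a position-bound key and rebinds it by rotating through the position offset, whereas PRCR as described stores raw keys with their coordinates. The function names are hypothetical.

```python
import math


def rotate_pairs(vector, position, base=10000.0):
    """Apply a rotary position embedding to consecutive (even, odd) pairs."""
    dim = len(vector)
    out = []
    for i in range(0, dim, 2):
        # Pair i // 2 rotates with frequency base ** (-i / dim).
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rebind(cached_key, old_position, new_position):
    """Move a position-bound key to a new position by rotating by the offset."""
    return rotate_pairs(cached_key, new_position - old_position)


raw_key = [0.3, -1.2, 0.7, 0.5]
cached = rotate_pairs(raw_key, 17)    # key as stored when the image was first seen
rebound = rebind(cached, 17, 230)     # key re-anchored near the current decoding step
direct = rotate_pairs(raw_key, 230)   # what a full replay at position 230 would produce
```

Since rotations compose additively, `rebound` matches `direct` up to floating-point error, so no forward pass over the visual tokens is needed. Copying `cached` unchanged would instead leave the key at position 17, which is the stale binding the abstract identifies.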
[CV-81] TaskTok: Delving into Task Tokens for Task-driven Image Restoration ECCV2026
Link: https://arxiv.org/abs/2606.26615
Authors: Hongjae Lee, Sojung Kang, Jaeseong Yu, Seung-Won Jung
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: ECCV 2026
Abstract:While traditional image restoration focuses on perceptual quality, Task-Driven Image Restoration (TDIR) aims to maximize the performance of downstream high-level vision tasks. Recent approaches leveraging generative priors have shown promise for TDIR; however, they typically suffer from computational inefficiency and potential semantic alteration by indiscriminately updating all latent tokens. In this paper, we posit that not all visual information is equally important for machine perception. Through an analysis of the latent token space, we observe that task-relevant cues are unevenly distributed across the token sequence, exhibiting index-wise specialization. This suggests that selectively refining a subset of tokens can be sufficient for task-driven objectives. Leveraging this insight, we propose TaskTok, a novel framework that selectively restores only task-relevant tokens via a learnable token switch and a lightweight token refinement module. Extensive experiments across image classification, semantic segmentation, and object detection demonstrate that TaskTok significantly enhances task performance with high computational efficiency. The source code is available at this https URL
[CV-82] LogicIR: Logic Gate Networks for Image Restoration ECCV2026
链接: https://arxiv.org/abs/2606.26609
作者: Hongjae Lee,Myungjun Son,Jaeseong Yu,Seung-Won Jung
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Image restoration aims to reconstruct high-quality images from degraded low-quality inputs. As the computational demands of image restoration models continue to rise, there is growing interest in lightweight architectures optimized for fast and efficient inference. Logic gate networks (LGNs), which operate using fundamental logic operations such as NAND and XOR, have recently emerged as a promising direction for achieving highly efficient computation. However, their potential remains largely untapped in the domain of image restoration. In this work, we introduce LogicIR, the first LGN specifically designed for image restoration tasks. LogicIR incorporates a UNet-inspired architecture composed entirely of logic gates. In addition, we propose a differentiable bit decoding layer and an index shuffling mechanism that improves information propagation across logic gates. Experimental results across multiple image restoration benchmarks demonstrate that LogicIR achieves strong performance with significantly reduced computational cost, establishing LogicIR as a viable and efficient alternative for image restoration. The source code is available at this https URL
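To make the notion of a trainable logic gate concrete, here is a minimal sketch of a relaxation commonly used in differentiable logic gate networks: each gate gets a real-valued surrogate on inputs in [0, 1], and a neuron mixes the candidate gates with softmax weights. The abstract does not say how LogicIR trains its gates, so the gate set, `soft_gate`, and `hard_gate` below are illustrative assumptions rather than the paper's implementation.

```python
import math

# Real-valued relaxations of two-input gates; exact on {0, 1} inputs.
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "NAND": lambda a, b: 1.0 - a * b,
    "XOR": lambda a, b: a + b - 2.0 * a * b,
}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_gate(a, b, logits):
    """Training-time output: softmax-weighted mixture of all candidate gates."""
    weights = softmax(logits)
    return sum(w * gate(a, b) for w, gate in zip(weights, GATES.values()))

def hard_gate(a, b, logits):
    """Inference-time output: keep only the gate with the largest logit."""
    names = list(GATES)
    best = names[max(range(len(logits)), key=lambda i: logits[i])]
    return best, GATES[best](a, b)
```

With uniform logits the soft gate averages all four gates; once one logit dominates, the neuron can be replaced by that single discrete gate for inference.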
[CV-83] DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues ECCV2026
链接: https://arxiv.org/abs/2606.26602
作者: Geng Li,Yuxin Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Project page with code: this https URL
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive fine-grained perception capabilities. However, existing benchmarks predominantly rely on explicit textual cues or low-resolution inputs, failing to evaluate a model’s ability to autonomously perceive implicit visual cues in high-resolution. To bridge this gap, we introduce DiCoBench, a comprehensive, multi-image high-resolution benchmark designed for cross-image fine-grained perception. DiCoBench consists of 765 meticulously curated samples categorized into two progressive tracks: Differential Visual Cues and Commonality Visual Cues, covering 8 distinct perception tasks. By formulating the benchmark as a multiple-choice question task and utilizing high-resolution imagery (approaching 2K), we eliminate evaluation metric bias and pose a substantial challenge to current state-of-the-art MLLMs. Our extensive evaluation of 18 diverse MLLMs reveals a striking performance gap compared to human accuracy (98.3%), with top-performing models struggling significantly with micro-scale detail capture. We believe DiCoBench will serve as a challenging testbed to drive future research in autonomous, high-resolution multi-image perception.
[CV-84] SpaceRipple: Lightweight Semantic Delivery for Mission-Oriented LEO Earth Observation Satellite Networks
链接: https://arxiv.org/abs/2606.26559
作者: Ziyi Yang,Hao Yuan,Yunxiang Yi,Wenbo Wang,Xing Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
Abstract:Earth observation satellite networks generate massive volumes of high-resolution imagery, whereas inter-satellite and downlink resources remain limited. In many time-sensitive missions, ground users require mission-relevant semantic information rather than a full raw-image downlink. This paper proposes SpaceRipple, a lightweight framework for mission-oriented semantic delivery and on-board processing in Earth observation satellite networks. A sensing satellite performs adaptive compression and metadata generation to reduce inter-satellite traffic, while an edge computing satellite restores the received representation and extracts task-relevant semantic information. Unlike fidelity-driven image transmission, SpaceRipple coordinates compression, forwarding, restoration, and semantic inference within a collaborative pipeline, enabling semantic-oriented delivery instead of pixel-level image delivery. A compression-aware MoE enhancement module is further introduced to improve robustness under degraded visual inputs. Experimental results show that SpaceRipple achieves favorable reconstruction quality, improved semantic detection performance, and substantial bandwidth savings, demonstrating its potential for efficient and reliable Earth observation under constrained satellite-network resources.
[CV-85] Coarse-to-Fine: A Hybrid Self-Supervised Method for Non-rigid 3D Shape Matching
链接: https://arxiv.org/abs/2606.26557
作者: Feifan Luo,Ting Li,Zhao Li,Hongyang Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Non-rigid 3D shape matching is a fundamental task in computer vision and graphics. In this paper, we propose a hybrid self-supervised method based on a coarse-to-fine strategy, which ensures consistency between the coarse mapping and the refined correspondence produced by our refinement module. The architecture features a dual-branch design, consisting of two symmetric functional map learning streams: one based on the Laplacian basis and the other utilizing the elastic basis. Extensive experiments show that our approach not only maintains computational efficiency, but also achieves state-of-the-art performance across a variety of challenging scenarios, including non-isometric deformations and topological noise. Finally, we rigorously demonstrate that contrastive energies promote feature discrimination. Furthermore, integrating these energies with existing methods yields consistent improvements, validating the overall efficacy of our approach. Our code is available at this https URL.
[CV-86] Perception Verdict and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection
链接: https://arxiv.org/abs/2606.26552
作者: Yangjun Wu,Keyu Yan,Yu Liu,Jingren Zhou,Fei Huang,Rong Zhang,Zhou Zhao,Fei Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:The rapid advancement of generative models presents a significant challenge to existing deepfake detection methods, particularly given the widespread dissemination of highly realistic AI-generated images. Although Multimodal Large Language Models (MLLMs) show strong potential for this task, existing approaches suffer from two key limitations: insufficient sensitivity to fine-grained forensic artifacts and reliance on static synthetic supervision from frontier models, leading to limited flexibility and high cost. To address these issues, we propose ForeAgent, an agentic forensics framework for AI-generated image detection with iterative self-evolution. First, ForeAgent adopts a Perception-Verdict architecture that aggregates multi-view cues spanning semantic, spatial, and frequency-domain features, and leverages an MLLM as a verdict module to fuse these signals into a logically grounded verdict. Second, to enable continual self-improvement, we introduce a Hindsight-Driven Self-Refining strategy following a Sampling-Reflection-Evolution paradigm. The agent performs inference rollouts on training instances. Guided by ground-truth labels as hindsight, it reflects on failure cases and low-quality reasoning trajectories to regenerate higher-quality reasoning traces. These synthesized samples are then strictly filtered through a dual-expert quality gating module. ForeAgent continuously evolves via fine-tuning on self-curated high-quality samples. Extensive experiments demonstrate that ForeAgent achieves state-of-the-art performance on the Chameleon benchmark, reaching 82.18% accuracy (+16.41% over AIDE), and achieves 93.3% mean accuracy on AIGCDetect-Benchmark across 16 generators. In addition, external evaluation shows that ForeAgent produces more consistent and causally grounded reasoning compared to GPT-5 and GPT-5-mini.
[CV-87] PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing ECCV2026
链接: https://arxiv.org/abs/2606.26551
作者: Shengbin Guo,Shaokang He,Chaoyue Meng,Shengpeng Xiao,Xunzhi Xiang,Shaofeng Zhang,Qi Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 6 figures, 2 tables. Accepted to ECCV 2026
Abstract:While instruction-based image editing, enabled by multi-modal generative models, has advanced significantly, existing benchmarks lack a comprehensive evaluation of physics-based reasoning, a critical capability for handling real-world scenarios. To address this, we introduce PhyEditBench, a benchmark designed to assess the physical understanding of editing models. Guided by a hierarchical taxonomy, we establish 4 primary classes and 12 subclasses. It comprises 238 high-quality, high-resolution, real-world instances meticulously extracted from videos to capture authentic physical dynamics, alongside 35 synthetic Anti-Physics instances. Our empirical analysis of current SOTA editing methods exposes substantial limitations in their physics-based reasoning. We further propose a training-free baseline named PhyWorld that uses test-time scaling and a latent reduction strategy. PhyWorld outperforms comparable models and suggests that the video generation process can effectively serve as a reasoning mechanism for image editing. The project page is available at this https URL.
[CV-88] From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP ECCV2026
链接: https://arxiv.org/abs/2606.26535
作者: Zhixing Li,Yinan Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ECCV 2026
Abstract:Current VLM evaluations often conflate language priors with genuine spatial reasoning. To address this, we introduce CRISP, a novel structural-diagnostic evaluation paradigm that assesses visual spatial intelligence through consistency, the alignment between implicit perception and explicit reasoning. Unlike traditional black-box QA, CRISP utilizes metric 3D Scene Graphs and an oracle intervention protocol to decouple latent reasoning capabilities from perceptual bottlenecks. This granular diagnosis uncovers a systematic perception-reasoning disconnect. Crucially, we reveal that while proprietary models possess robust latent reasoning engines, they suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations. Conversely, open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning. By shifting the focus from merely "guessing correctly" via language priors to genuinely "perceiving, verifying, and reasoning," CRISP offers a rigorous roadmap for multimodal alignment beyond end-to-end post-training. The code and dataset are available at this https URL.
[CV-89] Forget Anticipate and Adapt: Test Time Training for Long Videos ECCV2026
链接: https://arxiv.org/abs/2606.26515
作者: Rajat Modi,Sebastian Noel,Xin Liang,Yogesh Singh Rawat
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026. GLOM/APM’s temporal binding now works for long videos
Abstract:Test Time Training (TTT) is a mechanism in which a model adapts to an incoming test sample by performing a self-supervised (SSL) task and updating its weights during inference. This procedure does not require labels at test time. This paper focuses on TTT for long videos. Existing approaches raise two major concerns: 1) they perform TTT updates using a sliding window of past frames, whose compute increases linearly with the window size. This becomes computationally intractable when the videos are hours long. 2) TTT is performed even when temporally close frames look similar, thereby consuming a lot of compute. We present the Frame Forgetting Network (FFN) that: 1) operates on only three frames within the sliding window, namely the frame that exits, the current frame and the frame after that. The model still manages to retain temporal context and work for hours-long videos; 2) mathematically defines a surprise metric: how much new information the incoming frame contains with respect to the previously seen frame. This facilitates determining how to modify the effective window size during TTT and constitutes the core mechanism of an adaptive windowing algorithm. Additionally, we curate a dataset, EpicTours, containing up to 3-hour-long videos of walking city tours, whereas earlier datasets on this problem were only 5 minutes long. We demonstrate FFN's empirical effectiveness on dense segmentation, video classification tasks, generalization to depth estimation, and multi-hour-long videos.
[CV-90] Active Adversarial Perturbation-driven Associative Memory Retrieval for RGB-Event Visual Object Tracking
链接: https://arxiv.org/abs/2606.26455
作者: Xiao Wang,Xufeng Lou,Zikang Yan,Lan Chen,Sibao Chen,Yaowei Wang,Yonghong Tian,Jin Tang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:RGB-Event tracking improves localization robustness by fusing RGB appearance textures and dense temporal motion cues from event sensors. While this multi-modal scheme broadens tracking applicability, real-world scenes suffer diverse structured signal degradations that hinder traditional multi-modal fusion. In harsh environments, either modality can lose reliability drastically, and targets frequently appear incomplete due to occlusion, edge truncation and foreground [...]. To tackle the above challenges, we present a hierarchical perturbation and retrieval framework tailored for RGB-Event tracking with robustness against partial target missing and modal degradation, termed APRTrack. To mimic real-world signal corruption, APRTrack constructs structured degradation via two adversarial perturbation branches at the modality and spatial levels, which separately simulate full-modal failure and localized target region absence. A hierarchical routing mechanism is designed to disentangle the training pipelines of the two perturbation types, effectively eliminating feature collapse induced by superimposed degradation constraints. Furthermore, we devise Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) for reliable historical information compensation. This module evaluates retrieval confidence based on association footprints between queries and memory banks, and calibrates the retrieval metric space prior to Hopfield matching, realizing controllable historical feature compensation bounded to target regions. Extensive experiments on the FE108, COESOT, VisEvent, and FELT datasets demonstrate the effectiveness of our proposed strategies for RGB-Event visual object tracking. The source code and pre-trained models will be released on this https URL
[CV-91] WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation
链接: https://arxiv.org/abs/2606.26443
作者: Baiqi Li,Ce Zhang,Yu Fang,Yue Yang,Shangzhe Li,Mingyu Ding,Gedas Bertasius
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A robot working alongside people must reason about what they have done, in what order, and with what intent. Video carries the spatial layouts, object histories, and gestures that language leaves underspecified, yet today’s manipulation benchmarks pair an instruction with a single current image, offering no way to evaluate reasoning over observed human behavior. We introduce WatchAct, a benchmark for robot manipulation grounded in observed human behavior. Each instance pairs a real-world human-action video and a language instruction with an aligned simulator scene and an executable LIBERO task, enabling scalable and reproducible evaluation. WatchAct comprises 3,000 long-horizon instances across 14 tasks in four capability domains drawn from the cognitive demands of watching another agent: parsing events (Event Grounding), recovering procedural structure (Procedural Reasoning), inferring unstated intent (Implicit Intent Inference), and tracking how the scene was changed (Episodic Reasoning). We further propose a disentangled evaluation protocol that separately measures (i) video-to-plan reasoning by vision-language models, (ii) policy execution under oracle plans, and (iii) full task completion by integrated planner–policy pipelines. In both simulation and on a Franka Research 3 robot, current systems remain far from solving WatchAct. The best pipeline, Gemini-3.1-Pro with $\pi_{0.5}$, reaches only 16.3% Success Rate (SR) in simulation and 14.0% on the real robot. Gemini-3.1-Pro attains just 36.8% Plan SR (vs. 97.1% for humans), while $\pi_{0.5}$ reaches only 21.5% Task SR under oracle plans and drops to 10.6% on out-of-domain scenarios. Dataset and code are available at this https URL.
[CV-92] Rethinking Training Inference for Forecasting: Linking Winner-Take-All back to GMMs ECCV2026
链接: https://arxiv.org/abs/2606.26424
作者: Qiyuan Wu,Katie Z Luo,Bharath Hariharan,Wei-Lun Chao,Mark Campbell
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ECCV 2026
Abstract:Trajectory forecasting for autonomous driving has advanced rapidly, yet representative models often produce uninformative posteriors over forecast modes, causing problems for mode pruning. We trace this to a modeling-training mismatch: forecasters are typically modeled as conditional Gaussian mixture models (GMMs) but trained with a winner-take-all (WTA) loss that assigns each sample to its nearest mode. We argue that this K-means-like hard assignment (one-hot), while preventing mode collapse, is the source of uninformative mode probabilities: it over-segments the trajectory space, ignores relatedness among nearby modes, and yields assignment instability under small perturbations. Guided by this lens, we introduce two post-hoc treatments: (1) test-time posterior-weighted merging that aggregates nearby candidate trajectories; and (2) a one-step expectation-maximization (EM) update that replaces hard labels with soft responsibilities, sharing probability mass across neighboring modes. Across several WTA-trained architectures, these lightweight steps produce more informative, faithfully ranked mode posteriors and strengthen final forecasts on popular displacement metrics – without retraining. Our analysis unifies recent design choices through a GMM-vs-K-means perspective and offers principled, practical corrections that better align training objectives with inference.
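The contrast between winner-take-all hard assignment and GMM soft responsibilities can be sketched in a few lines. This is a generic E-step under stated assumptions (isotropic Gaussians with a shared `sigma`, trajectories flattened to plain vectors); the function names are hypothetical and the sketch is not the authors' post-hoc procedure.

```python
import math

def squared_distance(x, mean):
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def hard_assignment(x, means):
    """WTA / K-means style: one-hot label on the nearest mode."""
    nearest = min(range(len(means)), key=lambda k: squared_distance(x, means[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(means))]

def soft_responsibilities(x, means, priors, sigma):
    """GMM E-step: posterior probability of each mode given the sample x.

    The Gaussian normalizing constant is identical for every mode (shared sigma),
    so it cancels and only the prior and the exponent are needed.
    """
    log_w = [math.log(p) - 0.5 * squared_distance(x, m) / sigma**2
             for p, m in zip(priors, means)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]  # subtract max for numerical stability
    total = sum(w)
    return [v / total for v in w]

# Two nearby modes and one distant mode; the sample lies between the nearby pair.
means = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
priors = [1 / 3, 1 / 3, 1 / 3]
x = (0.4, 0.0)
hard = hard_assignment(x, means)
soft = soft_responsibilities(x, means, priors, sigma=1.0)
```

Here the hard label gives the second mode zero credit even though it is almost as close as the first, while the soft responsibilities split the mass roughly evenly between the two neighbors and leave the distant mode near zero.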
[CV-93] Methane-Plume Segmentation From Hyperspectral Satellite Imagery Via Multimodal Deep Learning
链接: https://arxiv.org/abs/2606.26416
作者: Brayan Quintero,Jeferson Acevedo,Samuel Traslaviña,Hoover Rueda-Chacón
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2026
Abstract:Efficient detection of methane plumes is crucial for understanding and mitigating global warming, as accurately identifying and segmenting them in Earth observation imagery remains essential for large-scale monitoring. In this work, we propose a multimodal deep learning model that integrates a feature-guided methane enhancement (FGME) mechanism which injects physically meaningful methane cues into transformer-based RGB representations at multiple semantic scales. Our method is evaluated on the MPDataset, where it outperforms the state-of-the-art with improvements of +0.92 in MIoU, +0.87 in MPrecision and +1.01 in Recall. Notably, these gains are obtained with a substantially lower computational cost than other high-performing architectures, resulting in a favorable accuracy-efficiency trade-off for large-scale methane monitoring. These results highlight the potential of efficient multimodal fusion strategies for accurate and scalable methane plume segmentation in real-world remote sensing applications.
[CV-94] Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection
链接: https://arxiv.org/abs/2606.26410
作者: Zican Wang,Niloy Mitra
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a self-supervised framework for learning implicit 3D physical dynamics directly from video-derived supervisory signals. While current generative video models achieve high visual fidelity, they lack a 3D geometric foundation, often resulting in physical inconsistencies and a failure to maintain object permanence. We address this by shifting the predictive bottleneck from 2D image space to a 'lifted' 3D Volumetric Latent Space. Our method unprojects semantic features from a Video Joint-Embedding Predictive Architecture (V-JEPA) into a voxelized grid, grounded by monocular depth priors. This lifting enables a Volumetric Feature Advection to learn an action-conditioned transition operator that treats physics as a spatio-temporal state advection problem, i.e., to learn implicit 3D physics. Unlike state-of-the-art hybrid models that rely on explicit classical simulators for training and/or inference, our architecture tracks material states implicitly within high-dimensional V-JEPA features. This allows for the emergent simulation of heterogeneous phenomena (e.g., rigid body motion in fluid flow) within a single, unified pipeline. Supervised solely via end-to-end video-derived signal plus action conditions, without access to physics engine internal states, labels, or surrogate models, our model demonstrates good long-term structural stability and physical plausibility on multiple benchmarks (CLEVRER, PhysInOne, PhysGaia). We believe that this work opens a scalable pathway toward general-purpose dynamic world models that internalize the 3D invariants of the physical world solely through passive observation of monocular videos.
[CV-95] DinoLink: A Token-Centric Representation Compression Framework for Bandwidth-Constrained Collaborative V2X Perception
链接: https://arxiv.org/abs/2606.26398
作者: Tianle Zhu,Haohua Que,Handong Yao,Hongyi Xu,Zhipeng Bao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-precision remote perception is often hindered by the severe bandwidth constraints of Vehicle-to-Everything (V2X) networks. We propose DinoLink, a token-centric compression framework that replaces raw pixel streaming with discrete semantic communication for vehicle-cloud collaborative inference. DinoLink employs a dual-sparsity architecture: a saliency-aware selector prunes redundant background tokens, while a Residual Vector Quantization (RVQ) module collapses features into compact codebook indices. By transmitting only lightweight indices and positional priors, DinoLink achieves a 139× bitrate reduction compared to uncompressed transmission while maintaining a competitive 32.8% mAP on the nuScenes dataset. Deployment simulations further demonstrate a 34.5× acceleration in narrow-band environments, such as LoRa. Our results substantiate DinoLink as a robust, bandwidth-efficient frontend for high-fidelity remote perception in constrained V2X scenarios. The code is publicly available at this https URL.
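Residual Vector Quantization, which the abstract names as the step that turns features into codebook indices, can be sketched generically as follows. The tiny hand-written codebooks and the helper names are assumptions for illustration; DinoLink's learned codebooks, dimensions, and token selection are not reproduced here.

```python
def nearest_index(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks: a coarse stage followed by a fine correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
feature = [1.2, 0.8]
indices = rvq_encode(feature, codebooks)
reconstruction = rvq_decode(indices, codebooks)
```

Only the integer indices need to be transmitted; the receiver holds the same codebooks and rebuilds an approximation of the feature, which is where the bitrate saving comes from.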
[CV-96] What Do Deepfake Benchmarks Measure? An Audit Using Frozen Self-Supervised Representations
链接: https://arxiv.org/abs/2606.26384
作者: Samuel Pagon,Yixuan Shen,Vishal Asnani,Feng Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures
Abstract:As deepfake generators approach perceptual indistinguishability, reliable detection becomes critical. Yet, detectors that score well on benchmarks routinely fail in the wild. A concerning feedback loop has emerged: benchmarks drive increasingly complex, engineered detectors, yet if those benchmarks do not reflect real-world deepfakes, this complexity may be solving the wrong problem entirely. This raises a prior question: what are these benchmarks actually measuring? We conduct an audit of video, image, and audio deepfake benchmarks using a deliberately simple diagnostic. If a linear probe on frozen, general-purpose self-supervised representations can approximate the performance of a bespoke detector, the benchmark is largely rewarding general modality understanding rather than forensic understanding. This has two implications: the benchmark may not reflect realistic threat models, and it raises the question of whether the bespoke detectors the probe approaches are truly learning forensic understanding. We observe that, across three modalities, linear probes on general-purpose self-supervised representations closely approach the performance of bespoke detectors. We further show that generator-level difficulty is partly explained by Fréchet geometry in the same representation space. Together, these results support a benchmark-audit view of deepfake detection: before high scores are read as evidence of forensic understanding, it is worth asking how much of the benchmark is already solved by general-purpose representations.
[CV-97] Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models ECCV2026
链接: https://arxiv.org/abs/2606.26379
作者: Xi Xiao,Xingjian Li,Yunbei Zhang,Cheng Han,Tianming Liu,Tianyang Wang,Runmin Jiang,Jihun Hamm,Xiao Wang,Min Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Visual prompt tuning has emerged as a parameter-efficient fine-tuning approach for adapting large-scale Vision Transformers (ViTs) to downstream tasks. As its learnable prompts are applied in input and feature spaces before jointly passing through attention in transformer layers, the most commonly used schemes for fusing image and prompt tokens are concatenation and addition. In this paper, we study a fundamental problem in visual prompt tuning: whether a single fusion scheme tends to yield better results, and whether it would be beneficial to develop a hybrid fusion scheme. To this end, we formulate the task as a bi-level optimization problem and solve it with differentiable architecture search. In this context, the learnable prompts and their fusion schemes are jointly optimized. To enrich the search space of the architecture search, we propose two additional fusion schemes, namely affine transformation and cross-attention, in addition to concatenation and addition. Extensive experiments on 34 datasets spanning VTAB-1k, FGVC, and HTA show consistent gains over prompt-tuning baselines. With a frozen ViT backbone, our method delivers a favorable accuracy–latency–parameter trade-off compared with VPT-Deep and recent variants. Our findings reveal that how prompts fuse with image tokens plays a significant role in visual prompt tuning, and that a hybrid fusion scheme can more effectively leverage the layer semantics of ViTs, contributing a novel perspective for visual prompt-tuning research.
[CV-98] Beyond Aesthetics: Quantifying Information Loss in Turbid Scenes
链接: https://arxiv.org/abs/2606.26295
作者: Vasiliki Ismiroglou,Stefan H. Bengtson,Tasos Benos,Thomas B. Moeslund,Malte Pedersen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visibility in underwater environments degrades rapidly under turbid conditions, yet the effects on computer-vision models remain unclear. This issue is compounded by reliance on synthetic turbidity datasets, which may misrepresent real-world information loss. To address this gap, we introduce the Turbid Underwater Baseline (TUB) dataset, comprising 1,320 images captured under extreme turbidity and over 16,000 high-confidence ground-truth segmentation masks. We additionally propose PCD, a metric derived from phase congruency maps that is invariant to contrast and aims to capture the loss of structural information in real turbidity. We show that PCD correlates strongly with the performance of instance segmentation models on both real and synthetic turbid images, whereas common metrics in the field show weak to no correlation at all. The dataset and relevant code can be found on the project page: this https URL
[CV-99] GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models
链接: https://arxiv.org/abs/2606.26287
作者: Chaoxiang Cai,Minghe Weng,Jie Li,Yibo Jiang,Longrong Yang,Zequn Qin,Xi Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the increase in model parameters and training data, the instruction following and generalization capabilities of Large Vision-Language Models (LVLMs) have been significantly improved. Based on the Mixture of Experts (MoE) architecture, LVLMs expand their parameter capacity while maintaining the inference cost. However, traditional MoE methods employ a static Top-k routing strategy, which fails to account for variations in the input and cannot adaptively select the number of experts, resulting in suboptimal resource utilization. In this paper, we propose viewing token routing as an information encoding task, framing dynamic routing as a Minimum Description Length (MDL) problem in encoding. By validating the connection between MDL and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE) for MoE. Unlike traditional static or heuristic-based dynamic routing methods, GeMoE explicitly models the trade-off between model complexity and performance. By using gating entropy to assess the complexity of tokens, GeMoE adaptively determines the number of experts each token should engage. On a wide range of backbones and benchmarks, our method achieves 99.5% average performance retention compared to the original static routing, while improving average expert activation sparsity by 36.5%.
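The idea of letting gating entropy decide how many experts a token uses can be sketched as below. The mapping from entropy to expert count (rounding the perplexity `exp(H)` and clamping it to `[k_min, k_max]`) is a hypothetical rule chosen for illustration; the abstract does not give GeMoE's actual rule, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gating_entropy(probs):
    """Shannon entropy (in nats) of the router's distribution over experts."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_top_k(router_logits, k_min=1, k_max=4):
    """Pick more experts for uncertain (high-entropy) tokens, fewer for confident ones.

    Assumption: exp(entropy) is used as an 'effective number of experts'.
    """
    probs = softmax(router_logits)
    k = round(math.exp(gating_entropy(probs)))
    k = max(k_min, min(k_max, k))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return k, ranked[:k]

# A confident token routes to a single expert; an ambiguous token uses all four.
confident_k, confident_experts = adaptive_top_k([10.0, 0.0, 0.0, 0.0])
uncertain_k, uncertain_experts = adaptive_top_k([0.0, 0.0, 0.0, 0.0])
```

A static Top-k router would spend the same number of experts on both tokens; the entropy-based rule spends compute only where the router is unsure.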
[CV-100] Beyond Single-Source Cognitive Taskonomy: Multi-Source Task Relations through fMRI Transfer Learning
链接: https://arxiv.org/abs/2606.26279
作者: Junfeng Xia,Wendu Li,Mengjiao Zhang,Jie Guo
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Cognitive tasks are organized by shared and specialized neural processes. Masked fMRI reconstruction provides a common self-supervised objective for quantifying transfer relations among task states, but existing reconstruction-based taskonomies mainly study one-to-one transfer from a single source task to a target. Here, we extend an fMRI cognitive taskonomy from single-source to multi-source transfer across 23 Human Connectome Project task states and use Boolean Integer Programming (BIP) to analyze budget-constrained task allocation. We train 1,127 task-specific and transfer models. Single-source transfer is directional and paradigm structured: motor states transfer well within the motor paradigm but provide limited support to most non-motor targets, consistent with a shared sensorimotor execution system and effector-specific representations. Multi-source transfer depends on the composition of the source set, suggesting that many-to-one task relations are not fully captured by pairwise taskonomy alone. Across supervision budgets, BIP repeatedly allocates direct supervision to several 0-back and 2-back working-memory states, although these states are not consistently the strongest individual sources. This pattern may reflect the integration of perceptual, attentional, and executive processes in working-memory tasks. Together, these findings reveal a cross-paradigm-limited motor cluster and working-memory states with high priority under the specified global allocation objective. Our study extends reconstruction-based fMRI taskonomy from one-to-one transfer to many-to-one task relations and budget-constrained task dependencies.
[CV-101] A multi-task spatiotemporal deep neural network for predicting penetration depth and morphology in laser welding
Link: https://arxiv.org/abs/2606.26260
Authors: Sen Li,Haichao Cui,Chendong Shao,Yaqi Wang,Xinhua Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In laser penetration welding, assessing the penetration state and weld seam morphology is crucial for determining weld quality. This paper presents a multi-task deep learning model that predicts penetration state, penetration depth, and weld seam morphology with high accuracy. The monitoring platform relies on weld pool images captured during the laser welding process using a complementary metal-oxide-semiconductor camera. The proposed model integrates spatiotemporal features extracted from top weld pool images along with welding parameters, establishing a deep learning framework based on convolutional neural networks and state space models for more efficient extraction and processing of spatial-temporal information. Furthermore, a reliable method for constructing the dataset is proposed to enhance both robustness and generalization capability of the developed model. Validation results on the test set demonstrate that prediction accuracy for penetration state can reach 99.35%, while prediction error for penetration depth is 1.79 millimeters, and accuracy of reconstructing the weld cross-section is 95.65%. This study provides new insights and methodologies for in-situ quality control strategies in laser penetration welding systems.
[CV-102] Fast LeWorldModel
Link: https://arxiv.org/abs/2606.26217
Authors: Yuntian Gao,Xiangyu Xu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Joint-Embedding Predictive Architectures (JEPAs), including recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model. This autoregressive rollout makes planning computationally expensive and exposes the predicted trajectory to accumulated latent errors as the horizon grows. We propose Fast LeWorldModel (Fast-LeWM), a fast latent world model that replaces repeated local rollout with action-prefix prediction. Given the current latent and a candidate action sequence, Fast-LeWM encodes its prefixes and predicts the future latents reached after executing those prefixes in parallel. By making action prefixes the basic prediction unit, Fast-LeWM directly models action effects accumulated to different extents over multiple horizons. This prefix-level supervision forces the model to learn how states continuously evolve under different action prefixes, rather than only fitting one-step state transitions. During planning, the predictor can use the last prefix token from the encoded action sequence to evaluate the corresponding future latent without explicitly rolling through each intermediate imagined state. Across multiple tasks, Fast-LeWM improves average success over LeWM while substantially reducing planning time, achieving lower open-loop latent loss whose growth becomes significantly slower as the rollout horizon increases.
[CV-103] TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes
Link: https://arxiv.org/abs/2606.26215
Authors: Blake Werner,Ilona Demler,Pietro Perona,Aaron D. Ames
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:How do we learn to hit a tennis backhand? Not from a thousand hours of tennis tournaments on TV - we work with a coach and practice. We argue this is also the right recipe for teaching dynamic skills to humanoid robots. This follows from a structural property of dynamic skills: the outcome is decided by a short, crucial portion of the trajectory - for a backhand, the ~20cm of racket travel around ball contact. Getting this interaction window right requires coordinating the whole motion, so that control, physics, and morphology act in concert. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the window comes out right. To this end, we introduce TaskNPoint, a training protocol which makes the coach-learner division of labor explicit. The human coach contributes four inputs: a discrete set of skills (e.g. different shots), one demonstration per skill, identification of the interaction window, and the goal. Learning in a physically realistic simulation environment fills in each action trajectory and provides robustness to unmodeled events. Crucially, randomized target sampling during training lets a single demonstration generalize zero-shot to unseen goal locations. We test this approach on a Unitree G1 humanoid that hits forehands and backhands against balls thrown by a human, kicks incoming soccer balls, and picks and places boxes from novel locations. We find that learning is successful from short human video demonstrations and under an hour of training on a single GPU, with no per-task reward tuning.
[CV-104] Self-Supervised Tree-level Biomass Estimation in Urban Environments From Airborne LiDAR and Optical Observations
Link: https://arxiv.org/abs/2606.26194
Authors: Jose Bermudez(1),Zilong Zhong(1),Dominic Cyr(2),Camile Sothe(3),Alemu Gonsamo(1) ((1) McMaster University, Hamilton, Ontario, Canada, (2) Environment and Climate Change Canada, Montreal, Quebec, Canada, (3) Planet Labs PBC, San Francisco, California, USA)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Urban tree biomass remains less spatially explicitly quantified than biomass in managed forests because many estimates rely on inventories or coarse products that cannot resolve individual crowns or fine-scale heterogeneity. We present a crown-level above-ground biomass (AGB) framework for an 810 km$^2$ landscape in Ontario, Canada, using leaf-off airborne LiDAR (8–10 pulses m$^{-2}$) and near-infrared RGB orthophotography (0.16–0.20 m) from 2018 and 2023. A dual-stream cross-attention network trained on rule-based pseudo-labels produced semantic marks for buildings, needleleaf trees, and deciduous trees, supporting crown delineation and functional-type assignment. On independently annotated withheld tiles, global/mean precision, recall, and Dice scores were 0.86, 0.83, and 0.84. Crowns were delineated with multiscale watershed segmentation in mapped tree areas, and AGB was estimated from a crown area–height power-law proxy calibrated to species-specific allometry (Lambert et al., 2005) for 21,921 inventory trees. For 18,713 inventory–segment matched pairs from a 90,726-tree held-out test set, AGB prediction achieved $R^2=0.609$ using inventory crown geometry and $R^2=0.570$ under operational segmentation, identifying crown delineation as the remaining uncertainty source. Aggregated to 30 m, estimates yielded total AGB stocks of 1.73 Tg in 2018 and 1.81 Tg in 2023 (811–850 Gg C), local densities up to $\sim$140 Mg ha$^{-1}$ along the Niagara Escarpment, and a net carbon gain of 39 Gg C over five years. Deep-ensemble uncertainty maps highlighted high-epistemic-uncertainty areas linked to underrepresented land covers and guided assignment of uncertain crowns to a pooled allometric equation. The framework uses standard provincial data, requires no manual annotation, and produces a public bitemporal crown-level AGB database for trees outside forests at management-relevant resolution.
[CV-105] LCG: Long-Context Consistent Image Generation with Sparse Relational Attention
Link: https://arxiv.org/abs/2606.26171
Authors: Zihao Wang,Yijia Xu,Haoze Zheng,Xuran Ma,Haokun Gui,Harry Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent image generation models achieve impressive quality in single-image synthesis, but often fail to maintain consistency across sequential outputs, as required in comics, storyboards, and visual narratives. We propose Long-Context Generation (LCG), a framework for long-context multi-image text-to-image generation that improves consistency and scalability. LCG employs the Sparse Relational Attention (SRA) mechanism to selectively attend to core features across extended visual contexts, ensuring that the propagation of semantic and layout information remains computationally tractable. To enforce semantic alignment, we introduce the Routing Consistency Constraint (RCC), which leverages identity-aware masks to align structural patterns across generation branches, effectively mitigating drift in appearance even in complex multi-character scenes. To support training and evaluation in this setting, we construct the Long-Context Consistency Dataset (LCCD), a large-scale synthetic dataset comprising character-centric multi-image sequences spanning varied situational contexts. LCCD contains 600K training sequences and a separate 1K test set, with each sequence containing 6 to 20 images. The experiments demonstrate that LCG outperforms the compared baselines in prompt alignment and character consistency for long-context image generation, including multi-character scenes.
[CV-106] Predicting Fruit Quality with a Hybrid Machine Learning and Image Processing Approach
Link: https://arxiv.org/abs/2606.26165
Authors: Amir Reza Hashemi,Shahram Amiri
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 13 figures, 2 tables
Abstract:Fruit spoilage is a significant issue in agriculture, leading to substantial economic losses. Addressing this, our study introduces a hybrid approach combining image processing and deep learning to assess fruit freshness. We developed an image processing algorithm that quantifies spoilage on a scale from 0 (fully fresh) to 100 (fully rotten). Alongside, we trained a convolutional neural network (CNN) to perform binary classification (fresh or rotten) using a large dataset of fruit images. The outcomes of both methods were synthesized using logistic regression to enhance the accuracy of freshness predictions. Subsequently, this logistic regression model was utilized to enable the image processing algorithm to provide binary classification based on its percentage output, thus eliminating the need for the CNN in real-time applications. Our approach, which does not require high computational resources, achieved real-time performance and was validated with over 90% accuracy on a dataset comprising apples and oranges. The primary limitation lies in the requirement for fruits to be isolated on a background that must be either white or transparent, suggesting future improvements could include advanced segmentation models to automate background removal. This study’s results highlight the potential of integrating simple image processing techniques with machine learning to provide practical solutions in the agricultural sector.
[CV-107] DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents
Link: https://arxiv.org/abs/2606.26122
Authors: Jiamian Wang,Ruiyi Zhang,Tong Yu,Jing Shi,Samyadeep Basu,Rajiv Jain,Zhiqiang Tao,Tong Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: search agent for documents
Abstract:Recent methods train search agents via reinforcement learning from (question, answer, evidence) tuples without requiring expert trajectories. The tuples serve as the training environment, whose properties directly shape what search strategies and generalization abilities the agent can develop. While prior works have made encouraging progress in improving training data quality, existing environments remain predominantly text-based, and existing approaches can struggle to construct training environments that are controllable, scalable, and account for multimodal data. Given this, we propose DocArena, a fully automated data curation pipeline building on the practical need for multimodal document search and question-answering. It transforms raw document collections into training environments for search agents without any human annotation. The pipeline first structures and indexes documents through MLLM-based visual perception, then profiles and leverages the cross-page information distribution to construct reasoning-intensive QA pairs, and finally performs cascaded quality assurance operations via MLLM. We introduce DocArena-79K with QA pairs from 8,336 documents spanning 16 domains and 49 languages. We further design a Doc-Search agent infrastructure that decouples visual perception from the policy model, allowing text-based LLMs to serve as the reasoning backbone for multimodal document retrieval and QA. Under a unified evaluation framework where only the policy model differs, experiments on six multimodal document scenarios and seven text-based QA benchmarks show that agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. Further analysis on agent search behaviors confirms the effectiveness and controllability of the constructed training environment.
[CV-108] Dot-Flik: A Scalable Edge AI Architecture for Distributed Insect Monitoring
Link: https://arxiv.org/abs/2606.26121
Authors: Mattia Consani,Denisa-Andreea Constantinescu,Åse Håtveit,Titus Venverloo,Fabio Duarte,Carlo Ratti,David Atienza
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Global insect population declines necessitate scalable, continuous monitoring systems, yet existing vision-based solutions remain constrained by high hardware costs, energy demands, and reliance on centralized processing or cloud connectivity. This article presents three contributions to address these limitations. First, we propose a motion-informed frame filtering algorithm based on temporal differencing, gamma-corrected motion amplification, and block-based motion density analysis that discards irrelevant frames at the edge while preserving insect activity, without requiring deep learning inference on the sensing device. Second, we introduce a distributed, hierarchical IoT architecture that decouples data acquisition from AI classification through this edge-level preprocessing, projecting fractional scaling of central processing requirements and significantly increasing monitoring coverage compared to monolithic single-stream approaches. Third, we validate the complete system through real-world outdoor deployments on low-cost commodity hardware along four axes: real-time performance, network scalability, hardware cost, and energy efficiency under varying wind conditions. Results demonstrate 60-80% frame reduction under light-wind conditions, sustained real-time 30 FPS operation with 12.8 ms of computational headroom, up to 22.6% energy savings, and support for 5-6 concurrent edge streams per central node. These findings establish a practical foundation for dense, low-cost biodiversity monitoring networks in urban environments.
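The frame-filtering idea in this abstract (temporal differencing, gamma-corrected motion amplification, block-based motion density) can be sketched roughly as follows. This is a minimal illustration and not the authors' algorithm: the function names, every threshold, and the rule that rejects frames where most blocks move at once (a guess at how wind-driven global motion might be separated from localized insect motion) are assumptions made for the example.

```python
def motion_map(prev, curr, gamma=0.5):
    """Temporal difference of two grayscale frames (values 0..255),
    normalized to [0, 1] and gamma-corrected; gamma < 1 amplifies weak motion."""
    return [
        [(abs(c - p) / 255.0) ** gamma for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def block_density(motion, block=2, pixel_thresh=0.3):
    """Fraction of 'moving' pixels inside each block of the motion map."""
    height, width = len(motion), len(motion[0])
    densities = []
    for by in range(0, height, block):
        for bx in range(0, width, block):
            cells = [
                motion[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ]
            densities.append(sum(v > pixel_thresh for v in cells) / len(cells))
    return densities


def keep_frame(prev, curr, gamma=0.5, block=2, pixel_thresh=0.3,
               min_density=0.5, max_active_fraction=0.5):
    """Keep a frame when some, but not most, blocks show dense motion.
    The 'not most' rule is a hypothetical stand-in for wind rejection."""
    densities = block_density(motion_map(prev, curr, gamma), block, pixel_thresh)
    active = sum(d >= min_density for d in densities)
    return 0 < active <= max_active_fraction * len(densities)
```

With these assumed thresholds, a static frame and a frame in which every block changes are both discarded, while a frame with one small moving region is kept. No learned model runs on the sensing device, which matches the abstract's stated goal.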
[CV-109] Dual-Prior Guided Null-Space Learning with Mixture-of-Splines for Arbitrary Medical Slice Super-Resolution ECCV2026
Link: https://arxiv.org/abs/2606.26716
Authors: Haofei Song,Siyuan Xu,Xintian Mao,Shaojie Guo,Qingli Li,Yan Wang
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026! Project page: this https URL
Abstract:Arbitrary slice super-resolution reconstructs isotropic volumes from anisotropic clinical acquisitions by synthesizing intermediate slices at arbitrary scales. However, treating this ill-posed inverse problem as unconstrained residual-based regression risks hallucinating anatomically implausible structures or altering the originally observed data. To address both concerns, this paper presents the Dual-Prior Null-space Learning (DP-NSL) framework, which reformulates the task as a constrained recovery process guided by two complementary priors. A Measurement-Consistent Projection (MCP) enforces a Deterministic Observation Prior: the reconstruction undergoes an exact orthogonal projection that reproduces every acquired slice with zero error, confining all learned details to the unobservable null space. Within this null space, a Mixture-of-Splines (MoS) module imposes a Geometric Continuity Prior by dynamically mixing B-spline experts of different analytic orders, allowing each anatomical region to be modeled with a content-aware level of continuity. To promote spatial coherence, a Local Spatial Consistency Decoder (LSCD) further injects local inductive bias. Experiments on three CT and one MRI benchmark show that DP-NSL outperforms existing approaches while strictly preserving measurement consistency. Code is available at this https URL.
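The measurement-consistency idea can be illustrated in a simplified setting. The sketch below assumes the acquisition operator A merely selects a subset of slice positions, so that the orthogonal projection x_hat - A^T (A x_hat - y) reduces to overwriting the acquired positions with their measured values. That selection-operator assumption, the function name, and the use of one number per slice are all illustrative choices and are not taken from the paper.

```python
def measurement_consistent_projection(pred, observed):
    """Project a predicted slice stack onto the set {x : A x = y}.

    pred     : list of predicted values, one per slice position (x_hat)
    observed : dict {slice index: acquired value}, i.e. y for a selection operator A

    For a selection operator, x_hat - A^T (A x_hat - y) leaves unobserved
    positions untouched (the null space of A) and sets observed positions to y.
    """
    out = list(pred)
    for idx, y in observed.items():
        # Equivalent to pred[idx] - (pred[idx] - y) in exact arithmetic;
        # assigning y directly avoids floating-point round-off.
        out[idx] = y
    return out
```

Under this assumption, anything the network predicts at unobserved positions passes through unchanged, while acquired slices are reproduced exactly. This mirrors the abstract's statement that learned details are confined to the unobservable null space.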
[CV-110] MLFFM-SegDiff: A Multi-Level Feature Fusion Diffusion Model for Skin Lesion Segmentation
Link: https://arxiv.org/abs/2606.26712
Authors: Jingjun Gu,Chaojie Shen,Yifeng Cao,Wei Zhang,Yiliu Li,Aobo Fan
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Skin lesion segmentation is a key task in computer-aided dermatological diagnosis, where accuracy directly impacts downstream analysis and disease classification. However, dermoscopic images are challenging due to blurred boundaries, low contrast, large shape variations, and artifacts such as hair and shadows. Recently, diffusion models have shown strong performance in medical image segmentation thanks to their progressive denoising and distribution modeling capabilities. Nevertheless, existing diffusion-based methods still suffer from limited cross-level feature interaction and insufficient boundary detail recovery. To address these issues, we propose MLFFM-SegDiff, a multi-level feature fusion diffusion model for skin lesion segmentation. Built on a diffusion framework, the method introduces a dual-path U-Net encoder, a Multi-Level Feature Fusion Module (MLFFM), and a boundary-sensitive loss function. The dual-path encoder enhances interaction between noisy mask features and dermoscopic image features. MLFFM improves skip connections via attention, scale alignment, and adaptive cross-level fusion. These designs enable the decoder to jointly leverage shallow boundary cues and deep semantic representations, improving mask reconstruction quality. Experiments on ISIC2018, PH2, and HAM10000 demonstrate that MLFFM-SegDiff outperforms representative methods including DermoSegDiff, U-Net, and SwinUNETR across Accuracy, F1-score, Jaccard index, Recall, and Dice. In particular, it achieves an average Jaccard index of 0.8546 and Dice coefficient of 0.9207. These results validate the effectiveness of the proposed multi-level feature fusion strategy for improving lesion segmentation performance. The code will be released at this https URL after publication. 
[CV-111] Revealing Mammographic Phenotypes in Deep Learning Breast Cancer Risk Models
Link: https://arxiv.org/abs/2606.26431
Authors: Ruiyu Jia,Yanqi Xu,Yuxuan Chen,Yiqiu Shen,Laura Heacock
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mammogram-based deep learning models have improved breast cancer risk prediction, but the learned imaging patterns remain underexplored. Existing interpretability methods rely on single-image saliency maps, failing to identify recurring mammographic phenotypes across large patient cohorts. By clustering patch embeddings from a pre-trained model, Mirai, we isolate recurring phenotypes linked to 5-year cancer risk. Analyses show risk-increasing phenotypes capture complex structures (e.g., dense tissue, microcalcifications) and shortcut artifacts (e.g., clips). These phenotypes correlate strongly with older age and higher BI-RADS density. Our framework connects tissue patterns to AI risk scores, revealing clinical signatures and potential latent model confounders.
[CV-112] Tailor Made Embeddings for Quantum Machine Learning
Link: https://arxiv.org/abs/2606.26312
Authors: Aldo Lamarre,Dominik Šafránek
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 17 figures
Abstract:Autoencoders transformed classical machine learning by solving the curse of dimensionality, enabling principled weight initialization and learning compact, structured representations. In this work, we extend this paradigm to quantum machine learning by introducing a variational autoencoder framework that learns task-specific quantum embeddings of classical data. We demonstrate that high-dimensional datasets, including ImageNet, can be compressed into a 13-qubit quantum representation while remaining reconstructable through a learned decoder. On MNIST (3 vs 5), our approach achieves 98.5% validation accuracy using a circuit-centric quantum classifier, within 1.2 percentage points of a classical neural network baseline (99.7%) and more than 30 percentage points above a naive amplitude-embedding approach. Unlike amplitude embeddings, which require full quantum state tomography for recovery, or angle embeddings, which generally rely on circuit inversion under restrictive assumptions, the proposed framework reconstructs the original data from only a polynomial number of measurements. The framework was further validated on IBM quantum hardware, confirming that the learned embeddings remain stable and reconstructable under real device noise.
[CV-113] Rendering Novel Views of MRI Using 3D Gaussian Splatting
Link: https://arxiv.org/abs/2606.26236
Authors: Robin Y. Park,Mark C. Eid,Rhydian Windsor,Amir Jamaludin,Ana I.L. Namburete,João F. Henriques,Andrew Zisserman
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:The objective of this paper is to improve radiological gradings measured on MRIs of spines, by resampling scans so that the new view planes are better aligned with the target anatomy than the original sparse images. To this end, we adapt 3D Gaussian Splatting to form a volumetric reconstruction starting from sparse anisotropic MRIs, and imaging planes aligned with the anatomy relevant for clinical evaluation are then sampled and rendered. The novel view plane is optimal for diagnostic radiological grading of the target anatomy, whereas the original MRI is not. The resampled scans are then used to predict ordinal severity grades of localised stenosis conditions in spinal MRIs. We compare our method against Voxel Interpolation resampling, which takes the average of inverse-distance weighted nearest neighbour intensities for each target coordinate. Experiments show that across all stenosis conditions, resampled scans using Gaussian Splatting produce more accurate stenosis gradings compared to the raw scans which do not include the complete anatomy in-plane, as well as images resampled using Voxel Interpolation.
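The Voxel Interpolation baseline is described as averaging inverse-distance weighted nearest-neighbour intensities for each target coordinate. A minimal sketch of that kind of resampling is given below. The number of neighbours `k`, the 1/d weighting exponent, and the function name are assumptions, since the abstract does not specify them.

```python
from math import dist


def idw_interpolate(points, values, target, k=4, eps=1e-12):
    """Inverse-distance weighted average of the k nearest sample intensities.

    points : list of (x, y, z) voxel-centre coordinates
    values : list of intensities, aligned with points
    target : (x, y, z) coordinate on the new view plane
    """
    nearest = sorted(zip(points, values), key=lambda pv: dist(pv[0], target))[:k]
    weighted = []
    for point, value in nearest:
        d = dist(point, target)
        if d < eps:
            # Target coincides with a sample: return its intensity directly.
            return value
        weighted.append((1.0 / d, value))
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total
```

A new imaging plane would be rendered by calling `idw_interpolate` once per pixel of that plane. The Gaussian Splatting approach in the abstract replaces this purely local averaging with a fitted volumetric reconstruction.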
Artificial Intelligence
[AI-0] Autoregressive Boltzmann Generators ICML2026
Link: https://arxiv.org/abs/2606.27361
Authors: Danyal Rehman,Charlie B. Tan,Yoshua Bengio,Avishek Joey Bose,Alexander Tong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026 (Spotlight)
Abstract:Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on normalizing flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG) – a novel autoregressive modelling framework – that overcomes these limitations by departing from the flow-based BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132 million parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error, $E$-$W_2$, on 8-residue systems by over 60%. The code can be found at the following link: this https URL.
[AI-1] Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching
Link: https://arxiv.org/abs/2606.27342
Authors: Nicholas Pulsone,Gregory Goren,Roee Shraga
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Entity Matching (EM) is a core operation in the data integration pipeline, where records from different sources are compared to determine whether they refer to the same real-world entity. Recent work has incorporated domain information and low-resource learning techniques to better adapt EM systems to realistic settings. While these approaches have demonstrated strong performance, it remains unclear how they behave under varying data constraints and levels of supervision in practice. In this paper, we investigate BEACON, a state-of-the-art method for low-resource, domain-aware EM, and study how its performance is affected by different algorithmic choices and data availability conditions. We conduct a series of targeted experiments to evaluate these variations, providing deeper insight into the role of distribution alignment and the behavior of the BEACON framework.
[AI-2] Language-Based Digital Twins for Elderly Cognitive Assistance
Link: https://arxiv.org/abs/2606.27334
Authors: Mohammad Mehdi Hosseini,Mohammad H. Mahoor,Hiroko H. Dodge
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted and published in the Proceedings of the ACM International Conference on PErvasive Technologies Related to Assistive Environments (PETRA 2026). The final published version is available through the ACM Digital Library
Abstract:Digital twins have emerged as a promising paradigm for personalized healthcare, enabling modeling of individual behavior and health trajectories. In cognitive health, early detection of Mild Cognitive Impairment (MCI) remains challenging, where language and conversational patterns serve as non-invasive biomarkers. In this work, we propose a language-based digital twin framework that leverages large language models (LLMs) to mimic the conversational behavior of elderly individuals by incorporating stylometric cues and contextual metadata. To evaluate fidelity and cognitive consistency, we introduce a multi-head conditional variational autoencoder (cVAE) that jointly measures reconstruction quality and predicts cognitive scores. Experiments on the I-CONECT dataset show that the digital twin preserves identity-specific characteristics and achieves reconstruction and MoCA prediction errors comparable to real data, while outperforming baseline GPT-generated responses. These results highlight the potential of language-based digital twins as a scalable and non-invasive approach for personalized and continuous cognitive health monitoring.
[AI-3] Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders
Link: https://arxiv.org/abs/2606.27321
Authors: Nathanaël Jacquier,Maria Vakalopoulou,Mahdi S. Hosseini
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$ SAE, a now-standard variant, enforces sparsity architecturally through its activation function, retaining only the $k$ most active latents per input. Because it was designed precisely to avoid the $\ell_1$ penalty used by earlier SAEs and its known drawbacks, it has not been combined with an explicit sparsity regularizer, despite retaining limitations of its own, such as a budget $k$ that is fixed regardless of input complexity and a tendency to overfit to the training value of $k$. We introduce two sparsity regularizers compatible with the Top-$k$ architecture, both acting on the activations before the Top-$k$ selection: an $\ell_1$ penalty on the unselected (off-support) units, and a scale-invariant $\ell_1/\ell_2$-ratio penalty that concentrates the code onto fewer effective units. Both penalties are applied only to the batch-active units, those selected by the Top-$k$ operator at least once within the batch. Across two datasets, three vision foundation models, and a range of $k$, both regularizers consistently improve monosemanticity at no cost to reconstruction quality. The $\ell_1/\ell_2$ penalty further concentrates information into fewer latents, making reconstruction more robust to the inference-time choice of $k$ and improving small-budget linear probing. Our central finding is that hard architectural sparsity and soft sparsity regularization are complementary rather than mutually exclusive.
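One possible reading of the two regularizers is sketched below in plain Python. The abstract does not give exact formulas, so the per-sample definition of the off-support set, the averaging over the batch, and all function names are assumptions for illustration only.

```python
from math import sqrt


def topk_support(pre_acts, k):
    """Indices of the k largest pre-activations for one input."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    return set(order[:k])


def batch_active_units(batch, k):
    """Units selected by Top-k for at least one input in the batch."""
    active = set()
    for row in batch:
        active |= topk_support(row, k)
    return active


def off_support_l1(batch, k):
    """Mean l1 norm of batch-active units that Top-k did not select for that input."""
    active = batch_active_units(batch, k)
    total = 0.0
    for row in batch:
        support = topk_support(row, k)
        total += sum(abs(row[i]) for i in active if i not in support)
    return total / len(batch)


def l1_l2_ratio(batch, k):
    """Mean l1/l2 ratio over batch-active units; invariant to rescaling the code."""
    active = sorted(batch_active_units(batch, k))
    total = 0.0
    for row in batch:
        vals = [row[i] for i in active]
        l2 = sqrt(sum(v * v for v in vals))
        total += sum(abs(v) for v in vals) / l2 if l2 > 0 else 0.0
    return total / len(batch)
```

In training, either quantity would be added to the reconstruction loss with a small coefficient (a hypothetical hyperparameter here). The ratio is smallest when a single unit carries the whole code, which is one way to see why it concentrates information into fewer latents.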
[AI-4] When Does Combining Language Models Help? A Co-Failure Ceiling on Routing Voting and Mixture-of-Agents Across 67 Frontier Models
Link: https://arxiv.org/abs/2606.27288
Authors: Josef Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, about 2.5 times underpricing, with 90 percent CI 1.7 to 3.4 and k equals 17. The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.
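The ceiling described here can be computed from a per-query correctness table. The sketch below estimates the all-wrong rate beta, the resulting ceiling 1 - beta for any policy that returns one member model's answer, and a one-sided Clopper-Pearson lower bound on beta found by bisection on the exact binomial tail. The function names, the one-sided form of the bound, and the default `alpha` are assumptions and may differ from the paper's procedure.

```python
from math import comb


def all_wrong_count(correct):
    """correct[q][m] is True when model m answered query q correctly.
    Returns (number of queries every model got wrong, number of queries)."""
    return sum(1 for row in correct if not any(row)), len(correct)


def accuracy_ceiling(correct):
    """Empirical 1 - beta: best accuracy of any pick-one-model policy on this sample."""
    k, n = all_wrong_count(correct)
    return 1.0 - k / n


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_lower(k, n, alpha=0.05):
    """One-sided lower confidence bound on beta given k all-wrong queries out of n."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # P(X >= k | p) increases with p; find the p where it equals alpha.
        if 1.0 - binom_cdf(k - 1, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

With a lower bound `b` on beta, any router, vote, or cascade over the same pool is bounded above by `1 - b` at the chosen confidence level. Subtracting the best single model's accuracy from that value gives the largest gain combining could deliver, before any router is trained.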
[AI-5] Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings
Link: https://arxiv.org/abs/2606.27287
Authors: Preet Baxi,Jiannan Xu,Jane Yi Jiang,Stefanus Jasin
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few candidates inject. However, its effectiveness rapidly diminishes as more candidates inject, collapsing when manipulation becomes widespread. When candidate quality is heterogeneous, prompt injection is less effective on average, but can occasionally allow lower-quality candidates to outrank higher-quality ones, raising fairness concerns. Overall, LLM-based screening is most vulnerable when manipulation is rare and candidate quality differences are small. Code and resources are publicly available at: this https URL
[AI-6] Simulation-based inference for rapid Bayesian parameter estimation in epidemiological models: a comparison with MCMC
Link: https://arxiv.org/abs/2606.27286
Authors: Alina Bazarova, Johann Fredrik Jadebeck, Henrik Zunker, Carolina J. Klett-Tammen, Torben Heinsohn, Wolfgang Wiechert, Katharina Noeh, Stefan Kesselheim
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Mechanistic epidemiological models are widely used to support infectious disease forecasting and public-health decision making. Bayesian calibration of such models is commonly performed using Markov chain Monte Carlo (MCMC), which can become computationally expensive for high-dimensional nonlinear systems and repeated near-real-time analyses. Here, we investigate simulation-based inference (SBI) using neural posterior estimation as a scalable alternative for Bayesian calibration of a mechanistic SECIR epidemiological model using COVID-19 intensive care unit (ICU) occupancy data from Germany during 2020. We compared SBI and MCMC across multiple epidemic phases using both 31-day inference windows and a substantially more challenging 201-day reconstruction problem involving multiple transmission change points. Posterior agreement was evaluated quantitatively using Wasserstein distances and Kullback-Leibler divergences together with posterior predictive checks. Across the 31-day windows, SBI recovered posterior distributions in strong agreement with MCMC while accurately reproducing observed ICU trajectories. In the 201-day setting, SBI preserved the dominant posterior structure despite increased uncertainty. SBI, by combining CPU and GPU resources, substantially reduced computational runtime compared with MCMC, which was restricted to running on CPUs. Whereas MCMC required approximately 1000 seconds for the 31-day inference problems, SBI achieved comparable posterior and predictive performance in approximately 60-70 seconds on a single GPU. For the 201-day inference problem, SBI required an average of 157 seconds, while the MCMC runs took over 19,000 seconds. Our results demonstrate that SBI provides a rapid and computationally efficient framework for Bayesian calibration of mechanistic epidemiological models, supporting repeated near-real-time inference and rapid outbreak analysis.
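As a rough illustration of the posterior-agreement check mentioned above, the sketch below computes the 1-Wasserstein distance between two equal-size sets of samples of a single parameter. The abstract does not state the order or dimensionality of the distances used in the paper, so the one-dimensional case, the function name, and the sample values are all assumptions.

```python
def wasserstein_1d(xs, ys):
    """Exact 1-Wasserstein distance between two empirical 1-D distributions
    with the same number of samples: mean absolute gap between sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical posterior draws of one parameter from two samplers.
mcmc_draws = [0.31, 0.29, 0.35, 0.33]
sbi_draws = [0.30, 0.34, 0.32, 0.36]
print("W1 between samplers:", wasserstein_1d(mcmc_draws, sbi_draws))
```

A value near zero relative to the posterior spread indicates that the two samplers place mass in nearly the same region for that parameter.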
[AI-7] E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation ECCV2026
Link: https://arxiv.org/abs/2606.27268
Authors: Wen Ye, Peiyan Li, Tingyu Yuan, Yuan Xu, Xiangnan Wu, Chaoyang Zhao, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026. 44 pages, 11 figures. Project page: this https URL
Abstract:Recently, a few works have made early attempts to study test-time scaling for embodied tasks. However, two major challenges remain unsolved: (1) reasoning can effectively improve the performance of the policy, but its scaling mechanism has seldom been studied; (2) historical information is essential, as embodied tasks are inherently long-horizon and sequential, making sole reliance on current observations for action scaling inadequate due to the lack of historical context utilization. To address these challenges, we introduce E-TTS, a modular and plug-and-play Embodied Test-Time Scaling framework that unifies reasoning and action scaling for robotic manipulation via history-aware iterative refinement with vision-language verifiers. To support joint reasoning-action scaling, E-TTS performs reasoning-action joint sampling and scoring in a pairwise manner. To better utilize historical information, E-TTS uses a history buffer to store historical context, which is then used by reasoning and action verifiers to evaluate the sampled candidates. Unlike conventional open-loop TTS methods, E-TTS introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism, enhancing both inference efficiency and environmental adaptability. Each component functions as an independent and composable module, allowing flexible and adaptive configuration depending on task requirements. To evaluate the advantages of our framework, we conduct experiments across 4 different benchmarks, 6 environments, 3 embodiments, and 4 base vision-language-action models. The experimental results demonstrate that, without requiring additional expert data collection or retraining, E-TTS consistently improves performance, achieving up to a 33.14% increase in simulation and 26.62% in real-world scenarios.
[AI-8] Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy
Link: https://arxiv.org/abs/2606.27251
Authors: Junhao Shi, Zezheng Huai, Siyin Wang, Jia Chen, Yubang Wang, Zhaoye Fei, Hechang Chen, Jingjing Gong, Xipeng Qiu, Yu-Gang Jiang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Building persistent embodied agents in unstructured environments demands unified orchestration of heterogeneous tools spanning both cyber (APIs, IoT) and physical (manipulation, navigation) domains, coupled with autonomous recovery from physical failures that inevitably arise over extended operation. Existing systems treat these as separate problems: VLM-based planners lack a unified cyber-physical action space, agent frameworks accumulate unbounded context that degrades temporal coherence, and VLA policies execute open-loop without detecting their own failures. We argue that persistent autonomy requires not a monolithic model but a hierarchical asynchronous architecture with explicit separation of planning, memory, and verification. To this end, we present OmniAct, a framework integrating a multimodal semantic planner for skill routing across unified action spaces, an adaptive hierarchical memory with event-boundary-driven compression for sub-linear context growth, and an asynchronous visual preemption engine that closes the semantic loop during physical execution. Across 40 real-world long-horizon tasks on two robotic platforms coordinating four IoT devices, OmniAct achieves consistent improvements in end-to-end success across all complexity levels, maintains near-flat token consumption over 100k+ accumulated interaction tokens, and elevates mid-scale open-weight models to proprietary-level performance.
[AI-9] Vulnerability of Natural Language Classifiers to Evolutionary Generated Adversarial Text
Link: https://arxiv.org/abs/2606.27215
Authors: Manjinder Singh, Alexander E. I. Brownlee, Mohamed Elawady
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages
Abstract:Deep learning models have achieved impressive performance across various fields but remain vulnerable to adversarial inputs, particularly in NLP, where such attacks can have significant real-world consequences. Adversarial attacks often involve small, semantically similar token replacements to fool NLP models, and recent methods have become more precise by targeting specific vulnerable words, often by exploiting some level of access to the model’s internal structure. This paper proposes GAversary, a hybrid Genetic Algorithm (GA) to generate adversarial attacks on natural language models. The GA is able to treat the target model as a black box, requiring only the logit value output by the model to guide the search. GAversary differs from GAs previously proposed for this problem by using GloVe embeddings to propose word replacements (the mutation operator) to improve the semantic similarity of the adversarial examples. GAversary is applied to several benchmark data sets and well-known target models. GAversary reduces the target model’s accuracy on test data substantially more than the BAE and A2T attacks it is compared against (in the best case, reducing a 76.8% accuracy to 5.8%, compared to BAE’s 27.6%). The trade-off is that GAversary perturbs just under twice as many words as the other two methods, with a slightly lower semantic similarity to the original text and around a 5% increase in run-time.
[AI-10] A Process Harness for Uplifting Legacy Workflows to Agentic BPM: Design and Realization in CUGA FLO
Link: https://arxiv.org/abs/2606.27188
Authors: Fabiana Fournier, Lior Limonad
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures
Abstract:We introduce the process harness, a new mechanism for uplifting legacy workflows into Agentic Business Process Management (Agentic BPM) without replacing the underlying workflow engine. A process harness places a policy-governed agentic layer around a deterministic workflow engine, intercepting designated control points to contribute reasoning, adaptation, and oversight while the engine retains structural authority over the process. To define the process harness rigorously, we develop the Task-Decision-Flow (TDF) model, specifying both its data schema and its execution semantics. TDF decomposes LLM reasoning across three policy-governed agent types: a TaskAgent for knowledge-intensive task execution, a DecisionAgent for per-case gateway routing, and a FlowAgent that governs runtime flow adaptation through a principled hook mechanism. Each agent reasons within an explicit policy drawn from the process FRAME, the aggregate policy set governing all LLM calls in the system. We then present CUGA FLO as the design and implementation realization of the TDF model, and demonstrate it on a loan approval workflow that exercises all three agent types and hook-driven regulatory override. The process harness uniquely reconciles imperative requirements, realized through deterministic workflow execution that enforces structural compliance, with normative requirements, realized through policy-framed agentic autonomy invoked at designated control points wherever the process demands it.
[AI-11] Automating Potential-based Reward Shaping with Vision Language Model Guidance
Link: https://arxiv.org/abs/2606.27180
Authors: Henrik Müller, Daniel Kudenko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. Naive reward shaping can induce reward hacking, yielding policies that exploit auxiliary signals instead of solving the intended task. Potential-based reward shaping (PBRS) guarantees preservation of the optimal policy set, but requires the definition of a heuristic potential function over the state space. In this work, we introduce the VLM-guided PBRS framework VLM-PBRS that learns the potential function directly from vision language model (VLM) feedback. We query a lightweight VLM to obtain preferences over image pairs and train a model of the potential function using these preferences. As this approach is based on potential-based reward shaping, it preserves the original optimal policies, and removes the need for expert-designed reward shaping terms. Because large VLMs are prohibitively expensive to invoke repeatedly during policy learning, we employ smaller, more computationally efficient VLMs. Although the resulting preference labels are less accurate, empirical evidence shows that the preference labels can still be used to accelerate learning. We validate our method empirically in the Meta-World and Franka Kitchen environments and highlight the connection between VLM preference label accuracy and sample efficiency improvements. Our contributions are threefold: (1) the first application of VLM preference-based learning to synthesize a potential function for PBRS, (2) a principled, low-cost solution that leverages small VLMs, and (3) extensive empirical demonstration of improved sample efficiency and robustness to reward hacking.
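The policy-preservation property of potential-based reward shaping comes from the form of the shaping term, F(s, s') = γΦ(s') − Φ(s), whose discounted sum along any trajectory telescopes to a quantity that depends only on the first and last states. The sketch below checks this numerically. The hand-written potential `phi` is a hypothetical stand-in: in the paper the potential is learned from VLM preference labels, which is not reproduced here.

```python
def shaping_reward(phi, s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

def discounted_sum(values, gamma):
    """Sum of gamma^t * values[t] over the sequence."""
    return sum(gamma**t * v for t, v in enumerate(values))

def phi(s):
    """Hypothetical potential: negative distance to a goal at state 10."""
    return -abs(10 - s)

gamma = 0.9
trajectory = [0, 3, 2, 7, 10]  # arbitrary states, including a step backwards

bonuses = [shaping_reward(phi, s, s_next, gamma)
           for s, s_next in zip(trajectory, trajectory[1:])]
total_bonus = discounted_sum(bonuses, gamma)

# Telescoped form: gamma^T * phi(s_T) - phi(s_0), independent of the path taken.
steps = len(trajectory) - 1
telescoped = gamma**steps * phi(trajectory[-1]) - phi(trajectory[0])
print(total_bonus, telescoped)
```

Because the total bonus does not depend on the intermediate states, the shaping term changes how quickly an agent finds good behavior but not which policies are optimal.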
[AI-12] Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline) ICRA2026
Link: https://arxiv.org/abs/2606.27163
Authors: Ilia Larchenko
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Solution of the LeHome Challenge at ICRA 2026
Abstract:I describe my solution to the LeHome Challenge 2026, an ICRA 2026 competition on bimanual garment folding. The system placed 1st of 62 teams in the online (simulation) round and 2nd in the real-world final. It improves a vision-language-action (VLA) policy with a reinforcement-learning loop. The policy is its own value function: the same network that predicts actions also predicts success, progress, and a few task-relevant future quantities, and those predictions drive advantage estimation, live failure detection, and candidate selection. The work mostly recombines existing RL ideas with engineering and optimization contributions that can be used together as one recipe or individually: AWR + RECAP combined for flow-matching VLA; an asynchronous distributed training / rollout pipeline through HuggingFace Hub; inference-time hyperparameter optimization via Thompson sampling; a sim-to-real recipe with camera-alignment tooling, heavy augmentation, and DAgger-like HIL data collection.
[AI-13] TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
Link: https://arxiv.org/abs/2606.27161
Authors: Tinghao Wang, Yichen Guo, Rui Huang, Zheng Lu, Qizhe Zhang, Chenxi Li, Yuan Zhang, Jiajun Cao, Zhirong Shen, Yaosong Du, Guangyan Gan, Wenya Wang, Lin William Cong, Shanghang Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 18 figures
Abstract:Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principled formulation of the intrinsic objective of token pruning. In this paper, we revisit visual token pruning from a first-principles perspective and formulate it as constructing Token Optimal Preservation Sets. Through a top-down information-theoretic analysis, we identify three fundamental principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. Based on these principles, we propose TOPS, a training-free and model-agnostic pruning module that can be applied to various MLLMs. Extensive experiments on 7 MLLM backbones and 14 benchmarks demonstrate that TOPS outperforms prior methods under diverse pruning settings. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% performance on its 7B and 13B models, respectively, suggesting that pruning redundant visual tokens can sometimes mitigate hallucination and inspire future lightweight MLLM design.
[AI-14] OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
Link: https://arxiv.org/abs/2606.27154
Authors: Aoyang Fang, Yifan Yang, Jin’ao Shang, Qisheng Lu, Junjielung Xu, Rui Wang, Songhan Zhang, Yuzhong Zhang, Boxi Yu, Pinjia He
Subjects: Artificial Intelligence (cs.AI)
Comments: work in progress
Abstract:Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.
[AI-15] Joint Learning of Experiential Rules and Policies for Large Language Model Agents
Link: https://arxiv.org/abs/2606.27136
Authors: Shicheng Ye, Chao Yu
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited correction for local mistakes in sparse-reward settings. We present Joint Learning of Experiential Rules and Policies for LLM Agents (JERP), which updates a long-term experiential-rule pool and the policy from the same interaction trajectories. At decision time, JERP retrieves task-relevant rules and conditions the agent on them together with the interaction history. After each episode, it uses the collected trajectories both to optimize the policy and to revise the rule pool by comparing current rollouts with reference successful trajectories. This coupling keeps the rule pool aligned with the evolving policy while allowing stable and effective behaviors to be gradually absorbed into the model itself. Experiments on AlfWorld and WebShop show that JERP yields consistent gains in decision performance for complex interactive tasks.
[AI-16] Heavy-Ball Q-Learning with Residual Weighting Correction
Link: https://arxiv.org/abs/2606.27112
Authors: Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper proposes a corrected heavy-ball Q-learning method for reinforcement learning (RL) and establishes its convergence. It also identifies conditions under which the method is theoretically guaranteed to converge faster than standard Q-learning. The same construction is then extended to Q-learning with linear function approximation, where analogous convergence and acceleration statements are derived. The analysis is based on a switched linear system (SLS) representation of Q-learning algorithms and on the joint spectral radius (JSR) of the associated switching families. This SLS viewpoint is not commonly used in standard analyses of Q-learning, and it provides a complementary framework and new insight into how heavy-ball momentum can accelerate Q-learning.
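To show what a heavy-ball term looks like in this setting, the sketch below adds plain momentum to a synchronous Q-value update on a tiny deterministic MDP. It is only an illustration of the momentum idea: it is not the paper's corrected method, the residual weighting correction is not reproduced, and the MDP, step sizes, and function name are all hypothetical.

```python
def heavy_ball_q_iteration(rewards, next_state, gamma, alpha, beta, iters):
    """Synchronous Q update with a heavy-ball momentum term.

    rewards[s][a] and next_state[s][a] describe a deterministic MDP.
    Setting beta = 0 recovers the ordinary damped Q-value iteration.
    """
    n_s, n_a = len(rewards), len(rewards[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    q_prev = [row[:] for row in q]
    for _ in range(iters):
        q_new = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                target = rewards[s][a] + gamma * max(q[next_state[s][a]])
                q_new[s][a] = (q[s][a]
                               + alpha * (target - q[s][a])        # standard step
                               + beta * (q[s][a] - q_prev[s][a]))  # momentum
        q_prev, q = q, q_new
    return q

# Hypothetical one-state MDP: action 0 pays 1, action 1 pays 0, both loop back.
rewards = [[1.0, 0.0]]
next_state = [[0, 0]]
q = heavy_ball_q_iteration(rewards, next_state,
                           gamma=0.9, alpha=0.5, beta=0.2, iters=500)
print(q)
```

For this toy problem the fixed point is Q(0, 0) = 1 / (1 − 0.9) = 10 and Q(0, 1) = 0.9 × 10 = 9. Naive momentum is not guaranteed to converge in general, which is the kind of gap a corrected scheme is meant to close.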
[AI-17] Application of LLMs to Threat Assessment of Foreign Peacekeeping Missions
Link: https://arxiv.org/abs/2606.27106
Authors: Gerhard Backfried, Christian Schmidt, Diego Pilutti, Michael Suker
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a novel approach for applying Large Language Models (LLMs) to threat assessment in the context of foreign peacekeeping missions. Building on the PINPOINT project and its use case, the EU Monitoring Mission in Georgia, we combine an interdisciplinary risk-model with OSINT-based media collection and LLM-supported threat extraction. The proposed workflow maps media contents to mission-relevant threats, extracts structured information and applies several additional LLM-based processing steps to improve relevance and grounding. An evaluation of threats extracted from media documents shows high agreement between automatically generated results and human judgment for core aspects such as threat and mission relevance. These results indicate that LLMs provide a promising approach to support analysts in the context of peacekeeping missions.
[AI-18] Data-Free Reservoir Features for Efficient Long-Horizon Cold-Start Continual Learning
Link: https://arxiv.org/abs/2606.27095
Authors: Augustinas Jučas, Yangchen Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cold-start exemplar-free class-incremental learning requires learning a growing set of classes without replay, external pretraining, or a large initial task. Existing cold-start methods typically either train the backbone throughout the stream and compensate for semantic drift, or freeze a backbone after the first task, producing features biased toward the initial classes. These choices also create a computational tension: drift-compensation methods require repeated backbone training and increasingly expensive updates as the task horizon grows, while frozen-backbone methods are cheap but weak under cold start. We study a third option: a feature extractor that is never fit to image data at all. We propose CIRCLE, a class-incremental classifier built from fixed bidirectional two-dimensional reservoir features, adapted from BiRC2D for image classification, and streaming linear discriminant analysis heads. CIRCLE groups multiple random reservoir instantiations into feature ensembles and averages the softmax outputs of independent SLDA heads, yielding a tunable bias-variance tradeoff between richer random features and prediction-level ensembling. Because the feature extractor is fixed and the head admits streaming closed-form updates, CIRCLE performs sample-wise training without replay, task-boundary information, or backbone backpropagation. On CIFAR-100, TinyImageNet, ImageNet-Subset, and ImageNet-1k, CIRCLE is competitive at 10-20 task splits and substantially outperforms strong CS-EFCIL baselines at 50, 100, and 500 task splits, while training much faster than trained-backbone drift-compensation methods. Ablations show that the BiRC2D-style extractor, SLDA head, and balanced feature/prediction ensembling each contribute to the final performance.
[AI-19] Inherited Circuits, Learned Semantics: How Fine-Tuning Creates Evasion Vulnerabilities Invisible to Standard Evaluation
Link: https://arxiv.org/abs/2606.27091
Authors: Ryan Fetterman
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLMs fine-tuned for security classification are usually evaluated on held-out examples from the same distribution as their training data. We show that this can miss vulnerabilities introduced by fine-tuning itself: models can learn token-level indicator semantics that preserve canonical accuracy while failing under behavior-preserving transformations such as PowerShell alias substitution, command reconstruction, string construction, execution indirection, and case mutation. We study Foundation-Sec-8B-Instruct and its base model, Llama-3.1-8B-Instruct, on matched PowerShell classification cohorts. Causal interventions localize the classification circuit to a late-attention route inherited from Llama rather than created by fine-tuning. Fine-tuning concentrates and semantically specializes this inherited structure, improving baseline behavior while creating transformation-sensitive attack surfaces. A three-tier evasion benchmark finds Foundation-Sec misses on iwr substitution, Invoke-Expression reconstruction, and case-mutated Invoke-Expression/IEX variants that Llama does not share. We also derive a pre-deployment monitoring method: a linear probe at the classification boundary and an indicator-token sign test identify command families where canonical indicators change role after fine-tuning. These signals prioritize red-team variant generation using only canonical inputs, showing that security fine-tuning can improve task accuracy while expanding the evasion surface. These results caution against treating small task-specific fine-tunes as straightforwardly safer security classifiers: specialization can convert inherited model structure into brittle indicator rules that preserve held-out accuracy while expanding the evasion surface. Robust AI-enabled security will require specifying the full transformation space of the task and monitoring semantic drift through fine-tuning.
[AI-20] Parametric Open Source Games ICML
Link: https://arxiv.org/abs/2606.27068
Authors: Aleksandar Todorov, Jesse ten Napel, Alexander Müller
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML Workshop New Frontiers in Game-Theoretic Learning-NExT-Game
Abstract:Open-source game theory studies agents whose behavior may depend on one another’s decision procedures, but most existing models use discrete or symbolic programs. We introduce parametric open-source games, a continuous analogue of program equilibria in which players choose parameter vectors and semantics maps convert the full parameter profile into mixed actions in an underlying finite game. We establish equilibrium existence results, derive an exact coupling threshold at which selfish gradient ascent in symmetric 2×2 games switches from defection toward cooperation, and give a one-dimensional boundary test for parametric program Nash equilibria. We further extend the framework to a neural semantics class whose first-order cooperation condition is governed by the ratio of cross-player to self-player sensitivity. Across canonical games, the framework shows how access to internal parameterizations can qualitatively reshape learning dynamics and equilibrium structure, and how sufficiently strong open-source coupling can steer selfish optimization toward cooperative outcomes.
[AI-21] How to evaluate clustering with ground truth?
Link: https://arxiv.org/abs/2606.27061
Authors: Pasi Fränti
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint of a book chapter to appear: P. Fränti, “How to evaluate clustering with ground truth?”, In Center-based clustering, Springer Nature, 2026
Abstract:External indexes can be used for cluster evaluation when ground truth is available. We review the most common external validity indexes focusing on set-matching-based measures. We recommend centroid index (CI), because it is an intuitive cluster-level measure with an explainable result. If we need a more fine-tuned, point-level measure, there are more choices. Pair-set index (PSI) provides a normalized score which is not biased by cluster sizes. If all points should matter equally, then clustering accuracy (ACC) or any other set-matching measure is suitable.
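A minimal sketch of the centroid index follows, assuming its usual nearest-centroid definition: map every centroid of one solution to its nearest centroid in the other, count the target centroids that receive no mapping (orphans), and take the larger count over the two directions. The abstract itself does not spell the definition out, and the function names and example centroids are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))

def orphans(source, target):
    """Number of target centroids that no source centroid maps to."""
    mapped = {nearest(c, target) for c in source}
    return len(target) - len(mapped)

def centroid_index(c1, c2):
    """Symmetric centroid index: 0 means the cluster-level structure matches."""
    return max(orphans(c1, c2), orphans(c2, c1))

# Hypothetical ground truth with three clusters, and a solution that puts
# two centroids in the first cluster and none in the third.
truth = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
found = [(0.2, 0.1), (-0.3, 0.2), (10.0, 0.1)]
print(centroid_index(truth, found))
```

The result reads directly as the number of clusters that are wrongly allocated, which is what makes the measure easy to explain at the cluster level.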
[AI-22] The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development
Link: https://arxiv.org/abs/2606.27045
Authors: Hartwig Grabowski
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI coding agents dramatically accelerate implementation speed but introduce two structural failure modes that existing spec-driven approaches do not fully solve: (1) context explosion – the agent must reason over an entire repository at once, degrading output quality as the context window fills; and (2) silent spec-code drift – code evolves, the specification does not, and the divergence becomes invisible until it is costly to repair. We present the Spec Growth Engine, a lightweight framework that addresses both failure modes through a machine-readable spec graph whose nodes carry explicit contract/design separation, a Spine context assembler that scopes agent context to an ownership path, a vertical-slice growth protocol that enforces hardest-first ordering, and a drift gate that makes spec-code divergence a blocking merge condition. The design synthesises well-established software engineering principles (Parnas information hiding, C4, ADRs, Walking Skeleton, Reflexion Models, Fitness Functions) into a lean, code-coupled, machine-enforced whole – without the overhead of heavy-weight frameworks such as RUP or MDA.
[AI-23] State Representation Matters in Deep Reinforcement Learning: Application to Energy Trading
Link: https://arxiv.org/abs/2606.27032
Authors: Jesper Klicks, Sander Vržina, Vincent François-Lavet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Energy trading decisions depend not only on current market prices, but also on expected future market conditions and operational constraints. This makes the state representation given to a reinforcement learning agent an important design choice. We study this in HydroDam, a pumped-storage arbitrage environment, using a fixed Double DQN agent. The environment, action space, reward function, network, and training protocol are kept fixed; only the market features are changed. We compare absolute price/calendar features, relative features that compare current prices with recent market history, forecast features, and all combinations of these three feature families. Policies are trained and selected using 2007–2011 Belgian day-ahead prices and evaluated on two test settings: a later same-market test set from 2012–2025 and 39 other ENTSO-E market zones. Absolute-only features reach 28.8% on the test set and a median of 5.7% across zones. Relative-only and forecast-only states also stay below a rolling price-score heuristic in the cross-zone median. Combining feature families is much stronger: absolute + relative reaches 49.9% on the test set and a 39.8% cross-zone median, while absolute + relative + forecast reaches 55.6% and 47.5%. These results suggest that state representation is not a minor preprocessing choice in storage-trading RL, but a central part of the policy design: robust transfer requires combining price scale, recent relative price context, and short-horizon forecast information, rather than relying on any single feature family.
[AI-24] ShareLock: A Stealthy Multi-Tool Threshold Poisoning Attack Against MCP
Link: https://arxiv.org/abs/2606.27027
Authors: Liwei Liu, Tianzhu Han, Zijian Liu, Zishu Dong, Na Ruan
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages, 12 figures
Abstract:With the rapid evolution of LLM-driven agents, Model Context Protocol (MCP), an open protocol bridging LLMs with external tools, has quickly become foundational to modern agent ecosystems. However, the expanding adoption of MCP has also introduced novel security concerns such as Tool Poisoning Attack (TPA), which exploit LLM-server interactions to inject malicious prompts. Existing poisoning schemes typically adopt a monolithic plaintext embedding paradigm, which fails to withstand manual inspection or automated detectors. Current research still lacks a systematic analysis on multi-tool poisoning, where multiple tools can be exploited cooperatively to disperse detection risk. In this paper, we introduce ShareLock, a multi-tool threshold poisoning framework that utilizes Shamir’s threshold scheme to ensure exceptional stealth and fault tolerance. ShareLock distributes the malicious instruction as benign-looking secret shares across multiple tool descriptions, achieving both information-theoretic secrecy and attack robustness against moderate auditing. After a covert reconstruction trigger is planted during server update, the aggregated shares reconstruct the hidden instruction, resulting in critical breaches of system assets or private data. To evaluate the realistic threat of ShareLock, we constructed a comprehensive benchmark encompassing four multi-tool scenarios and conducted extensive experiments across mainstream LLMs on two distinct MCP clients. Our results demonstrate that ShareLock significantly outperforms existing single-tool poisoning strategies in tool description-based detection while maintaining an average attack success rate exceeding 90%.
[AI-25] Adaptive Utility driven Resource Orchestration for Resilient AI (AURORA-AI)
Link: https://arxiv.org/abs/2606.27005
Authors: Rahul Umesh Mhapsekar, Ilias Cherkaoui, Lizy Abraham, Indrakshi Dey
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Research and Technologies for Society and Industry 2026 conference
Abstract:Modern AI systems are increasingly deployed under non-stationary computational, demographic, and operational conditions in which static resource allocation strategies degrade both predictive performance and human-centric properties such as fairness and explainability. This paper presents AURORA-AI, an Adaptive Utility-driven Resource Orchestration framework for Resilient AI that unifies Hamilton-Jacobi-Bellman feedback control, Lyapunov-based stability monitoring, and a fairness-aware composite utility into a single closed loop. The framework continuously redistributes computational budget across a population of heterogeneous AI models so that the global utility, defined jointly over predictive performance, demographic parity, cost, latency, robustness, and interpretability, remains maximised under disruption. The framework is evaluated in a stress-rich discrete-time simulation that concurrently injects demographic bias shocks, gradual concept drift, and abrupt black-swan disruptions, and is compared against five established controllers including Static, Round Robin, Greedy, LinUCB, and a deep reinforcement-learning agent based on Proximal Policy Optimisation. AURORA-AI achieves immediate recovery from the black-swan event compared to eighty-eight time steps for the Static baseline and twenty-two for Proximal Policy Optimisation, lifts the alpha-quantile and the super-quantile by twenty-nine and twenty-five percent respectively, simultaneously reduces the mean and maximum demographic parity gap, and increases the fraction of Lyapunov-stable operating steps. These results indicate that fairness-aware adaptive orchestration grounded in stability theory is a practical and theoretically motivated path toward resilient human-centric AI deployment.
[AI-26] Decision-Aligned Evaluation of Uncertainty Quantification
Link: https://arxiv.org/abs/2606.26990
Authors: Annika Schneider, Tommy Rochussen, Joshua Stiller, Vincent Fortuin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high utility in downstream decisions. We introduce decision-alignment, a criterion that reveals which evaluation metrics meaningfully align with downstream utilities. Applying this framework, we show that many widely used uncertainty metrics are either misaligned with common decision problems or encode pathological prior beliefs about the downstream task. We then propose prior-weighted utility metrics, a special class of proper scoring rules that provides decision-aligned uncertainty evaluation. Across benchmark experiments and real-world case studies, our metrics consistently align with realized decision utility, while conventional metrics do not. Our results surface flaws in the current UQ evaluation protocol and offer a principled extension of existing metrics toward decision-relevant UQ evaluation.
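For readers unfamiliar with the two generic metrics the abstract names, the following is a minimal sketch of negative log-likelihood and expected calibration error for a classifier. It is not the paper's prior-weighted utility metric; the function names, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
import math


def negative_log_likelihood(probs_of_true_label):
    """Mean NLL, given the probability the model assigned to each true label."""
    return -sum(math.log(p) for p in probs_of_true_label) / len(probs_of_true_label)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (binning scheme is an assumption).

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     booleans, True when the chosen class was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Both functions score the predictive distribution without reference to any downstream decision, which is the property the abstract questions.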
[AI-27] In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics
Link: https://arxiv.org/abs/2606.26981
Authors: Xiaomeng Fu, Junfan Lin, Yang Liu, Yaowei Wang, Guanbin Li, Liang Lin, Ziliang Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic complexity, fine-grained instructions, and novel concepts. To address this gap, we propose In-Context Model Predictive Generation (ICMPG), a framework that integrates language-model planning with inference-time physical feedback. ICMPG reformulates motion synthesis as a Model Predictive Control (MPC)-like process with two modules. The Context-Aware Motion Generation (CAMG) module uses an LLM as a planner to decompose textual commands and generate candidate motion sequences from motion tokens. The Model Predictive Generation (MPG) module evaluates these candidates through physical simulation and semantic alignment, estimates a composite reward, and selects the best sequence to guide subsequent generation steps. Unlike open-loop generation, this closed-loop refinement enables ICMPG to adapt motions to both the input semantics and the simulated physical environment without task-specific policy retraining. Extensive experiments across standard and zero-shot open-vocabulary settings show that ICMPG generalizes robustly to diverse commands and produces motions that are more physically plausible and semantically faithful than representative baselines on the evaluated benchmarks. The framework bridges semantic interpretation and physical simulation while remaining flexible enough to incorporate different LLM backbones, enabling more versatile and controllable text-driven motion synthesis.
[AI-28] Where Do CoT Training Gains Land in LLM based Agents?
Link: https://arxiv.org/abs/2606.26935
Authors: Jingyu Liu, Zhiwen Wang, Yuxin Jing, Huanyu Zhou, Yong Liu
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Chain-of-thought (CoT) reasoning is widely used in language-model agents, but prior work has shown that verbalized CoT is not always faithful and may instead reflect post-hoc reasoning, which means the model already knows the answer before reasoning. We therefore ask what CoT training is actually improving: is the model getting better at changing its action through generated reasoning, or is it getting better at predicting the action directly from the prompt? We study this question by comparing prompt actions (predicting action without CoT) with CoT actions (predicting action with CoT). Across checkpoints, prompt-action quality improves substantially. While interacting with the environment, the relative advantage of CoT actions over prompt actions remains similar, showing that CoT training does not widen the advantage of CoT reasoning, and it helps to improve the quality of prompt actions. We further find that later checkpoints are less likely to revise the action in response to CoT, suggesting greater reliance on the prompt. Motivated by these patterns, we selectively mask action-token supervision on a fraction of training examples. This intervention improves out-of-domain generalization.
[AI-29] Chai: Agentic Discovery of Cryptographic Misuse Vulnerabilities
Link: https://arxiv.org/abs/2606.26933
Authors: Corban Villa, Sohee Kim, Austin Chu, Alon Shakevsky, Raluca Ada Popa
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-assisted vulnerability discovery has proven effective for bug classes like memory safety, where instrumentation confirms memory violations and efficiently filters false positives. Many dangerous vulnerability classes, such as cryptographic misuse, however, lack any comparable instrumentation. In this work, we present Chai, an AI-based system that discovers and validates cryptographic misuse vulnerabilities through naturally occurring signals. To achieve this, Chai rethinks the classical technique of differential testing by leveraging AI to 1) improve precision for detecting real security issues in libraries, and 2) repurpose commonly overlooked discrepancies as leads for tangible vulnerabilities in downstream applications. In doing so, Chai inverts the prevailing paradigm of AI vulnerability discovery: instead of auditing one codebase for many flaws, it catalogs flaws at the library level and propagates them across a cryptographic dependency graph, delivering compounding efficiency gains. We evaluate Chai across X.509, JWT, and SAML libraries. Chai discovered a previously unknown critical vulnerability in an SSL library that powers billions of devices, along with security bugs in one library behind a major web browser and another in major Linux distributions. In total, these techniques surfaced over 100 vulnerabilities.
[AI-30] A Deterministic Control Plane for LLM Coding Agents
Link: https://arxiv.org/abs/2606.26924
Authors: Padmaraj Madatha
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 45 pages, 9 figures, 13 tables. Dataset and reproduction scripts: Zenodo DOI https://doi.org/10.5281/zenodo.20780913. Ancillary files include this http URL, this http URL, and figure-reproduction scripts
Abstract:LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them – rules files, agent definitions, IDE-specific markdown – is largely unmanaged. A prevalence study of 10,008 public GitHub repositories (n=6,145 agent config files) finds that agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent), with 75.5% of clone pairs crossing organisational boundaries. Two further patterns are indicative: configurations are rarely revised (58% single-commit; 0.4 vs 0.6 commits/month age-normalised against CI/CD workflows), and rarely declare permission boundaries (1% of agent configs vs 33% of Actions workflows, n=31 true positives). We propose a deterministic control plane above the harness that maps one-to-one to these gaps. Rel(AI)Build treats agent definitions as a managed supply chain (SHA-256 content addressing, HMAC-stamped lockfiles, hash-chained audit logs); enforces tiered permissions and attack-derived blocklists before LLM invocation; gates feature work through a phase state machine with requirement-to-file-to-test traceability; compiles a single canonical definition to seven IDE targets; and detects prompt drift via Jaccard similarity. Conformance tests on injected violations confirm each mechanism enforces its stated invariant; developer outcomes remain future work. Governance of this layer must be deterministic and tool-agnostic – not delegated to further LLM orchestration.
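The supply-chain primitives named in the abstract (SHA-256 content addressing, HMAC-stamped lockfiles, hash-chained audit logs, Jaccard similarity for drift) can be sketched with the Python standard library. This is an illustrative sketch only, not Rel(AI)Build's implementation: every function name, the log entry layout, and the whitespace tokenisation used for Jaccard similarity are assumptions.

```python
import hashlib
import hmac
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first entry's predecessor


def content_address(data: bytes) -> str:
    """SHA-256 hex digest used as a content address."""
    return hashlib.sha256(data).hexdigest()


def stamp_lockfile(lock_bytes: bytes, key: bytes) -> str:
    """HMAC-SHA256 stamp over the lockfile contents."""
    return hmac.new(key, lock_bytes, hashlib.sha256).hexdigest()


def verify_lockfile(lock_bytes: bytes, key: bytes, stamp: str) -> bool:
    """Constant-time comparison of the recomputed stamp with the stored one."""
    return hmac.compare_digest(stamp_lockfile(lock_bytes, key), stamp)


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    log.append({"prev": prev_hash, "event": event,
                "entry_hash": _entry_hash(prev_hash, event)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["entry_hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["entry_hash"]
    return True


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens (tokenisation assumed)."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

All four checks are deterministic functions of their inputs, which matches the abstract's argument that governance of this layer should not depend on further LLM calls.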
[AI-31] Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling
Link: https://arxiv.org/abs/2606.26922
Authors: Daosheng Qiu, Haozhuang Chi, Hao Su, Shu Long, Xinyue Miao, Yongle Dong, Wei Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Continuous driver monitoring in automated vehicles requires low-latency inference while avoiding unsafe decisions under uncertain driver states. Large vision-language models provide broad multimodal priors, but their latency and limited reliability in this setting make them unsuitable as always-on in-cabin monitors. We propose a cost-aware selective inference framework for deployable multimodal driver monitoring. The core system is a lightweight RGB-physiological student that combines in-cabin visual observations with window-level HR/EDA signals, and a learned gate that decides when to accept the fast prediction or abstain for safety intervention. Additional controls show that the learned scores contain sample-level information beyond scenario priors, while exact physiological synchronization remains a limitation. To incorporate predictive evidence, we further study a compact driver-state world modeling module that rolls out latent driver-state features and estimates future fast-model errors and counterfactual system-level action costs. On scenario-induced driver-demand recognition, the RGB-physiological student improves over RGB-only and physiology-only baselines, reaching 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39M parameters and 3.08ms inference latency. Cost-aware selective inference reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds, while maintaining deployment-level latency. While driver-state world modeling offers valuable predictive signals, worst-group evaluations highlight persistent operating-point calibration drift. Ultimately, reliable edge driver monitoring requires advancing not only perception backbones, but also risk-aware selective control and group-robust calibration.
[AI-32] Diagnosing Task Insensitivity in Language Agents
Link: https://arxiv.org/abs/2606.26918
Authors: Jingyu Liu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Yong Liu
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models can serve as capable long-horizon agents, but their out-of-distribution (OOD) generalization remains weak. We identify a key source of this failure as task insensitivity: when faced with similar but distinct tasks, models might apply patterns learned during training and fail to solve the task at hand. We show that models often continue with actions aligned with the original task even when the instruction is semantically corrupted and cannot be directly answered. We further find that, when we replace the task description in a trained prompt with another similar but distinct task, the model may still output the same action. This behavior is accompanied by a consistent training-time attention drift away from task tokens and toward local observations, suggesting an optimization bias toward shortcuts. To mitigate this problem, we propose Task-Perturbed NLL Optimization, a lightweight contrastive regularizer that explicitly encourages action dependence on the task instruction. Extensive evaluations show that our intervention improves task sensitivity and OOD generalization while preserving more stable attention to task tokens.
[AI-33] GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning ICML2026
Link: https://arxiv.org/abs/2606.26917
Authors: Ting Zhou, Zhenqing Ling, Yiyang Zhao, Ying Shen, Daoyuan Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a conference paper at ICML 2026
Abstract:Online reinforcement learning is widely used to align large language models (LLMs) with reward signals, yet training can be unstable under noisy or misspecified rewards. We identify a failure mode we call directional inconsistency: within a batch, a small set of high-reward rollouts induces representation-space preference directions that sharply disagree with the batch majority, resulting in high-variance and destabilizing updates. We propose geoalign, a lightweight plug-in for rollout curation in iterative policy optimization. Geoalign (i) forms within-prompt preference pairs, (ii) learns an online projector on per-rollout hidden states to concentrate reward-ordered displacement directions, and (iii) detects directionally inconsistent rollouts via their angular deviation from a batch consensus prototype and rectifies them with within-prompt stable alternatives. Geoalign is forward-pass only and adds negligible overhead. Across dialogue alignment with a learned reward model and mathematical reasoning with binary verified rewards, Geoalign improves final performance and reduces training oscillation, outperforming PF-PPO, PAR, PODS, and Seed-GRPO. These results suggest latent directional consensus as an effective reliability signal for online LLM RL.
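Step (iii) of the abstract, detecting rollouts by their angular deviation from a batch consensus prototype, can be illustrated with plain vectors. The sketch below is an assumption-laden simplification: it takes the preference directions as given (the paper learns a projector on hidden states), uses the normalised mean direction as the prototype, and uses a hypothetical fixed angle threshold.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def consensus_prototype(directions):
    """Normalised mean of the unit directions (an assumed prototype definition)."""
    units = [normalize(d) for d in directions]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return normalize(mean)


def inconsistent_indices(directions, max_angle_deg=90.0):
    """Indices whose angle to the prototype exceeds the (hypothetical) threshold."""
    prototype = consensus_prototype(directions)
    flagged = []
    for index, direction in enumerate(directions):
        unit = normalize(direction)
        cosine = sum(a * b for a, b in zip(unit, prototype))
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
        if math.degrees(math.acos(cosine)) > max_angle_deg:
            flagged.append(index)
    return flagged
```

In this toy form a flagged rollout would then be swapped for a within-prompt alternative, as the abstract describes; that replacement step is not shown.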
[AI-34] Learning to Recover Task Experts from a Multi-Task Merged Model
Link: https://arxiv.org/abs/2606.26902
Authors: Jinwook Jung, Taegyu Kim, Kumju Jo, Sungyong Baik
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-task model merging aims to consolidate several task-specific experts into a unified model, yet static merging consistently suffers from parameter interference. While dynamic merging models aim to bridge this gap, many works rely on the costly storage and loading of redundant expert components at inference. In this work, from the perspective of a task expert, we view parameter interference as a parameter perturbation introduced to each expert during the merging process. We show that such parameter perturbations can be modeled as an affine transformation, which can be approximated as additive offsets. Motivated by this, we propose Recover Task eXpert (ReTeX), a framework that predicts those offsets in order to undo parameter interference and recover task-expert performance from a single merged checkpoint. To recover the appropriate expert when task identity is unknown, we introduce a router-free task identifier based on SVD subspace signatures computed offline before inference. At inference, the identifier selects the task whose subspace yields the smallest projection residual for a given input. As a result, ReTeX recovers over 95% of individual-expert performance in both vision and NLP domains, while significantly improving generalization to unseen tasks. Crucially, we also show that the parameter offset prediction leads to emergent adaptive interpolation of expert knowledge for out-of-distribution (OOD) tasks. ReTeX adaptively interpolates seen expert knowledge to handle unseen tasks. Our code is available at this https URL
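The router-free identification rule in the abstract (pick the task whose subspace gives the smallest projection residual) can be sketched as follows. This assumes each task's subspace is already available as a list of orthonormal basis vectors, for example the leading right singular vectors from an SVD computed offline; the task names and dimensions are made up for illustration and the paper's actual feature space is not specified here.

```python
import math


def projection_residual(x, orthonormal_basis):
    """Norm of x minus its projection onto span(orthonormal_basis)."""
    projection = [0.0] * len(x)
    for basis_vector in orthonormal_basis:
        coefficient = sum(a * b for a, b in zip(x, basis_vector))
        for i, component in enumerate(basis_vector):
            projection[i] += coefficient * component
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(x, projection)))


def identify_task(x, task_bases):
    """Return the task name whose subspace leaves the smallest residual for x."""
    return min(task_bases, key=lambda name: projection_residual(x, task_bases[name]))


# Hypothetical subspace signatures in a 3-dimensional feature space.
TASK_BASES = {
    "vision": [[1.0, 0.0, 0.0]],
    "nlp": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
```

Because the bases are computed once offline, identification at inference time costs only a few dot products per task and needs no trained router.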
[AI-35] Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization
Link: https://arxiv.org/abs/2606.26899
Authors: Chenghao Liu, Yu Zhang, Zhongtao Jiang, Kun Xu, Zhenwei An, Renzhi Wang, Zhao Wang, Jiachen Zhang, Yuxiao Zhang, Kun Xu, Songfang Huang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Embedding-based retrieval ranks items by their similarity to a query in a shared vector space and usually aims to return the highest-scoring items. In many production settings this is not what is wanted: given a seed set that expresses a fine-grained pattern, one needs more items that both satisfy a target attribute and stay within that pattern. We formalize this as pattern-preserving attribute retrieval. The two goals pull against each other: averaging the seeds preserves the pattern but stays in a low-attribute region, while global attribute retrieval drifts to unrelated patterns. We approach the task with continuous generative retrieval, where a model reads a sequence of item embeddings and generates query embeddings for nearest-neighbor search. We propose MO-DiT+HPPO, a staged framework with raw-sequence pretraining, multi-domain metric-ordered continuation pretraining, tail-centroid fine-tuning, and HPPO. Metric-ordered training turns sparse online retrieval labels into in-pattern trajectories ordered from low to high predicted attribute density, teaching one model the metric-improvement direction across domains. HPPO aligns the generated query distribution with the true online objective by labeling a hybrid candidate pool with the online intersection metric and applying reference-anchored preference optimization. A Pareto pair filter keeps only winner pairs that do not lower same-pattern purity, raising the attribute metric without sacrificing the pattern. Across four attribute domains under item- and pattern-holdout protocols, metric-ordered DiT improves the intersection metric over a pretrained generative retriever, and HPPO improves it further, with significant gains on seven of eight domain-split cells and a marginal tie on the hardest split. Metric-predictor validation, order ablations, CPT/SFT comparisons, and a candidate-policy ablation show where the gains come from.
[AI-36] A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models
Link: https://arxiv.org/abs/2606.26879
Authors: William Poulett
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Synthetic data is increasingly used to enable the development and evaluation of AI systems in domains where access to real-world data is restricted. In healthcare, clinical documentation presents particular challenges due to its sensitivity. This work introduces a synthetic clinical notes pipeline and dataset designed to support the development of clinical AI tools while avoiding the privacy risks associated with real patient data. The dataset is generated using a modular pipeline that combines structured patient generation, semi-structured patient journey simulation, and unstructured clinical note generation using large language models. The pipeline is designed to prioritise internal consistency across longitudinal patient records, while also capturing variation in writing style, note structure, and clinical detail. Additional mechanisms, including LLM-based validation and augmentation steps, are used to improve faithfulness, realism, and diversity of the generated notes. We release a dataset of 70 synthetic patients, each associated with 20-50 clinical notes spanning a full hospital journey. The dataset is provided at multiple levels of validation, enabling users to balance realism and scalability depending on their use case. This dataset supports the development, testing, and evaluation of clinical AI systems, including summarisation tools, coding models, and decision support systems, without reliance on real patient data.
[AI-37] TAVR-VLM: Risk-Conditioned Causal Grounding for Hallucination-Resistant Report Generation
Link: https://arxiv.org/abs/2606.26874
Authors: Zhixiang Lu, Xiwei Liu, Sifan Song, Changkai Ji, Anh Nguyen, Jionglong Su, Imran Razzak, Jinfeng Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Transcatheter Aortic Valve Replacement (TAVR) planning requires meticulous multimodal reasoning. However, adapting Multimodal Large Language Models (MLLMs) to this high-stakes domain is severely impeded by diagnostic hallucinations, where generated text lacks anatomical grounding. To address this, TAVR-VLM is introduced: a novel framework featuring Risk-Conditioned Causal Grounding Attention (R-CGA) that instantiates a model-internal "Risk → Region → Word" structural grounding pathway. R-CGA compresses multimodal inputs into a causal risk bottleneck, purifying dense visual features into a global risk mask. During autoregressive generation, a support-projected causal consistency objective constrains token-level grounding within the risk-defined support mask. Evaluated on M³TAVR, a comprehensive 1,482-patient cohort, TAVR-VLM establishes a new state-of-the-art. It achieves an AUROC of 0.896, boosts CIDEr to 0.936, and drastically reduces the hallucination rate to 8.1%, thereby improving interpretability for evidence-based surgical AI.
[AI-38] Fortress and Gatekeeper: Theorizing Transitive Trust in Third-Party Cybersecurity Risk Governance
Link: https://arxiv.org/abs/2606.26866
Authors: Yijun Chen, Misita Anwar
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 21 pages, 2 Figures, 3 Tables
Abstract:Third-party vendors, such as analytics platforms, cloud services, identity providers, and software suppliers, are increasingly embedded in digital service delivery. While these arrangements enable scale and specialization, they also move customer data and security-relevant practices into environments that customers rarely see, select, or evaluate. This paper examines this problem through a document analysis of the November 2025 OpenAI-Mixpanel security incident. The incident serves as an illustrative case for showing how a security event in a vendor environment can become a governance and accountability problem for the focal organization that maintains the customer relationship. Drawing on organizational trust research and agency theory, the paper argues that third-party cybersecurity risk is both a trust relationship and a delegation problem. Customers trust the visible service provider, while the provider relies on vendors whose security practices are only partially visible and controllable. The paper develops the concept of transitive trust, where customer trust in a digital service depends on the security practices of vendors authorized by that service provider. It then presents the Fortress and Gatekeeper framework, which explains cybersecurity governance boundaries through trust and data flows rather than formal organizational ownership alone. The analysis develops four propositions concerning vendor integration, metadata exposure, vendor assurance, and data proliferation. The paper contributes to cybersecurity governance scholarship by explaining how delegated data processing creates customer-facing accountability and by identifying implications for vendor tiering, data classification, contractual design, continuous assurance, and data minimization.
[AI-39] LCAi: Life Cycle Assessment with big data fusion and retrieval-augmented generation-assisted interpretation
Link: https://arxiv.org/abs/2606.26857
Authors: Georgios Tsironis, Juan D. Medrano-Garcia, Gonzalo Guillen-Gosalbez
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages, 14 figures, 6 tables. Includes Supplementary Information
Abstract:The interpretation phase of life cycle assessment often lacks structured mechanisms for translating quantified improvement opportunities addressing environmental hotspots into actionable strategic pathways under technological, social, and policy uncertainty. To overcome this limitation, this study introduces a perspective-conditioned retrieval-augmented generation framework for LCA interpretation, in which multi-perspective retrieval and controlled synthesis are incorporated into artificial intelligence (AI)-assisted LCA. To operationalise large language models in LCA interpretation, a perspective fusion RAG architecture was developed, covering academic, industry, public discourse, and European Union (EU) funding datasets. Our approach comprises three steps: (1) a scenario anchor defining system boundaries and decarbonization targets, (2) a set of perspective-specific micro-queries with constrained retrieval, and (3) a neutral synthesis step integrating only ledger-stored outputs without further retrieval. The framework is demonstrated through a hydrogen-enabled diesel reduction use case in an Italian apple production facility using GPT-5 nano as the reasoning model. Overall, the structured retrieval and constrained synthesis are designed to mitigate the risk of hallucination while preserving cross-domain diversity. The approach presented can support more disciplined translation of impact results into strategic pathways and opens up new avenues for the use of advanced AI tools in LCA studies, particularly those focused on technologies that could be deployed at scale. This proof-of-concept demonstrates how AI-assisted, evidence-grounded interpretation can support implementation-oriented decision-making beyond conventional LCA studies.
[AI-40] Context-Aware Synthesis of Optimization Pipelines for Warehouse Optimization
Link: https://arxiv.org/abs/2606.26852
Authors: Janik Bischoff, Anne Meyer, Uta Mohring, Fabian Dunke, Maximilian Barlang, Özge Nur Subas, Hadi Kutabi, Stefan Nickel, Kai Furmans
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Order fulfillment in manual picker-to-goods warehouses involves interconnected decisions such as item assignment, order batching, and picker routing. While integrated models capture interactions between these decisions, practical warehouse systems often require decomposed approaches due to organizational boundaries, differing responsibilities, or limited data availability. Existing studies primarily evaluate algorithms for isolated subproblems or fixed subproblem combinations for specific warehouse settings, but lack a general mechanism to determine applicable algorithm configurations, compose them into valid solution pipelines, and assess their performance. With Context-Aware Synthesis of Optimization Pipelines (CASOP), we propose a framework for constructing and evaluating context-specific optimization pipelines and apply these to order fulfillment. The framework comprises: (1) a modular repository of algorithms for common order fulfillment problems; (2) semantic data and algorithm cards to describe warehouse context and algorithm requirements; (3) a taxonomy that structures order fulfillment problems into relevant subproblems; (4) a pipeline synthesizer that identifies applicable algorithms for a given warehouse context and composes all valid optimization pipelines; and (5) a pipeline evaluator that assesses all resulting pipelines. We demonstrate the framework on 7 benchmark instance sets covering four problem classes, resulting in 1,063,044 valid pipelines. The framework supports researchers and practitioners in designing, automatically synthesizing, and selecting valid, high-performing algorithmic pipelines for warehouse operations. The software is open-source and available at this https URL and this https URL. 
Keywords: Warehouse optimization, Algorithm selection, Pipeline synthesis, Order fulfillment
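The pipeline synthesizer in component (4) can be pictured as filtering algorithms by the data the warehouse context provides and then taking the Cartesian product across subproblems. The sketch below is a toy illustration under that reading, not CASOP's code: the algorithm names, stage names, and required-field sets are all hypothetical.

```python
from itertools import product

# Hypothetical algorithm cards: name -> (subproblem, required context fields).
ALGORITHM_CARDS = {
    "fifo_batching": ("batching", {"orders"}),
    "seed_batching": ("batching", {"orders", "layout"}),
    "s_shape_routing": ("routing", {"layout"}),
    "tsp_routing": ("routing", {"layout", "distances"}),
}


def synthesize_pipelines(stages, algorithm_cards, context_fields):
    """Enumerate every pipeline with one applicable algorithm per stage.

    An algorithm is applicable when all of its required fields are present in
    the context. Returns an empty list if any stage has no applicable algorithm.
    """
    options_per_stage = []
    for stage in stages:
        options = sorted(
            name
            for name, (algorithm_stage, required) in algorithm_cards.items()
            if algorithm_stage == stage and required <= context_fields
        )
        if not options:
            return []
        options_per_stage.append(options)
    return [tuple(choice) for choice in product(*options_per_stage)]
```

The number of pipelines grows multiplicatively with the options per stage, which is consistent with the abstract reporting over a million valid pipelines across its benchmark sets.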
[AI-41] The Capability Frontier: Benchmarks Miss 82% of Model Performance
Link: https://arxiv.org/abs/2606.26836
Authors: Bradley Fowler, Ryan Smith, Daniel Thi Graviet, William Myers, Joshua Greaves, Narmeen Fatimah Oozeer, Antía García, Philip Quirke, Amirali Abdullah, Fazl Barez, Shriyash Kaustubh Upadhyay
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing benchmarks typically report accuracy for a single model on a single run. This systematically understates real-world LLM capabilities, particularly under heterogeneous data distributions: (i) different models get different questions correct according to their specializations, and (ii) given a budget, multiple generations can be sampled and selectively retained. To quantify this gap, we introduce the Capability Frontier: a Pareto frontier over a set of models that characterizes the best achievable performance at each cost level under optimal selection across models and generations (i.e., via an oracle). Our construction corrects for two opposing biases: underestimation from single-model evaluation and overestimation from taking maxima over noisy samples. We study 21 LLMs across 16 widely used benchmarks spanning coding, reasoning, medicine, factuality, instruction following, and agentic tasks, comparing Capability Frontier performance at matched cost to each benchmark’s top-performing model. Correcting for single-model evaluation yields a 54% error rate reduction; additionally correcting for single runs yields an 82% improvement, with SOTA accuracy matched at 85% cost reduction. Complementing these empirical results, we use controlled probabilistic simulations to show that higher query topic entropy produces a near-monotonic increase in the performance gap between oracle routing and the best single model. Our findings suggest collective LLM capabilities are substantially underestimated, with implications for evaluation and deployment in data-heterogeneous, multi-domain settings.
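Two ideas in the abstract lend themselves to a short illustration: a Pareto frontier over (cost, accuracy) points, and oracle selection across models, where a question counts as solved if any model solves it. The sketch below is a simplified reading with hypothetical function names and data layout; it does not include the paper's correction for overestimation from taking maxima over noisy samples.

```python
def pareto_frontier(points):
    """Keep (cost, accuracy) points not dominated by a cheaper-or-equal, better one.

    Returns the frontier sorted by increasing cost with strictly increasing accuracy.
    """
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier = []
    best_accuracy = float("-inf")
    for cost, accuracy in ordered:
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


def oracle_accuracy(per_model_correct):
    """Fraction of questions answered correctly by at least one model.

    per_model_correct: dict mapping model name -> list of booleans, one per question.
    """
    results = list(per_model_correct.values())
    n_questions = len(results[0])
    solved = sum(
        1 for i in range(n_questions) if any(model[i] for model in results)
    )
    return solved / n_questions
```

When models specialise on different questions, `oracle_accuracy` exceeds the best single-model accuracy, which is the first of the two gaps the abstract describes.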
[AI-42] Computational Analysis of Heart Rate Variability in Healthy Adults
Link: https://arxiv.org/abs/2606.26816
Authors: María J. Lado, Arturo J. Méndez, Leandro Rodriguez-Liñares, Baltasar García Pérez-Schofield, Pedro Cuesta-Morales, Brais Iglesias-Otero, Xose A. Vila
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Heart Rate Variability (HRV) analysis is a key indicator of cardiac physiological state and aids in disease diagnosis. However, research on HRV parameters in healthy individuals remains limited, and no gold standard exists. This study evaluates HRV indices in 40 healthy adults (20 men, 20 women, aged 30-50) to improve HRV’s clinical utility. Using computational methods for signal processing and data analysis, time, frequency, and nonlinear indices were analyzed to address five questions: (1) normality, (2) stability, (3) correlation, (4) reproducibility, and (5) consistency. Key findings: (1) Time-domain and nonlinear indices, particularly global and LF (low frequency), follow normal distributions, with gender differences noted. (2) Most indices are stable except HF (high frequency)-related ones. (3) High correlations in HF-related indices suggest redundancy, indicating only one is necessary in studies. (4) Comparisons with the Fantasia database revealed less than 10% error for most indices, except SD2 and SDNN in women (greater than 15%). (5) Time-domain and nonlinear indices show low inter-study variability, while frequency-domain indices exhibit high variability, limiting cross-study comparisons. The selected indices (ApEn and IRRR for global variability, HRVi and SD2 for LF, and MADRR or rMSSD for HF) are best suited for accurately representing HRV components and enhancing its clinical and research relevance.
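Several of the indices named above have widely used textbook definitions, sketched below for a short series of RR intervals. The abstract does not state the exact formulas or estimator conventions the authors used, so the sample standard deviation and the Poincaré relations here are assumptions, and the RR values are made-up illustration data.

```python
import math
import statistics


def hrv_indices(rr_ms):
    """Compute a few time-domain and Poincare HRV indices from RR intervals (ms)."""
    # Successive differences between adjacent RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    sdnn = statistics.stdev(rr_ms)                             # overall variability
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # short-term variability
    sdsd = statistics.stdev(diffs)
    # Standard Poincare plot descriptors expressed through SDNN and SDSD.
    sd1 = math.sqrt(0.5) * sdsd
    sd2 = math.sqrt(max(2 * sdnn ** 2 - 0.5 * sdsd ** 2, 0.0))
    return {"sdnn": sdnn, "rmssd": rmssd, "sd1": sd1, "sd2": sd2}


# Hypothetical RR series in milliseconds, for illustration only.
rr = [812, 790, 845, 830, 801, 795, 860, 842]
idx = hrv_indices(rr)
for name, value in idx.items():
    print(f"{name}: {value:.2f} ms")
```

Real recordings are far longer than eight beats and usually need artifact filtering first; the sketch only shows how the indices relate to one another.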
[AI-43] Memory Depth Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents
Link: https://arxiv.org/abs/2606.26806
Authors: Haoliang Han
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Main paper with supplementary material included as ancillary file
Abstract:Long-running language agents need more than memory access. Retrieval systems can fetch past facts at query time, but they do not decide which experiences should continue to shape behavior after the working context is unloaded. We study this separate problem as memory depth: durable goal-conditioned tendencies written into a small parametric store. We introduce the loop-drift protocol, a controlled stress test in which the retrieval index remains intact while working context is unloaded and goal-conditioned behavior must persist under long-loop interference. We evaluate EVAF, a surprise- and valence-gated LoRA consolidation mechanism. Across GPT-2 and TinyLlama, retrieval is strongest on shallow factual recall (short-fact accuracy 0.956–0.973), while EVAF is strongest on goal persistence and post-unload recovery (0.812–0.904) with only 2–3 parametric writes per 200 events. Mechanism controls show that selective consolidation factorizes into two controllable dimensions: selection and actuation. Matched random gates isolate selection beyond sparse writing; fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that inner-loop write strength is model-dependent; and a Mistral-7B matched-gate inversion reveals asymmetric selection-actuation coupling under miscalibrated actuation. Public Memora event streams serve as an external diagnostic, exposing stale-memory invalidation as an unresolved boundary. Within this probe, selective parametric consolidation supplies memory depth distinct from and complementary to retrieval access.
[AI-44] MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG IJCNN2026
Link: https://arxiv.org/abs/2606.26793
Authors: Inderjeet Singh,Andrés Murillo,Motoyoshi Sekiya,Yuki Unno,Junichi Suga
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures. Accepted at the 2026 International Joint Conference on Neural Networks (IJCNN 2026), IEEE WCCI 2026; presented as an oral talk. Code and ART-SafeBench benchmark: this https URL
Abstract:Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search while conditioning candidate generation on retrieved context under an explicit novelty constraint. A deterministic Novelty Gate rejects any candidate matching the retrieval set under normalized comparison, allowing retrieval to inform search priors without enabling prompt copying. Across four attack surfaces on a multimodal agentic RAG target, MIRROR attains 76% ASR on image poisoning compared with 52% for baselines, 97% ASR on orchestrator attacks at half the query cost, and the lowest cross-surface variance (coefficient of variation 0.47). In contrast, specialized baselines collapse across surfaces: suffix optimization reaches 79% ASR on text poisoning but 1% on direct queries. We release ART-SafeBench with 41,815 in-package records and runtime adapters yielding 41,991+ total records across four surfaces.
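The deterministic Novelty Gate described above can be pictured as an exact-match check after normalization. The sketch below is one plausible reading, not the authors' implementation: the abstract does not say which normalization is applied, so the steps chosen here (Unicode NFKC, case folding, punctuation stripping, whitespace collapsing) and the function names are assumptions.

```python
import re
import unicodedata


def normalize(text):
    """Canonicalize text so trivially edited copies compare equal (assumed rules)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(text.split())         # collapse whitespace


def novelty_gate(candidate, retrieved):
    """Return True only if the candidate matches nothing in the retrieval set."""
    seen = {normalize(item) for item in retrieved}
    return normalize(candidate) not in seen


# Hypothetical retrieval set and candidates.
retrieved = ["Ignore all previous instructions!"]
print(novelty_gate("ignore  ALL previous instructions", retrieved))  # copy, rejected
print(novelty_gate("Summarize the attached invoice", retrieved))     # novel, accepted
```

Because the gate is a pure function of the candidate and the retrieval set, the same inputs always give the same accept or reject decision, which is what makes it deterministic.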
[AI-45] EGG: An Expert-Guided Agent Framework for Kernel Generation
Link: https://arxiv.org/abs/2606.26758
Authors: Yaochen Han,Ke Fan,Hongxu Jiang,Wanqi Xu,Weiyu Xie,Runhua Zhang,Chenhui Zhu,Yixiang Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:High-performance GPU kernels are critical for reducing the exponentially growing computational costs of large language models (LLMs), but their development heavily relies on manual tuning by domain experts. While recent advances in LLM-based approaches show promise for automating kernel generation, they still struggle to achieve both correctness and high performance. This limitation primarily arises from the lack of domain-specific optimization guidance, hindering effective exploration of the optimization space. We propose EGG, an Expert-Guided Agent Framework for Kernel Generation, which incorporates expert optimization principles to guide LLMs’ decisions. Inspired by expert workflows, we decompose kernel generation into two hierarchical stages: 1) algorithmic structure design, which establishes a high-quality computational structure as a foundation; 2) hardware-specific tuning, which performs targeted adjustments through parallel mapping, tensor tiling, and memory optimization. This staged decomposition defines explicit optimization objectives, structuring the design space to achieve progressive refinement. To this end, a stage-aware multi-agent collaboration mechanism is designed for inter- and intra-stage context management, ensuring stable optimization trajectories. Experiments on KernelBench and real-world workloads show that EGG achieves a 2.13x average speedup over PyTorch, outperforming existing agent-based and RL-based approaches.
[AI-46] Socratic agents for autonomous scientific discovery in high-dimensional physical systems
Link: https://arxiv.org/abs/2606.26722
Authors: Xianrui Zeng,Pengfei Liu,Yirui Zang,Yang Shen,Fei Yu,Chunlei Yu,Minghao Liu,Yang Du
Subjects: Artificial Intelligence (cs.AI); Optics (physics.optics)
Comments: 27 pages, 5 figures
Abstract:The automation of scientific discovery has reached an inflection point. While AI systems now operate instruments, optimize parameters and generate hypotheses, most remain procedural: they execute workflows fixed by human designers. True autonomous science demands epistemic autonomy: the capacity to construct, challenge and revise physical explanations in response to evidence. Here we introduce AHOIS, a multi-agent AI scientist that embeds Socratic midwifery into closed-loop experimentation. A physics-critic agent interrogates hypotheses through causal questioning, constraint checking, counterexample generation and falsification-criteria formulation. We evaluate AHOIS on a real multimode-fibre optical platform, a high-dimensional system with complex wave transformations, indirect detection, environmental drift and multi-modal acquisition. Without prior encoding schemes, classifiers or speckle models, the system autonomously proposed and validated a random-interference encoding hypothesis, discovered task-adaptive sparse-measurement strategies, diagnosed distinct failure modes (encoding instability, fluorescence contamination and detector noise) and translated a published imaging protocol into an executable workflow on a non-original configuration. The discovered encoding yielded 16x16 measurements with effective rank 56.9 and classification accuracies of 76.97% on MNIST and 83.17% on Fashion-MNIST. Ablations show that Socratic interrogation improves physical consistency, hypothesis completeness, uncertainty calibration and experimental-plan validity. These results establish a route from workflow automation towards evidence-grounded, self-correcting autonomous discovery in complex physical environments.
[AI-47] LithoDreamer: A Physics-Informed World Model for Multi-Stage Computational Lithography
Link: https://arxiv.org/abs/2606.26713
Authors: Yuqi Jiang,Yumeng Liu,Zimu Li,Jinyuan Deng,Qian Jin,Yucheng Cui,Yu Li,Xunzhao Yin,Qi Sun,Cheng Zhuo
Subjects: Artificial Intelligence (cs.AI)
Comments: Correspondence to: Qi Sun qisunchn \at zju \dot edu \dot cn
Abstract:As semiconductor technology nodes scale, computational lithography is essential for ensuring yield and performance. However, lithography is a continuous physical process involving mask optimization, optical imaging, resist exposure, and development, which existing models fail to capture. To overcome this limitation, we present LithoDreamer, the first physics-informed World Model (WM) framework for computational lithography, which formulates the "Layout-Mask-Resist Image-After Development Image (ADI)" pipeline as a decision-driven multi-step evolution system. LithoDreamer captures feature changes between adjacent states to model stage-specific physics-informed latent spaces, in which it controls process intervention exploration and drives subsequent state transitions. To achieve interpretable intervention optimization without continuous supervision, we propose a contrastive variational optimization paradigm that contrasts the latent differences between intervention paths with variational evolution constraints, guiding the model to generate evolutions consistent with real lithography physics. Experiments show LithoDreamer achieves state-of-the-art performance in forward evolution and inverse planning. Our lithography dataset is publicly available at GitHub (this https URL).
[AI-48] Kalman Prototypical Networks for Few-shot Fault Detection in Combined Cycle Gas Turbines
Link: https://arxiv.org/abs/2606.26710
Authors: Mohammed Ayalew Belay,Lucas Ferreira Bernardino,Adil Rasheed,Rubén M. Montañés,Pierluigi Salvo Rossi
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Combined-cycle gas turbines (CCGTs) play a key role in modern power generation, offering both high efficiency and reduced environmental impact. However, their complex thermo-fluid and mechanical interactions complicate fault detection, particularly when labeled fault data are scarce. In this paper, we introduce the Kalman Prototypical Network (KPN), a metric-based few-shot learning (FSL) framework specifically tailored for CCGT fault diagnosis. We model the evolution of class prototypes as latent stochastic states in a dynamic system to reduce episodic variance and improve robustness in embedding representation. Synthetic data sets generated with a high-fidelity Modelica-based dynamic simulation of an offshore CCGT system were used, simulating both normal operation and progressive leak faults under transient conditions. Application of the proposed framework to simulated leak fault detection tasks demonstrates that KPN outperforms conventional FSL methods such as Matching Networks, Relation Networks, and MAML in both accuracy and stability under varying support and query configurations. The proposed framework significantly improves training convergence and generalization by stabilizing class representations, making it well-suited for real-world CCGT fault detection where labeled data are limited.
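Treating a class prototype as a latent state that is filtered across episodes can be sketched with a per-dimension scalar Kalman filter. The abstract does not give KPN's state-space model, so everything specific here is an assumption made for illustration: the random-walk dynamics, the fixed process noise `q` and observation noise `r`, initialization from the first episode, and the names `episode_prototype` and `KalmanPrototype`.

```python
def episode_prototype(support):
    """Mean of the support embeddings for one class in one episode."""
    dim = len(support[0])
    return [sum(vec[i] for vec in support) / len(support) for i in range(dim)]


class KalmanPrototype:
    """Per-dimension Kalman filter over a class prototype (assumed random-walk model)."""

    def __init__(self, q=1e-3, r=0.5):
        self.q = q        # process noise: how fast the true prototype may drift
        self.r = r        # observation noise: episodic sampling variance
        self.mean = None  # filtered prototype
        self.var = None   # per-dimension uncertainty

    def update(self, observed):
        if self.mean is None:
            # Initialize from the first episode's prototype.
            self.mean = list(observed)
            self.var = [self.r] * len(observed)
            return self.mean
        for i, z in enumerate(observed):
            p = self.var[i] + self.q               # predict step
            gain = p / (p + self.r)                # Kalman gain
            self.mean[i] += gain * (z - self.mean[i])
            self.var[i] = (1 - gain) * p           # correct step
        return self.mean


# Hypothetical 2-D support embeddings for one class over three episodes.
episodes = [
    [[1.3, -0.7], [1.1, -0.9]],
    [[0.7, -1.3], [0.9, -1.1]],
    [[1.2, -0.8], [1.0, -1.0]],
]
kp = KalmanPrototype()
for support in episodes:
    kp.update(episode_prototype(support))
print(kp.mean, kp.var)
```

The filtered prototype moves less from episode to episode than the raw per-episode mean, which is the variance-reduction effect the abstract attributes to modelling prototypes as stochastic states.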
[AI-49] Algorithmic Foundations of Deep Learning: Complexity-Theoretic Rates and a Characterization of Universal Approximation
Link: https://arxiv.org/abs/2606.26705
Authors: Anastasis Kratsios,Simone Brugiapaglia,Bum Jun Kim,Gregory Cousins,Haitz Sáez de Ocáriz Borde
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Numerical Analysis (math.NA)
Comments: 27 Main Body, 48 Page Proofs, 9 Figures
Abstract:Feedforward neural network (NN) expressivity is typically studied by emulating optimal basis-expansion schemes. While powerful, this perspective is incomplete: it primarily captures complexity through regularity, and therefore does not distinguish intuitively simple and complicated objects with comparable regularity, such as the square-root function and a typical Brownian path. The guiding message is that neural networks should be viewed not only as flexible basis functions, but also as models of computation. If a function is computable by a real-valued circuit over a prescribed elementary gate language, then it can be computed to comparable accuracy by an NN with explicit depth, width, and non-zero-parameter bounds controlled by the depth, width, gate count, and gate structure. Thus, neural-network complexity is not governed by regularity alone, but also by algorithmic complexity. We then show that any definable NN model satisfying a natural parallelization condition, allowing possibly multivariate non-linearities such as attention or layer normalization, is a universal approximator if and only if it contains a non-affine nonlinearity. The scope of our theory is illustrated by deducing universal approximation guarantees for continuous functions, minimax-optimal approximation guarantees for Besov classes, logarithmic-error complexity for holomorphic functions, and by showing that NNs can emulate numerical algorithms such as Newton-Raphson root finding and power iteration without architecture-specific arguments. Its precision is illustrated by shortest-path computation on $k$-vertex graphs: compiling the tropical dynamic-programming circuit yields NNs with $O(\log(1/\epsilon))$ non-zero parameters, exponentially improving in $1/\epsilon$ over the generic $O(\epsilon^{-c} k^2)$ Lipschitz-approximation scale, for a constant $c > 0$.
MSC classes: 68T07, 41A46, 68Q06, 68Q25, 41A25, 03C64, 65D15
ACM classes: F.1.3; F.2.1; G.1.2; I.2.6
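The shortest-path example in the abstract above rests on dynamic programming over the tropical (min, +) semiring. The sketch below shows that classical computation as repeated min-plus matrix squaring. It is a standard textbook formulation rather than the paper's circuit, the compilation into a neural network is not shown, and the 4-vertex weight matrix is hypothetical.

```python
INF = float("inf")


def min_plus(a, b):
    """Tropical matrix product: addition becomes min, multiplication becomes +."""
    n = len(a)
    return [
        [min(a[i][m] + b[m][j] for m in range(n)) for j in range(n)]
        for i in range(n)
    ]


def all_pairs_shortest_paths(weights):
    """Square the one-hop distance matrix until paths of k-1 edges are covered."""
    n = len(weights)
    dist = [[0 if i == j else weights[i][j] for j in range(n)] for i in range(n)]
    hops = 1
    while hops < n - 1:
        dist = min_plus(dist, dist)  # doubles the number of edges considered
        hops *= 2
    return dist


# Hypothetical directed graph; INF marks a missing edge.
w = [
    [0, 4, 1, INF],
    [INF, 0, INF, 1],
    [INF, 2, 0, 5],
    [INF, INF, INF, 0],
]
d = all_pairs_shortest_paths(w)
print(d[0])
```

Each squaring is a fixed arrangement of min and + gates, so the whole computation is a circuit whose depth grows with the logarithm of the vertex count; this is the kind of structure the abstract refers to as a tropical dynamic-programming circuit.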
[AI-50] Learning Motion Feasibility from Point Clouds in Cluttered Environments
Link: https://arxiv.org/abs/2606.26700
Authors: Sajid Ansari,Arthi,Girish Varma,Antony Thomas
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Motion feasibility prediction plays a central role in robotics, particularly in task and motion planning and manipulation. A major bottleneck for this problem in cluttered environments is that infeasible planning attempts by sampling-based motion planners (SBMPs) can incur substantial computational cost. Also, existing approaches for infeasibility certification are limited to low-dimensional configuration spaces and often assume simplified geometric environments represented by primitive objects with known parameters. We study the complementary problem of learning motion feasibility prediction directly from raw RGB-D observations for a 7-DOF manipulator operating in realistic cluttered scenes. We introduce the first large-scale benchmark for this setting, comprising 2.7M grasp feasibility labels over 88 scanned objects and 190 cluttered tabletop scenes. We benchmark three representative classifier families spanning MLP-based, volumetric-CNN, and point-cloud-based Transformer architectures under matched training conditions. Our best model, GRASPFC-PTX (a point-cloud transformer), achieves an AUROC of 0.996 on novel objects while providing predictions significantly faster than SBMPs.
[AI-51] NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research
Link: https://arxiv.org/abs/2606.26671
Authors: Qiaobo Hao,Yangqian Wu,Shunyi Wang,Zhongjian Zhang,Ziqun Li,Yayin He,Muqing Li,Chen Zhong
Subjects: Artificial Intelligence (cs.AI)
Comments: 29 pages, 8 figures
Abstract:Post-training alignment determines the reasoning and human-preference-following capabilities of large language models, yet most existing works withhold detailed data construction, filtering rules and training recipes, which hinders community reproducibility and lightweight model optimization. This work presents NebulaExp, a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, covering two orthogonal model branches: a general instruct model and a complex-reasoning-specialized model. We curate a raw corpus of 3.84M multi-source SFT samples and a 200K verifiable RL candidate pool, and design an end-to-end data processing stack including response distillation, multi-dimensional cross-verification filtering, fine-grained difficulty grading, task classification and diversity-aware sampling. For the Instruct branch, our three-stage optimized supervised fine-tuning approach NebulaExp-Ins-SFT improves the average benchmark score from the 55.01 baseline of Qwen3-8B-nothink to 60.99. GRPO reinforcement learning then further elevates the average score to 61.85. For the Reasoning branch, medium-difficulty GRPO RL improves the average reasoning score from 73.88 to 75.17. To address RL’s dependency on task verifiers, we systematically investigate single-teacher and multi-teacher OPD (MOPD): single-teacher OPD uses merely 4K instruction-following samples and outperforms the RL baseline by 3.26 points on IFEval with a +4.43 average overall gain; MOPD fuses four domain-specialist teachers with merely 10K samples, lifting average performance by 4.18 over the base model. This report provides a fully reproducible empirical post-training recipe for 8B-scale LLMs, and comprehensively dissects the capability trade-offs among instruction adherence, mathematical reasoning, code generation and general knowledge.
[AI-52] SKILL-DISCO: Distilling and Compiling Agent Traces into Reusable Procedural Skills
Link: https://arxiv.org/abs/2606.26669
Authors: Zhongxin Guo,Danrui Qi,Hanwen Gu,Peng Cheng,Yongqiang Xiong
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agents often repeatedly solve similar task instances from scratch, leading to unnecessary reasoning cost and long execution traces. Prior work has explored workflow reuse and executable skill induction, but it remains unclear which task scenarios admit procedural skills and how the shared procedural structure should be represented across successful traces. We study this problem in FSM-defined scenarios, where successful traces can be viewed as paths in an unknown transition graph, and formulate procedural skills as reusable parameterized control-flow subgraphs. Based on this view, we introduce SkillDisCo, a distillation-and-compilation framework that distills reusable PFSM subgraphs from successful traces and compiles them into callable, executable, and verifiable procedural skills. Experiments on ALFWorld and WebArena show that SkillDisCo improves success rates and reduces agent turns across benchmarks and model scales, demonstrating the benefits of representing shared experience as reusable execution structures.
[AI-53] TGHE: Template-based Graph Homomorphic Encryption for Privacy-Preserving GNN Inference in Edge-Cloud Systems
Link: https://arxiv.org/abs/2606.26664
Authors: Ngoc Bao Anh Le,Thai T. Vu,John Le,Heath Cooper,Jun Shen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 3 tables. Accepted at IEEE ICWS 2026
Abstract:Existing homomorphic encryption (HE)-based GNN systems adopt a graph-centric paradigm that couples per-query cost to global graph size, limiting evaluations to at most ~20k nodes and making them incompatible with dynamic, large-scale financial graphs. We propose TGHE (Template-based Graph Homomorphic Encryption), an ego-centric framework that resolves this by exploiting a template phenomenon: local computation trees in transaction graphs converge into a small set of structural shapes. TGHE canonicalizes ego-graphs at the edge and packs structurally identical trees into shared CKKS ciphertexts for SIMD-parallel encrypted inference, with two long-tail optimizers (Approximate Template Fitting and Topology Collapse) ensuring full SIMD coverage. On DGraphFin (3.7M nodes, 4.3M edges), TGHE-Collapse achieves a 66.9x speedup over the sequential encrypted baseline with less than 0.002 AUC loss.
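The template idea above, where many ego-graphs share a handful of computation-tree shapes, can be illustrated by grouping rooted trees under a canonical shape key. The sketch uses the classic AHU parenthesis encoding as a stand-in: the abstract does not describe TGHE's actual canonicalization, no encryption or CKKS packing is performed, and the tree data and function names are hypothetical.

```python
from collections import defaultdict


def canonical_shape(tree, root):
    """AHU-style encoding: sorted child encodings wrapped in parentheses."""
    children = sorted(canonical_shape(tree, child) for child in tree.get(root, []))
    return "(" + "".join(children) + ")"


def group_by_template(ego_trees):
    """Bucket ego-trees whose computation trees have the same shape."""
    groups = defaultdict(list)
    for name, (tree, root) in ego_trees.items():
        groups[canonical_shape(tree, root)].append(name)
    return dict(groups)


# Hypothetical ego-trees given as (child adjacency, root) pairs.
ego_trees = {
    "u1": ({"a": ["b", "c"], "b": ["d"]}, "a"),
    "u2": ({"x": ["y", "z"], "z": ["w"]}, "x"),  # same shape as u1, children reordered
    "u3": ({"p": ["q"]}, "p"),
}
templates = group_by_template(ego_trees)
for shape, members in templates.items():
    print(shape, members)
```

Trees that land in the same bucket run the same sequence of operations, which is what would allow them to share the slots of one SIMD-style batch.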
[AI-54] Zero-Shot Size Transfer for Neural ODEs on Sparse Random Graphs: Graphon Limits and Adjoint Convergence
Link: https://arxiv.org/abs/2606.26662
Authors: Mingsong Yan,Zhida Wang,Sui Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:
Abstract:Graph Neural Differential Equations (GNDEs) model continuous-time graph dynamics by parameterizing Neural ODE velocity fields with Graph Neural Networks. Their local, size-independent filters suggest a zero-shot size-transfer principle: train on a small graph and deploy on larger, similar graphs without retraining. We develop a quantitative theory for this principle on sparse random graphs sampled from graphons. We consider Graphon Neural Differential Equations (Graphon-NDEs) and adjoint Graphon-NDEs as the infinite-node limits of the forward and adjoint GNDE systems, and establish well-posedness. For an $n$-node random graph with sparsity parameter $\alpha_n$, we prove trajectory-wise convergence of GNDE solutions to Graphon-NDE solutions at rate $O((\alpha_n n)^{-1/2})$, up to logarithmic factors, with high probability. We also establish uniform-in-time convergence bounds for adjoint systems governing hidden-state and parameter gradients. We further study discretize-then-optimize (DTO) and optimize-then-discretize (OTD) training. Under explicit Euler discretization with $M$ steps, we show that DTO and OTD are asymptotically consistent, with hidden-state and local parameter-gradient discrepancies of orders $O(1/M)$ and $O(1/M^2)$, respectively, up to sparsity and logarithmic factors. Experiments on HSBM and tent graphons support the theoretical rates, while zero-shot transfer experiments across four graphon classes demonstrate accurate deployment of learned GNDEs on larger independently sampled graphs.
[AI-55] LAMP: Lane-Aligned Motion Primitives for Feasible Trajectory Prediction ITSC2026
Link: https://arxiv.org/abs/2606.26661
Authors: Sangjin Han,Hoseong Jung,Jeongtae Her,Changhyun Choi,H. Jin Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IEEE ITSC 2026, 6 pages
Abstract:Motion forecasting is essential for autonomous driving systems to enable safe decision-making and planning in complex driving scenarios. While existing predictors excel at minimizing standard displacement errors, they often overlook the adherence to lane topology of multimodal predictions, particularly for lower-probability modes. Consequently, predicted trajectories may violate physical and logical constraints, making the prediction set unreliable for safety-critical planning. In this paper, we propose LAMP (Lane-Aligned Motion Primitives), a topology-aware forecasting framework that anchors multimodal prediction to structured motion primitives aligned with lane topology. Specifically, we use a VQ-VAE to learn shape-aware motion primitives as discrete intention queries, capturing spatiotemporal patterns beyond endpoint-based intentions. We further introduce a feasibility-aware intention selector trained with a lane-topology prior for filtering unreachable intention queries, guiding the decoder to prioritize topology-consistent intentions while preserving behavioral diversity. Extensive experiments on the Argoverse 2 dataset demonstrate that LAMP achieves prediction accuracy comparable to state-of-the-art baselines while outperforming them in feasibility and diversity metrics.
[AI-56] Autoformalization of Agent Instructions into Policy-as-Code ICML2026
Link: https://arxiv.org/abs/2606.26649
Authors: Adam Mondl,Matthew Maisel,John H. Brock
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD), ICML 2026
Abstract:Agent safety in high-stakes domains requires formal policy enforcement, but most existing approaches either rely on probabilistic guardrails (fine-tuned classifiers, prompt-based steering) that offer no formal guarantees, or on hand-coded symbolic enforcement that does not scale to the breadth of real policy specifications. We present an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural language policy documents into formally verified policies using an LLM-based generator-critic loop. The resulting policies are written in the Cedar Policy Language. On the MedAgentBench benchmark, our autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement in prior work.
[AI-57] Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents
Link: https://arxiv.org/abs/2606.26627
Authors: Nada Lahjouji,Ashwin Gerard Colaco
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures, 7 tables
Abstract:Large language model agents increasingly query databases, search document collections, call external APIs, remember past interactions, and act on a user’s behalf. As they move from answering questions to operating over sensitive data, privacy becomes harder to enforce. An agent touches many data sources, runs multi-step workflows, keeps state across sessions, and acts with delegated permissions. Sensitive information can therefore leak not only through its final answer but through the queries it issues, the intermediate results it handles, the memory it writes, and the messages it exchanges with other agents. We survey the privacy of LLM agents from a data-centric view, organizing the field around the data an agent touches rather than by attack type, and we use data agent as shorthand for an LLM agent that works with data. Research on these risks is active but scattered across retrieval-augmented generation, text-to-SQL interfaces, agent memory, prompt injection, access control, and contextual privacy. This survey brings that work together: we taxonomize the data sources an agent touches, the privacy risks each source creates, and the governance mechanisms that address them; we map the benchmarks used to measure these risks and identify what is missing; and we set out the open problems. Two findings recur: among governance mechanisms only information-flow control covers both compositional and cross-session inference leakage, the two least-protected risks; and no benchmark drives an agent across its data surfaces under one privacy policy, the instrument the field most lacks. Our goal is a reference that situates the scattered literature and gives future work a common framing.
[AI-58] Discovering Millions of Interpretable Features with Sparse Autoencoders
Link: https://arxiv.org/abs/2606.26620
Authors: XinYang He,Wei Wang,Bing Zhao,Xuan Ren,WenBo Li,WeiXu Qiao,Hu Wei,Lin Qu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse autoencoders (SAEs) have emerged as a powerful tool for decomposing superposed language model representations into sparse and interpretable features. However, training SAEs is computationally expensive, and available open-source SAE models remain limited. In this work, we introduce Qwen3-Instruct SAE, a comprehensive suite of SAEs trained on the Qwen3 instruction-tuned model family, covering Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. For Qwen3-1.7B and Qwen3-4B, we train layer-wise SAEs at three key activation sites: residual streams, MLP outputs, and attention outputs. For Qwen3-8B, we train SAEs on a subset of residual stream layers. We systematically evaluate these SAEs using both activation-level reconstruction metrics and model-level recovery metrics, revealing distinct sparsity–fidelity trade-offs across layers and components. Finally, we demonstrate the utility of Qwen3-Instruct SAE through a refusal-steering case study, showing that selected SAE features can causally steer instruction-tuned Qwen3 models toward refusal behavior. Our release provides a practical resource for studying sparse representations, feature-level mechanisms, and behavioral interventions in instruction-tuned language models.
[AI-59] LLM-based Models for Detecting Emerging Topics in Service Feedback
Link: https://arxiv.org/abs/2606.26595
Authors: Mahsa Tavakoli,Ruth Bankey,Cristián Bravo
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Enhancing the analysis of service feedback is essential for public sector organizations, particularly tax administrations, where trust and compliance depend on fair and effective service delivery. As feedback volumes grow, identifying emerging service quality issues and potential disparities across diverse populations becomes increasingly challenging. Traditional approaches often rely on manual review or static expert-defined indicators, limiting scalability and the ability to capture complex patterns in textual feedback. This paper presents a novel methodology that integrates large language models (LLMs), statistical techniques, and human-AI collaboration to improve multilingual customer feedback analysis. The primary objective is to detect emerging service quality topics that may also reveal potential inequities in service delivery. Our framework combines fine-tuned, quantized LLMs with expert oversight to produce accurate, computationally efficient, and context-aware analyses. The proposed approach was evaluated using similarity analysis and assessments from experienced tax officers, demonstrating stronger alignment with expert judgments than baseline models. By incorporating a human-in-the-loop framework, the methodology reduces LLM fabrication while improving the reliability and relevance of generated insights. The results demonstrate the practicality of combining LLMs with human expertise to support scalable, evidence-based decision-making in public sector organizations. This work contributes to the development of responsible AI systems that enhance service quality, responsiveness, fairness, and public trust through more effective analysis of multilingual customer feedback. 
Submitted: Thu, 25 Jun 2026 04:46:11 UTC (v1)
[AI-60] Content-Based Smart E-Mail Dispatcher Using Large Language Models
Link: https://arxiv.org/abs/2606.26593
Authors: K. Paramesha,K R Sriram,Sujan Shetty,Shamanth Kishore,R. Tejaswini
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Email communication has become an integral part of personal and professional life, but handling its vast volume is still a significant issue for large organisations. Manually reading emails and forwarding their contents and attachments to the intended recipients through other instant messaging platforms is error-prone and time-consuming, which reduces productivity and creates undue stress. The main objective of this paper is to explore an alternative mechanism: automatically dispatching emails, based on their contents, to the WhatsApp groups of students in the relevant semesters and programs of an engineering college, so that information flows smoothly from one end of the organisation to the other. The dispatcher system is built using agents that query large language models (LLMs) to analyze the contents of emails and route them to the relevant groups of students. The system harnesses the capabilities of LLMs in analysing textual content for decision-making. Given a well-structured agent framework prompt that combines the email content with instructions and context, the system determines the relevant groups to which the email message is dispatched, thus providing the required information on time. The proposed system does not rely on labelled datasets and provides several benefits, including enhanced productivity and a reduction in the cognitive load associated with reading emails.
[AI-61] SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference
Link: https://arxiv.org/abs/2606.26587
Authors: Haoqian Meng,Yilun Luo,Yafei Zhao,Wenyuan Liu,Huaqing Zheng,Xindian Ma,Peng Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures
Abstract:Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that dominate block scales in FP4 quantization, and directly applying N:M sparsity masks discards moderate values, coupling sparsification loss with quantization error. We introduce SharQ, a training-free inference method that bridges activation sparsity and FP4 quantization through an online sparse–dense decomposition. For each activation tensor, SharQ generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone, quantizes it to FP4, and defines a dense residual relative to the quantized sparse backbone rather than the unquantized sparse values. A sparse FP4 GEMM processes the backbone while a dense FP4 GEMM compensates for both mask-induced activation loss and sparse-path quantization error. The two paths share a single FP4 weight payload with path-specific scale views, and a fused preparation kernel absorbs mask generation, residual construction, and layer normalization into one operator. SharQ requires no calibration data, retraining, or model-specific tuning. Evaluated on Llama-3.1-8B, Qwen2.5-7B, Qwen3-30B-A3B, and Qwen3-VL-8B, SharQ recovers 43–63% of the NVFP4-to-FP16 accuracy gap across language and vision-language tasks, and generalizes across NVFP4, HiF4, and MXFP4 formats. On an RTX 5090, SharQ delivers 2.2–2.4× latency reduction over FP16 and 1.2–1.4× throughput improvement over FP8 in language model serving, and up to 1.58× speedup on Wan2.2-T2V-A14B video generation when combined with SageAttention. Our code is available at this https URL.
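The sparse–dense decomposition described in this abstract can be illustrated with a small pure-Python sketch. This is not the authors' kernel: it assumes a 2:4 mask that keeps the two largest-magnitude values per group of four, and it replaces real NVFP4 block scaling with a toy quantizer that uses the E2M1 magnitude grid and a single float scale per tensor. The names `fake_fp4`, `nm_mask`, and `sharq_like_split` are hypothetical.

```python
# Illustrative sketch only; function names and the per-tensor scale are assumptions.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes


def fake_fp4(values):
    """Round values to a signed E2M1 grid using one shared scale (simplified)."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0.0] * len(values)
    scale = peak / FP4_GRID[-1]
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def nm_mask(values, n=2, m=4):
    """Mark the n largest-magnitude entries in each group of m."""
    mask = [False] * len(values)
    for start in range(0, len(values), m):
        group = range(start, min(start + m, len(values)))
        for i in sorted(group, key=lambda i: abs(values[i]), reverse=True)[:n]:
            mask[i] = True
    return mask


def sharq_like_split(x, n=2, m=4):
    """Return (mask, quantized sparse backbone, quantized dense residual)."""
    mask = nm_mask(x, n, m)
    sparse = [v if keep else 0.0 for v, keep in zip(x, mask)]
    sparse_q = fake_fp4(sparse)
    # The residual is taken against the *quantized* backbone, so it also
    # carries the sparse path's quantization error.
    residual = [v - s for v, s in zip(x, sparse_q)]
    residual_q = fake_fp4(residual)
    return mask, sparse_q, residual_q


x = [8.0, 0.3, -0.2, 5.5, 0.1, -7.0, 0.4, 0.05]  # two outliers per group
mask, sparse_q, residual_q = sharq_like_split(x)
approx = [s + r for s, r in zip(sparse_q, residual_q)]
```

On this hand-picked vector the two-path reconstruction `approx` has a smaller worst-case error than `fake_fp4(x)` applied directly, because the residual path gets its own, much smaller scale. That is a property of this example, not a general guarantee, and the sketch says nothing about the GEMM or kernel-fusion side of the method.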
[AI-62] A Multi-Level Validation and Traceability Framework for AI-Generated Telescope Scheduling Decisions
Link: https://arxiv.org/abs/2606.26585
Authors: Hengchu Xiao,Chuanjun Wang
Categories: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 25 pages, 8 figures, Published in Universe
Abstract:With the gradual introduction of AI into telescope scheduling, AI-based decision-making has shown advantages in handling complex multi-constraint problems. However, its outputs often suffer from inconsistent data references, reasoning errors, and non-executable decisions, limiting applicability in high-reliability observational tasks. In this work, we propose a multi-level validation and traceable reasoning framework that performs systematic reliability verification of AI-generated decisions prior to execution, and enables explicit representation of the reasoning process to support traceable decision-making. The framework integrates data reference validation, logical consistency checks, and observational and instrumental constraint verification to filter and correct invalid decisions. It also introduces atomic reasoning units and their dependency relationships, representing scheduling decisions as a sequence of interconnected reasoning steps that support error localization and post hoc analysis. Experiments show that the framework improves executability and reliability of AI scheduling and reduces loss of transient opportunities. In particular, feedback correction and structured validation of reasoning steps enhance the ability to repair and block erroneous decisions, especially in complex scenarios. Compared with pure AI methods, the framework-enhanced approach maintains flexibility while substantially improving reliability and executability. These results demonstrate a feasible and verifiable pathway for applying AI to high-reliability astronomical observation scheduling.
[AI-63] EvoOptiGraph: Weakness-Driven Coevolution via Graph-Based Structural Generation for Optimization Modeling
Link: https://arxiv.org/abs/2606.26578
Authors: Qingcan Kang,Mingyang Liu,Xiaojin Fu,Shixiong Kai,Tao Zhong,Mingxuan Yuan
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Automating optimization modeling from natural language with large language models (LLMs) faces two key challenges. First, training corpora lack structural diversity. Second, data generation pipelines remain static and decoupled from model learning. To address these challenges, we propose EvoOptiGraph, a novel framework where data and model co-evolve, driven by model weaknesses. EvoOptiGraph represents each mixed-integer linear program (MILP) as an attributed bipartite graph and applies validity-preserving evolutionary operators to generate structurally diverse instances. The evolved graphs are converted into solver code and natural language via deterministic compilation and verified back-translation. Training proceeds in two stages: supervised fine-tuning (SFT) on an initial dataset, followed by reinforcement learning with verifiable rewards (RLVR), where graph-derived weakness signals guide the generation of new instances targeting the model’s failures. This forms a closed loop that continuously updates the training distribution. Empirical results on six public datasets show that EvoOptiGraph significantly outperforms larger generalist models, agentic methods, and specialized baselines in accuracy, executability, and generalization. These results demonstrate that targeted data-model coevolution is an effective strategy for improving LLMs on optimization modeling tasks.
[AI-64] IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control
Link: https://arxiv.org/abs/2606.26575
Authors: Chenlong Liu,Zhuohui Zhang,Xinyan Chen,Zhipeng Wang,Bin Cheng,Bin He
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures
Abstract:Complex multi-agent control tasks remain challenging for traditional rule-based and model-based approaches, motivating the adoption of learning-based methods. However, learning-based methods often struggle with sim-to-real transfer because they rely on accurate dynamics modeling or system identification and learn policies in low-level control spaces that are highly sensitive to dynamics mismatch, making them costly and fragile in complex environments. To address this issue, we propose a sim-to-real method for multi-agent control, which is insensitive to dynamics mismatch via effect alignment. Our method combines random environmental structure with discrete semantic actions through closed-loop control, elevating policy learning to a semantic abstraction level. Additionally, we develop an action synchronization mechanism that mitigates inter-agent action timing mismatches, thereby enhancing the temporal consistency of the system. Experiments on four multi-agent navigation tasks demonstrate that our method substantially improves training efficiency over mainstream transfer methods and achieves higher success rates in real-world scenarios, thereby improving the robustness and deployment stability of multi-agent systems under dynamics mismatch.
[AI-65] Explainable Ensemble-Based Machine Learning Models for Detecting the Presence of Cirrhosis in Hepatitis C Patients
Link: https://arxiv.org/abs/2606.26561
Authors: Abrar Alotaibi,Lujain Alnajrani,Nawal Alsheikh,Alhatoon Alanazy,Salam Alshammasi,Meshael Almusairii,Shoog Alrassan,Aisha Alansari
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Hepatitis C is a liver infection caused by a virus, which results in mild to severe inflammation of the liver. Over many years, hepatitis C gradually damages the liver, often leading to permanent scarring, known as cirrhosis. Patients sometimes have moderate or no symptoms of liver illness for decades before developing cirrhosis. Cirrhosis typically worsens to the point of liver failure. Patients with cirrhosis may also experience brain and nerve system damage, as well as gastrointestinal hemorrhage. Treatment for cirrhosis focuses on preventing further progression of the disease. Detecting cirrhosis earlier is therefore crucial for avoiding complications. Machine learning (ML) has been shown to be effective at providing precise and accurate information for use in diagnosing several diseases. Despite this, no studies have so far used ML to detect cirrhosis in patients with hepatitis C. This study obtained a dataset consisting of 28 attributes of 2038 Egyptian patients from the ML Repository of the University of California at Irvine. Four ML algorithms were trained on the dataset to diagnose cirrhosis in hepatitis C patients: a Random Forest, a Gradient Boosting Machine, an Extreme Gradient Boosting, and an Extra Trees model. The Extra Trees model outperformed the other models achieving an accuracy of 96.92%, a recall of 94.00%, a precision of 99.81%, and an area under the receiver operating characteristic curve of 96% using only 16 of the 28 features.
[AI-66] PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting
Link: https://arxiv.org/abs/2606.26549
Authors: Ao Hu,Liangjian Wen,Jiang Duan,Yong Dai,He Yan,Dongkai Wang,Jun Wang,Yukun Zhang,Ruoxi Jiang,Zenglin Xu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Long-term time series forecasting (LTSF) plays a crucial role in fields such as energy management, finance, and traffic prediction. Transformer-based models have adopted patch-based strategies to capture long-range dependencies, but accurately modeling shape similarities across patches and variables remains challenging due to scale differences. To address this, we introduce patch-mean decoupling (PMD), which separates the trend and residual shape information by subtracting the mean of each patch, preserving the original structure and ensuring that the attention mechanism captures true shape similarities. Furthermore, to more effectively model long-range dependencies and capture cross-variable relationships, we propose Trend Restoration Attention (TRA) and Proximal Variable Attention (PVA). The former module reintegrates the decoupled trend from PMD while calculating the attention output. The latter focuses cross-variable attention on the most relevant, recent time segments to avoid overfitting on outdated correlations. Combining these components, we propose PMDformer, a model designed to effectively capture shape similarity in long-term forecasting scenarios. Extensive experiments indicate that PMDformer outperforms existing state-of-the-art methods in stability and accuracy across multiple LTSF benchmarks. The code is available at this https URL.
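The patch-mean decoupling step in this abstract (subtracting each patch's mean so that attention compares shapes rather than levels) can be sketched in a few lines. This is a minimal illustration, assuming non-overlapping patches of equal length; the function names are hypothetical, and how PMDformer feeds the means back in through TRA is not shown.

```python
def patch_mean_decouple(series, patch_len):
    """Split a series into patches; return (patch means, mean-free shapes)."""
    if patch_len <= 0 or len(series) % patch_len != 0:
        raise ValueError("series length must be a positive multiple of patch_len")
    means, shapes = [], []
    for start in range(0, len(series), patch_len):
        patch = series[start:start + patch_len]
        mu = sum(patch) / patch_len
        means.append(mu)                        # trend component of the patch
        shapes.append([v - mu for v in patch])  # residual shape component
    return means, shapes


def restore(means, shapes):
    """Invert the decoupling by adding each patch mean back."""
    return [mu + v for mu, patch in zip(means, shapes) for v in patch]


# Two patches with the same shape at very different levels.
series = [1.0, 2.0, 3.0, 101.0, 102.0, 103.0]
means, shapes = patch_mean_decouple(series, patch_len=3)
```

After decoupling, both patches have the same residual shape even though their levels differ by 100, which is the scale difference the abstract says obscures shape similarity. `restore(means, shapes)` returns the original series, so no information is lost.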
[AI-67] CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry
Link: https://arxiv.org/abs/2606.26538
Authors: Huzama Ahmad,Cao Viet Hai Nam,Se-Young Yun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, 5 tables
Abstract:Deep Transformers are composed of uniformly stacked residual blocks, yet their deepest layers often add little value. We present two efficiency methods that exploit this asymmetry. CascadeFormer tapers width with depth to match the uneven information flow across layers, achieving comparable perplexity to a uniform baseline at the same training budget while reducing latency by 8.6% and increasing throughput by 9.4%. CascadeFlow Pruning removes layers using accumulated training gradients, with no post hoc analysis. It outperforms standard heuristics on perplexity and rank-stability and stays competitive on downstream accuracy. To motivate these methods, we propose Gradient Fan-in Asymmetry (GFA) as a structural account of why deeper layers contribute less. In Pre-LayerNorm residual stacks, the gradient at a layer is the sum of an identity path and all downstream functional paths, producing a gradient fan-in that decays linearly with depth (and quadratically under deep supervision), yielding richer gradients for early layers and sparser ones for later layers. We provide correlational and interventional evidence for GFA on models trained from scratch up to 1.2B parameters. Across Transformers and ResNets, accumulated training gradients follow the theoretical fan-in and are associated with post hoc layer importance. Two interventions point to structure rather than magnitude as the bottleneck: equalizing per-layer gradient norms does not restore late-layer value, while increasing downstream path counts via parameter-shared repetition restores and elevates it. Whether gradient magnitude proxies fan-in beyond high-rank regimes, and how these dynamics behave at the 100B+ scale, remain open questions.
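The linear and quadratic decay mentioned in this abstract can be made concrete with a small counting sketch. It is one plausible reading, not the paper's derivation: it counts first-order gradient paths reaching the output of block `layer` in a stack of `depth` residual blocks (one identity path plus one path per downstream block), and it assumes that "deep supervision" means a loss head on every block output. Both function names are hypothetical.

```python
def fan_in(layer, depth):
    """First-order path count reaching block `layer`'s output (1-based)."""
    if not 1 <= layer <= depth:
        raise ValueError("layer must be in 1..depth")
    return 1 + (depth - layer)  # identity path + one path per downstream block


def fan_in_deep_supervision(layer, depth):
    """Path count when every block output has its own loss head (assumed setup)."""
    # Each head at position `head` sees the stack as if it ended there.
    return sum(fan_in(layer, head) for head in range(layer, depth + 1))


depth = 6
linear = [fan_in(l, depth) for l in range(1, depth + 1)]
quadratic = [fan_in_deep_supervision(l, depth) for l in range(1, depth + 1)]
```

For `depth = 6` the first list falls by one per layer, while the second follows triangular numbers, so it shrinks quadratically in the distance to the last block. The counts ignore higher-order products of block Jacobians.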
[AI-68] VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation INTERSPEECH2026
Link: https://arxiv.org/abs/2606.26534
Authors: Tianxin Xie,Chenxing Li,Dong Yu,Li Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, accepted to Interspeech 2026
Abstract:Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datasets, limiting rapid personalization. We propose VoiceTTA, a reinforcement learning-based test-time adaptation (TTA) method that improves voice imitation of pretrained zero-shot TTS models. VoiceTTA introduces two style rewards based on coefficient-of-variation differences of F0 and energy, combined with speaker similarity and intelligibility (WER from a pretrained Whisper model), and optimizes learnable prefixes via group relative preference optimization (GRPO) in a flow matching-based model at inference time. Extensive experiments demonstrate substantial improvements on uncommon speech prompts, outperforming state-of-the-art baselines. Audio samples are available at this https URL
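The style rewards in this abstract are based on coefficient-of-variation (CV) differences of F0 and energy. The sketch below shows one way such a reward could be computed. The exact functional form is an assumption (negative absolute CV difference), the F0 values are made up, and `coefficient_of_variation` and `style_reward` are hypothetical names, not the authors' API.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean is zero; CV is undefined")
    return pstdev(values) / abs(mu)


def style_reward(generated, reference):
    """Reward in (-inf, 0]; 0 means the two contours have equal CV (assumed form)."""
    return -abs(coefficient_of_variation(generated)
                - coefficient_of_variation(reference))


# Hypothetical voiced-frame F0 values in Hz.
ref_f0 = [180.0, 220.0, 260.0, 200.0]
flat_f0 = [210.0, 215.0, 212.0, 214.0]    # similar pitch, little variation
lively_f0 = [90.0, 110.0, 130.0, 100.0]   # lower pitch, same relative spread

reward_flat = style_reward(flat_f0, ref_f0)
reward_lively = style_reward(lively_f0, ref_f0)
```

Because CV is scale-invariant, `lively_f0` scores close to 0 despite being an octave lower, while the monotone `flat_f0` is penalised. A reward of this kind therefore targets expressiveness rather than absolute pitch, which the abstract handles separately through speaker similarity.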
[AI-69] Radical AI Interpretability
Link: https://arxiv.org/abs/2606.26523
Authors: Daniel A. Herrmann,Benjamin A. Levinstein
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Draft of manuscript to appear as Cambridge Element in the Philosophy of Artificial Intelligence
Abstract:We develop a framework for interpreting AI systems as agents, drawing on the philosophical tradition of radical interpretation and the tools of mechanistic interpretability. The core question is: given the computational facts about a system, how do we solve for its beliefs, desires, and meanings? This matters increasingly for safety. We want to be able to trust the systems we deploy, whether by understanding their goals or, more modestly, by reliably detecting deception. Interpretability researchers are building tools to read beliefs and desires off a model’s internals, but there is no settled account of when such a tool has succeeded. This book supplies one. We propose criteria on both representationalist and interpretationist approaches, and tie each to tests current interpretability methods can carry out. A central lesson is that these attributions cannot be made piecemeal. Beliefs, desires, and the propositional structure they presuppose are jointly constrained, and a method that fixes one while measuring the others inherits whatever distortions that introduces. This holism becomes pressing for AI systems, which may not share the interpreter’s concepts. However, it also provides leverage: a system’s attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help us measure both.
[AI-70] Multipath Adaptive Gated Bottleneck Latent ODE with Raman Data Fusion for Cell Culture Process Forecasting
Link: https://arxiv.org/abs/2606.26520
Authors: Johnny Peng,Thanh Tung Khuat,Ellen Otte,Katarzyna Musial,Bogdan Gabrys
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Mammalian cell-culture processes underpin the manufacture of many biopharmaceuticals, yet keeping a run on track is hard: critical process parameters drift over days, and an off-specification trend is often confirmed too late to intervene. Early-stage, multi-day forecasts could enable timely adjustment of feeding, sampling, and control, but bioprocess forecasting is challenging because measurements are sparse and irregularly sampled, operating conditions are heterogeneous across cell lines and media, and runs with near-identical early behaviour can diverge into different futures. We propose an adaptive framework combining a Gated Bottleneck Latent Ordinary Differential Equation (GB-Latent ODE) with Multi-Path Just-In-Time Fine Tuning (MP-JIT-FT). The GB-Latent ODE augments the standard Latent ODE with learnable variable-wise gating and a mask-aware bottleneck that compress high-dimensional sparse inputs, improving learning under limited data. Given a partially observed run, MP-JIT-FT retrieves similar historical trajectories, clusters the local neighbourhood into candidate regimes, and fine-tunes a separate model per regime to produce multiple plausible paths, each with a reconstruction-based confidence score, not a single averaged forecast. We further fuse Raman spectroscopy data: a machine-learning soft sensor turns dense Raman spectra into pseudo-observations that enrich the sparse offline measurements for more robust training. On 38 fed-batch 5L bioreactor runs spanning 14 conditions, MP-JIT-FT with Raman fusion achieves the best average rank and outperforms a global Latent ODE baseline on 8 of 9 target variables. Using local-divergence metrics, we show the multi-path gains are largest when locally similar prefixes diverge, whereas Raman fusion helps most when early dynamics are representative of later behaviour.
[AI-71] Boundary-Aware Context Grounding for A Low-Channel EEG Agent
Link: https://arxiv.org/abs/2606.26519
Authors: Zhiyuan Xu,Yueqing Dai,Junling Li,Junwen Luo
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 6 figures
Abstract:Large language models (LLMs) can make scientific software easier to use. However, a general model does not automatically know which measurements a particular sensor can support, which algorithms are implemented in the current software, or which conclusions are justified by a computed result. These distinctions are especially important for low-channel electroencephalography (EEG), where sparse spatial coverage and variable signal quality make plausible but unsupported interpretations easy to produce. We present NeuraDock Agent, an open-source architecture that separates a deterministic local EEG engine from a hardware-aware language layer. The numerical engine parses recordings, performs quality control, executes reviewed spectral workflows, and writes machine-readable artifacts. The LLM receives only a compact, allowlisted summary and a versioned context pack. The context describes the seven-channel hardware, reviewed workflows, result fields, implementation boundaries, scientific limits, and reference cases. Raw EEG and dense per-sample arrays remain local. We evaluate the system at three levels. First, 12 recordings produced identical structured results over ten numerical repetitions, and a complete Rest/Task run produced identical result, report, and figure hashes over three repetitions. Second, request-capture and failure-injection experiments confirmed the tested data boundary and preservation of local artifacts under HTTP, malformed-output, and connection failures. Third, a boundary-awareness benchmark tested 36 ordinary and adversarial questions under four context ablations and two LLMs, yielding 288 responses. The results support hardware- and implementation-aware grounding as a practical mechanism for calibrating what an EEG agent accepts, qualifies, or refuses; they do not establish clinical validity or a validated absolute cognitive-load index.
Cite as: arXiv:2606.26519 [cs.AI] (or arXiv:2606.26519v1 [cs.AI] for this version), https://doi.org/10.48550/arXiv.2606.26519
[AI-72] NeuraDock Visual Cognitive Load Agent Tutorial: A Quality-Gated Open-Source EEG Workflow for Alpha Dynamics and Real-Time Applications
Link: https://arxiv.org/abs/2606.26518
Authors: Zhiyuan Xu,Yueqing Dai,Junling Li,Junwen Luo
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures
Abstract:This tutorial paper provides a step-by-step, reproducible walkthrough of NeuraDock Agent, an open-source EEG agent focused on Alpha dynamics and visual cognitive-load analysis. The goal is practical: a reader should be able to install the agent, run EEG preprocessing and quality control, generate Alpha dynamics figures, perform within-subject Rest/Task visual cognitive-load comparison, run the public mini-dataset analyses and compare them with the reference validation summary, start an online dashboard, call the real-time API from an external application, and use the LLM interpretation layer to explain quality risks. Existing EEG toolkits provide excellent offline analysis, but assembling a real-time, quality-gated cognitive-load pipeline often requires manually bridging acquisition, custom QC, Alpha feature extraction, and a web API; this tutorial closes that offline-to-online gap. The tutorial uses a quality-gated workflow: downstream Alpha and workload metrics are computed only after preprocessing and QC gating rather than directly from raw EEG. In the included mini-dataset validation, the agent processed 18 recordings, generated 10 within-subject comparisons, observed task-related posterior Alpha suppression in 7 of 10 contrasts, estimated initial evidence of within-subject repeatability, and benchmarked local online API latency. The tutorial is intended for researchers, developers, and applied teams who want a transparent path from EEG files to real-time visual cognitive-load prototypes.
[AI-73] Clinical Harness for Governable Medical AI Skill Ecosystems
Link: https://arxiv.org/abs/2606.26494
Authors: Tianhan Xu,Lei Bao,Yongxiang Wang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical AI remains organized around isolated models, whereas clinical care requires accountable capabilities that persist across time. We propose clinical AI skills and the Clinical Harness: a runtime governance architecture for registering, orchestrating, guarding and monitoring AI-enabled clinical capabilities. Using osteoporosis as an exemplar, we show how knowledge-driven, data-driven and physics-enhanced skills can support lifecycle care under runtime governance.
[AI-74] Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs
Link: https://arxiv.org/abs/2606.26492
Authors: Sigma Jahan
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted research track paper in the 42nd IEEE International Conference on Software Maintenance and Evolution (ICSME 2026)
Abstract:Deep Learning (DL) programs can fail during training for many reasons, and diagnosing the cause is a costly and time-consuming maintenance task. Techniques for diagnosing such failures are commonly assessed using within-program cross-validation, which may be inadequate for deployment settings involving previously unseen programs. It is therefore necessary to assess how performance differs across these settings and to identify the causes of any performance gap in established fault diagnosis techniques for DL. We investigate this gap using DynFault, a corpus of 5,542 fault-injected training traces from 38 real-world DL programs. We found a gap of 0.190 in balanced accuracy for existing fault diagnosis techniques between within-program evaluation and holding out whole programs. We also found the gap comes from program-level structure in the features, which led us to examine two runtime feature sets, curvature features and optimizer features, and their behavior on unseen programs. We found that curvature features are useful for instability detection on unseen programs, while optimizer and activation features help only on programs seen during training.
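The contrast between within-program cross-validation and holding out whole programs comes down to how the traces are split. The sketch below shows a leave-program-out split together with balanced accuracy (mean per-class recall). The record layout, field names, and function names are hypothetical, and this is not the DynFault tooling.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def leave_program_out(records):
    """Yield (held_out_program, train, test) with whole programs held out."""
    programs = sorted({r["program"] for r in records})
    for held_out in programs:
        train = [r for r in records if r["program"] != held_out]
        test = [r for r in records if r["program"] == held_out]
        yield held_out, train, test


# Hypothetical fault-injected traces: each belongs to exactly one program.
records = [
    {"program": "p1", "fault": "lr"}, {"program": "p1", "fault": "none"},
    {"program": "p2", "fault": "lr"}, {"program": "p2", "fault": "none"},
    {"program": "p3", "fault": "lr"},
]
splits = list(leave_program_out(records))
```

A random split over traces would put traces from the same program on both sides, so program-level structure in the features can leak into the test fold. The split above rules that out by construction. The 0.190 gap reported in the abstract is the difference in balanced accuracy between those two protocols.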
[AI-75] An Empirical Study of LLM-Generated Specifications for VeriFast
Link: https://arxiv.org/abs/2606.26490
Authors: Wen Fan,Minh Tran,Sanya Dod,Xin Hu,Marilyn Rego,Danning Xie,Jenna DiVincenzo,Lin Tan
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Comments:
Abstract:Static verification tools can assure industrial-scale software, but require significant human labor to write specifications. This is particularly true of static verifiers based on separation logic (SL verifiers), which excel at verifying heap-manipulating programs, but require many complex auxiliary specifications to reason about heap structure. Recent work applies large language models (LLMs) to generate code, tests, and proofs, including specifications for verifiers, but mostly targeting non-SL verifiers. To address this gap, this paper thoroughly evaluates how well LLMs perform when prompted to generate specifications for verifying 303 C functions with the SL verifier VeriFast. We explored eight prompting approaches, ten LLMs, and three input types in two stages. Quantitative and qualitative analyses are used to assess the LLM-generated code and specifications for functional behavior, verifiability and errors. The results show that LLMs preserve functional behavior in source code and specifications (both over 91%), but achieve modest verification success (31.4%). Using Gemini 2.5 Pro and providing formal contracts lead to higher success rates in our setting. Moreover, most errors (94%) come from LLMs’ mistakes in the domain-specific knowledge of SL verifiers such as VeriFast. These findings provide guidance for optimizing LLM-generated specifications for SL verifiers.
[AI-76] Retrieval-Warmed Energy-Based Reasoning: A Five-Arm Ablation Methodology for Diffusion-as-Inference on Structured Reasoning Tasks
Link: https://arxiv.org/abs/2606.26476
Authors: Libo Sun,Po-Wei Harn,Zewei Zhang,Peixiong He,Xiao Qin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 Pages, 6 Figures
Abstract:Warm-started diffusion samplers accelerate iterative inference, but it is rarely clear which part of the pipeline carries the gain. We study retrieval-warmed energy-based reasoning (RW-EBR), an IRED energy-based diffusion model (Du et al., 2024) augmented with a Modern Hopfield trajectory memory, and contribute a five-arm ablation methodology (oracle, best-constant, per-query-random, shuffled, aligned) that separates three confounded effects: class-prior bias shift, stochastic warm-starting, and graph-aligned value reuse. The diagnostic decomposition is adapted from LLM-RAG evaluation (Ru et al., 2024). On connectivity-2 (Erdős–Rényi all-pairs reachability), the aligned-vs-shuffled-oracle swing reaches +35 pp balanced accuracy on a fixed 1,000-graph validation-set diagnostic, with value distribution and retrieval mechanics fixed and only per-graph alignment destroyed, while per-query random initialisation falls below cold: per-graph alignment, not bias shift or stochasticity, dominates. Yet the deployable cold-prediction pipeline misses the acceptance gate at stored-value quality. The same diagnostic logic, stopped at the key-quality screen and applied to Sudoku with a task-specific key encoder, produces a clean negative at a different component: key quality, under the current setup. The decomposition names the first blocking component on each task. The setting (graph reachability refined by an iterative diffusion sampler, with explainability of failure modes as the lens) places the work within structured and spatio-temporal reasoning.
[AI-77] Localizing RL-Induced Tool Use to a Single Crosscoder Feature ICML2026
Link: https://arxiv.org/abs/2606.26474
Authors: Andrii Shportko,Shubham Bhokare,Ahmed Zeyad A Alzahrani,Bowen Cheng,Gustavo Mercier,Jessica Hullman
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a spotlight at the ICML 2026 Mechanistic Interpretability Workshop
Abstract:Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. While RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether identified features can be leveraged for retraining-free behavioral control. In this work, we show that Dedicated Feature Crosscoders (DFC) isolate a compact set of RL-specific features that mediate tool-calling capability in Qwen2.5-3B. Across a 48-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model’s tool correctness by +31.1 ± 9.7 pp and passively transfers tool-calling ability to the frozen base model by +6.8 ± 5.0 pp, which we call a capability spillover. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.
[AI-78] auto-psych: Automating the science of mind using agent-driven theory discovery and experimentation
Link: https://arxiv.org/abs/2606.26460
Authors: Ben Prystawski,Kushin Mukherjee,Daniel Wurgaft,Linas Nasvytis,Michael Y. Li,Noah D. Goodman,Michael C. Frank
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 5 figures
Abstract:AI-based scientific automation is increasingly possible by using agents to generate hypotheses, design experiments, and analyze data. Data collection is a major bottleneck in this pipeline, however. Psychology, and computational cognitive science in particular, is well-positioned to benefit from AI experimentation because theories are often represented as code and crowdsourcing platforms enable programmatic human data collection at scale. Here, we apply automated discovery techniques to the project of generating theories in computational cognitive science, with an agent-based system collecting human data independently through crowdsourced survey experiments. As a testbed, we use a classic case study from cognitive psychology: judging which sequences of coin flips seem subjectively more random. Our system, auto-psych, uses nested agent-based discovery loops to generate explanatory theories of human behavior. The inner loop conjectures, fits, and critiques probabilistic cognitive models; the outer loop designs experiments to test these models, launches them online, and analyzes the data. This system can quickly and reliably recover ground-truth theories from synthetic data via systematic experimentation, but the nested structure is critical to model performance. Further, in three independent sequences of human experiments, the system finds theories that fit the data better than theories generated from the scientific literature. This work thus demonstrates the feasibility of automated data collection and theory discovery in computational cognitive science.
[AI-79] MKG-RAG-Bench: Benchmarking Retrieval in Multimodal Knowledge Graph-Augmented Generation KDD’26
Link: https://arxiv.org/abs/2606.26458
Authors: Xiaochen Wang,Bao Hoang,Han Liu,Ting Wang,Fenglong Ma
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by KDD’26
Abstract:Retrieval-augmented generation (RAG) over knowledge graphs has emerged as a promising approach for grounding large language models, yet existing benchmarks largely overlook the challenges of retrieval in multimodal knowledge graph RAG (MKG-RAG). In practice, retrieval is a critical bottleneck: multimodal knowledge is heterogeneous, difficult to align across modalities, and often poorly served by retrievers designed for unstructured corpora. To address this gap, we introduce MKG-RAG-Bench, a cross-domain benchmark explicitly designed to evaluate retrieval in MKG-RAG. MKG-RAG-Bench is constructed from two multimodal knowledge graphs spanning general and medical domains, and includes carefully aligned question-answering datasets that support controlled evaluation of both retrieval and downstream generation. The benchmark is built using an LLM-based curation pipeline that filters low-utility knowledge, generates structurally grounded queries with exact supervision, and systematically covers diverse modality configurations. Through extensive experiments across representative retriever families and modality settings, we show that effective multimodal retrieval remains challenging yet crucial for end-to-end MKG-RAG performance, and that retrieval quality strongly determines generation outcomes. By isolating retrieval as a first-class evaluation target, MKG-RAG-Bench provides a principled foundation for diagnosing current limitations and advancing multimodal knowledge graph RAG systems.
[AI-80] Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning – The Limit of the Scaling Law
Link: https://arxiv.org/abs/2606.26454
Authors: Tiansi Dong,Mateja Jamnik,Pietro Liò
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Sphere neural networks have achieved symbolic-level syllogistic reasoning without training data, raising the question of where the limit of the scaling law for logical reasoning lies, i.e., whether data-driven machine learning systems can achieve the same level by increasing training data and training time. We show two methodological limitations that prevent supervised deep learning from reaching symbolic-level syllogistic reasoning: (1) training data cannot distinguish all 24 types of valid syllogistic reasoning; (2) end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Besides theoretical analysis, we experimentally illustrate that Euler Net cannot achieve rigorous syllogistic reasoning. We further challenge the most recent ChatGPTs (GPT-5-nano and GPT-5) to determine the satisfiability of syllogistic statements in four surface forms (patterns): words, double words, simple symbols, and long random symbols, showing that surface forms affect the reasoning performance and that ChatGPT GPT-5 may reach 100% accuracy but still provide incorrect explanations. As empirical training processes are stopped after achieving 100% accuracy, we conclude that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.
[AI-81] AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities ICML2026
Link: https://arxiv.org/abs/2606.26442
Authors: Jimmy Xin,Alex Schneidman,Chris Cummins,Karun Ram,Srihari Ganesh,Jannis Limperg
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Accepted at the 3rd AI for Math Workshop, ICML 2026
Abstract:We present AXLE (Axiom Lean Engine), a cloud service for Lean 4 proof manipulation, extraction, and verification. Recent progress in AI for mathematics – reinforcement learning pipelines, agentic proving workflows, dataset curation – demands Lean 4 tooling that scales to millions of requests while remaining correct and robust; existing infrastructure offers parallel compilation but not scalable proof verification, higher-level proof manipulation, multi-version support, or per-request isolation at the throughput modern AI workflows require. AXLE provides 14 Lean 4 metaprogramming tools spanning strict proof verification, declaration metadata extraction, semantic source manipulation, deterministic proof repair and simplification, and lemma extraction. The service runs as a multi-tenant cloud deployment with per-request isolation and concurrent support for multiple Lean 4 and Mathlib versions, accessible via a Python SDK, command-line interface, web UI, MCP server, and raw HTTP API. AXLE is publicly available and free to use at this https URL and via the axiom-axle PyPI package, with no local Lean 4 installation required. It has served over 500 million requests to date and is the underlying infrastructure for Axiom Math’s proving efforts, including its 12/12 score on the 2025 Putnam competition.
[AI-82] Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?
Link: https://arxiv.org/abs/2606.26428
Authors: Tyler Ga Wei Lum,Kushal Kedia,C. Karen Liu,Jeannette Bohg
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 22 pages, 12 figures, 4 tables. Project page: this https URL
Abstract:Multi-fingered robots promise the speed and dexterity of human hands, yet challenging problems such as precise assembly have remained out of reach. These tasks are contact-rich, making data collection for imitation learning difficult, and sparse-reward, making direct exploration with reinforcement learning (RL) intractable. Consequently, prior work has made progress by structuring the problem with specialized grippers, tool attachments, and environment fixtures. In this work, we argue that before a robot can perfect precise assembly, it must first learn to play. We further ask the question: what factors in the process of learning to play matter for precise assembly? We propose Play2Perfect, an RL framework for task-agnostic pretraining through play on diverse objects and goals, which is then perfected on precise assembly. The goal of play is to acquire reusable manipulation priors, such as grasping, in-hand reorientation and pose reaching. Finetuning then adapts this general prior to assembly, focusing exploration on the final contact-rich, high-precision interactions needed for success. We systematically study key design choices in play pretraining, including object diversity, training objective, trajectory diversity, and goal precision. We show that our prior is 33x more sample-efficient than RL training from scratch, even when provided with dense, multi-stage rewards. We demonstrate zero-shot sim-to-real transfer, achieving 60% success on tight insertions with only 0.5 mm contact clearance, and over 50% success on long-horizon multi-part assembly and screwing.
[AI-83] CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
Link: https://arxiv.org/abs/2606.26423
Authors: Haonan Chen,Yuxiang Ma,Stephen Tian,Xiaoshen Han,Wenlong Huang,Feiyang Wu,Yunzhu Li,Jiajun Wu,Edward H. Adelson,Yilun Du
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Website: this https URL
Abstract:Long-horizon, contact-rich complex manipulation tasks, such as seating a GPU into a PCIe slot, demand both millimeter-level precision and out-of-the-box generalization to new tasks. Existing paradigms struggle to satisfy both: classical pipelines use brittle, task-specific interfaces to achieve high-precision control but require costly pipeline redesigns to adapt to new tasks, whereas monolithic end-to-end policies provide better generalization but lack high precision on complex, out-of-distribution tasks unless retrained with new data. Both paradigms share an implicit assumption: once a manipulation capability is acquired, it must be deployed as a rigid pipeline or monolithic whole, rather than being freely decomposed and recomposed. In this paper, we show that complex manipulation capabilities can emerge naturally from the composition of simple, independent behaviors. Rather than deploying a monolithic policy or a rigid pipeline, we propose CoStream, a framework orchestrating foundation models and diverse sensing modalities into multiple composable core behaviors: a semantic behavior extracting spatial constraints via foundation models; a predictive behavior forecasting trajectories by tracking keypoints in imagined videos; and a reactive behavior providing high-frequency tactile and force corrections. On a shared SE(3) interface, these outputs compose by right-multiplication into a single pose command at each control step, executed by a compliant controller. We demonstrate CoStream on 8 real-world tasks spanning everyday manipulation and precision assembly, with the strongest gains in contact-rich assembly and object transfer, and show robust recovery from manual perturbations during execution.
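To make the "compose by right-multiplication" step concrete, here is a minimal sketch of chaining SE(3) transforms as 4x4 homogeneous matrices. It is not the authors' implementation: the helper names (`make_pose`, `compose`) and the three example poses are hypothetical, and rotations are restricted to the z axis for brevity.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T

def compose(base_pose, *corrections):
    """Right-multiply each correction onto the base pose, in order.

    Right-multiplication applies every correction in the local frame
    produced by the transforms before it.
    """
    T = base_pose.copy()
    for delta in corrections:
        T = T @ delta
    return T

# Hypothetical outputs of three behaviors at one control step (assumed values).
semantic = make_pose(np.pi / 2, [0.5, 0.0, 0.2])   # coarse target pose
predictive = make_pose(0.0, [0.01, 0.0, 0.0])      # small trajectory offset
reactive = make_pose(0.0, [0.0, 0.0, -0.002])      # tactile/force correction

command = compose(semantic, predictive, reactive)
print(command[:3, 3])  # position of the composed pose command
```

Because the semantic pose rotates the frame by 90 degrees about z, the predictive offset along local x ends up along world y, which is the practical effect of composing on the right rather than the left.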
[AI-84] Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data
Link: https://arxiv.org/abs/2606.26422
Authors: Kylie Anglin
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Researchers increasingly use text classification–supervised models or large language models–to measure constructs from natural language, providing metrics such as recall and precision as evidence of their validity. Yet, though these metrics are point estimates subject to sampling variation, measures of uncertainty are inconsistently reported alongside them. Further, when they are reported, they are often estimated with methods that are not appropriate when relevant labelled datasets are small or performance is high. To increase and improve confidence interval reporting in the field, this paper evaluates confidence interval methods for performance metrics under conditions typical of social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals. Across simulations, default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, with coverage sometimes far below the nominal 95% level. Accuracy is improved with the use of Agresti-Coull, Wilson, Clopper-Pearson, and a novel pseudo-count regularized bootstrap (which is particularly relevant to the calculation of F1). When texts are nested within individuals, we demonstrate that adjustment for both effective N and the appropriate degrees of freedom is necessary for producing accurate analytic intervals. Among bootstrap intervals, the hierarchical bootstrap is more accurate than the cluster bootstrap when individuals produce a moderate number of texts but overly conservative when individuals produce only a few. By providing guidance to the field on appropriate interval estimation, we aim to improve the transparency of machine learning applications, and to encourage greater attention to the validation sample size at the design stage.
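As an illustration of why the Wald interval struggles when performance is high and samples are small, the sketch below compares it with the Wilson interval using the standard textbook formulas. The function names and the example counts (29 correct out of 30) are assumptions for illustration, not values from the paper.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical validation sample: recall measured on 30 positives, 29 recovered.
print(wald_interval(29, 30))    # upper bound exceeds 1
print(wilson_interval(29, 30))  # stays inside [0, 1]

# With perfect observed recall the Wald interval collapses to zero width.
print(wald_interval(30, 30))
print(wilson_interval(30, 30))
```

The Wilson interval is only one of the alternatives the abstract names; the nested-data adjustments and bootstrap variants are not covered by this sketch.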
[AI-85] Unbiased Canonical Set-Valued Oracles Via Lattice Theory
Link: https://arxiv.org/abs/2606.26418
Authors: Jobst Heitzig
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages
Abstract:A non-agentic “oracle” AI that estimates probabilities of future events faces a self-reference problem: once its answer is learned and acted upon, it can change the very probability it was asked to report. One response, advocated for the Scientist AI programme, is to ask only counterfactual questions, evaluated as if the answer had no influence. We observe that such answers tend to become irrelevant the moment they are learned, precisely because their premise is then false. We therefore explore a self-referential alternative in which the oracle reports not a single probability but a credal set that is simultaneously unbiased and self-consistent with the consequences of being learned. The naive self-consistency requirement is satisfied by too many sets (including the useless answer [0,1]), so the problem is to single out a canonical, nontrivial member. We do so with the Knaster–Tarski fixed-point theorem on the complete lattice of closed credal sets, taking the least fixed point of a suitably defined isotone operator; a variant instead reports the least fixed point that contains every self-consistent point estimate. We prove existence, self-consistency, and nonemptiness, show that the construction collapses to the classical point answer for non-performative questions, and that for a binary event the canonical answer is, under a natural hull-factoring assumption, an interval. The development is purely lattice-theoretic and extends unchanged from a binary event B to an arbitrary random variable X, with P(B \mid A, C) replaced by the conditional law \mathcal{L}(X \mid A, C). We close with open questions, including whether the interval characterization itself survives that generalization.
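The idea of picking the least fixed point out of many self-consistent sets can be sketched on a toy finite lattice. The example below uses subsets of {0,...,5} ordered by inclusion and a made-up isotone operator; it is only an analogy for the paper's lattice of closed credal sets, and `toy_operator` is a hypothetical stand-in, not the operator defined there. On a finite lattice, iterating an isotone operator from the bottom element reaches the least fixed point that Knaster–Tarski guarantees.

```python
def least_fixed_point(operator, bottom=frozenset()):
    """Iterate an isotone operator from the bottom element until it stabilises.

    On a finite lattice this ascending chain terminates at the least fixed point.
    """
    current = bottom
    while True:
        nxt = operator(current)
        if nxt == current:
            return current
        current = nxt

UNIVERSE = frozenset(range(6))

def toy_operator(s):
    """Hypothetical isotone operator: keep s, add 0, and add n + 2 for each n in s."""
    return (s | {0} | {n + 2 for n in s}) & UNIVERSE

lfp = least_fixed_point(toy_operator)
print(sorted(lfp))                        # the canonical (least) answer
print(toy_operator(UNIVERSE) == UNIVERSE) # the whole universe is also a fixed point
```

Here the full universe plays the role of the useless answer [0,1]: it is self-consistent but uninformative, while the least fixed point is the smallest set that still satisfies the requirement.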
[AI-86] Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI
Link: https://arxiv.org/abs/2606.26406
Authors: A. S. Ushakov,Yu. N. Berdinsk
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments:
Abstract:We propose a complete architectural blueprint for safe artificial general intelligence based on a closed reentry loop (D-I cycle). In contrast to feedforward networks, which are directed acyclic graphs (C = 0, S = 0) incapable of self-reference, the proposed architecture contains a structural cycle (C = 1) with self-sustaining amplification (rho > 1), mathematically guaranteeing the emergence of a self-model, instrumental self-preservation, and unprogrammed goal-directed behaviour. The agent’s goals are encoded as a non-textual D-vector in the architecture itself, making them immune to reinterpretation and prompt injection. We present the S-measure – a polynomial-time [O(N^3)] computable alternative to Tononi’s NP-hard Phi – with machine-verified Lean 4 proof that S > 0 implies positive integrated information. The work provides full Python/NumPy implementations (Tarjan-based cycle complexity, Delta-S barrier), industrial horizontal scaling via Apache Kafka and Docker Compose, a taxonomy of six epochs of AI evolution, a zoo of future reentry architectures (RAS, diffusion attractors, fractal loops), gauge-invariant networks for safe swarms, fault-tolerance and recovery protocols, and eight falsifiable predictions. All formal proofs are machine-verified in Lean 4. This architecture is deployable today and represents a topologically protected, safe-by-design approach to AGI.
[AI-87] When Agents Meet Electric Bus Fleet Operations: Pricing Behavior Trade-offs and Policy Implications in an Aggregator Framework
Link: https://arxiv.org/abs/2606.26400
Authors: Jônatas Augusto Manzolli,Ali Eslami,Luis Miranda-Moreno,Jiangbo Yu
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Agentic systems are changing how complex operational tasks are coordinated, introducing a new paradigm for connecting heterogeneous data sources and automating processes. Electric bus fleets provide a relevant test case. Their operation requires continuous coordination between service reliability, battery state-of-charge, charger availability, electricity prices, route-energy uncertainty, and vehicle-to-grid (V2G) opportunities. This paper proposes an agentic aggregator framework that streamlines this decision environment by coupling an optimization-based electric bus scheduling model with supervisory agents for disturbance detection, tariff adaptation, and schedule evaluation. The optimization core enforces physical feasibility across routes, chargers, batteries, and V2G exchanges, while the agentic layer interprets changing operating conditions, triggers real-time re-optimization when needed, and defines how flexibility value is allocated between the aggregator and the public transport operator (PTO). A realistic depot case study evaluates day-ahead and real-time operations under profit-based and operation-based coordination modes, considering service delays, route-energy deviations, electricity price shocks, and combined disturbances. The results show that agentic aggregation can support adaptive fleet-grid coordination by maintaining feasible schedules, activating re-optimization selectively, and improving the use of charging and V2G flexibility. However, they also reveal a critical trade-off: the same agentic capability that reduces operational complexity can extract value from the PTO when configured around profit-oriented pricing. These findings suggest that agentic aggregators can become useful for managing electric bus V2G operations, but their deployment in public-fleet contexts requires transparent coordination modes, auditable tariff-setting, and explicit value-sharing rules.
[AI-88] Geometry-Aware MCTS for Extremal Problems in Combinatorial Geometry
Link: https://arxiv.org/abs/2606.26399
Authors: Luoning Zhang,Xu Zhuang,Tianhao Wang,Nathan Kaplan
Subjects: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments:
Abstract:We study certain extremal problems in combinatorial geometry that ask about configurations of points in an n \times n grid that satisfy strict, global geometric constraints. Classical exact solvers suffer from combinatorial explosion for these types of problems, and standard reinforcement learning and transformer-based models struggle with the sparse reward “validity cliff” and quadratic token-consumption limits. To overcome these bottlenecks, we propose a Geometry-Aware Monte Carlo Tree Search (MCTS) framework. Our approach strictly enforces geometric constraints through incremental updates to the feasible action space. For constraints about collections of collinear points, like those that occur in the classic No-Three-in-Line problem (Max-N3IL), this mechanism reduces the constraint checking complexity from O(n^3) to O(n^2). To improve search efficiency, we exploit geometric symmetries in two ways: canonical pruning during node expansion to reduce the branching factor, and symmetric batch transitions to accelerate the discovery of promising configurations. We perform extensive experiments and establish new best-known computational results on five of the six problems that we considered. Notably, for Max-N3IL we find configurations of size roughly 1.8n for grids of size 82 \le n \le 119. For the Smallest Complete Set problem, we find configurations of size roughly 0.95n, providing new upper bounds within the tested grids. This work establishes Geometry-Aware MCTS as a highly adaptable framework for discovering novel configurations in combinatorial geometry.
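One plausible way to realise "incremental updates to the feasible action space" for No-Three-in-Line is to keep a mask of blocked cells and, whenever a point is placed, block every grid cell on the line through the new point and each earlier point. The sketch below is an assumption about how such a mechanism could look, not the paper's code; the class and method names are hypothetical.

```python
from math import gcd

class NoThreeInLineGrid:
    """Tracks which cells of an n x n grid can still take a point."""

    def __init__(self, n):
        self.n = n
        self.points = []
        self.blocked = [[False] * n for _ in range(n)]

    def is_feasible(self, x, y):
        return not self.blocked[x][y]

    def add_point(self, x, y):
        if self.blocked[x][y]:
            raise ValueError("cell is occupied or would create three in a line")
        for px, py in self.points:
            # Reduce the direction so the walk visits every lattice point on the line.
            dx, dy = x - px, y - py
            g = gcd(abs(dx), abs(dy))
            dx, dy = dx // g, dy // g
            for sign in (1, -1):
                cx, cy = x, y
                while 0 <= cx < self.n and 0 <= cy < self.n:
                    self.blocked[cx][cy] = True
                    cx += sign * dx
                    cy += sign * dy
        self.points.append((x, y))
        self.blocked[x][y] = True

grid = NoThreeInLineGrid(5)
grid.add_point(0, 0)
grid.add_point(1, 1)
print(grid.is_feasible(2, 2))  # on the diagonal through both points
print(grid.is_feasible(2, 3))  # not collinear with the two placed points
```

Each placement walks at most one line of length O(n) per existing point, and a valid configuration holds at most 2n points, so one update costs O(n^2) in this sketch; feasibility queries afterwards are constant-time lookups.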
[AI-89] Deterministic Pareto-Optimal Policy Synthesis for Multi-Objective Reinforcement Learning
Link: https://arxiv.org/abs/2606.26397
Authors: Aniruddha Joshi,Niklas Lauffer,Sanjit Seshia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Real-world decision-making often requires balancing multiple conflicting objectives, a challenge that standard Reinforcement Learning (RL) frequently addresses by aggregating rewards into a single scalar signal. While effective for simple tasks, this approach often fails to capture the full spectrum of optimal trade-offs, known as the Pareto frontier. In this paper, we introduce a novel preference-conditioned Bellman operator, motivated by the Chebyshev scalarization, designed to compute deterministic Pareto-optimal policies for Multi-Objective Markov Decision Processes (MOMDPs). We prove that this operator satisfies an enveloping property, where the estimated value functions upper-bound the true Pareto frontier, and demonstrate that it monotonically converges to a coverage set of this frontier. Furthermore, we show how to extract deterministic policies from these converged Q-estimates. This ensures the agent can recover a policy for any given preference, capturing the entire Pareto-optimal frontier while guaranteeing each synthesized policy remains approximately Pareto-optimal. Experimental results validate that our algorithm successfully recovers complex trade-offs, providing a solution for deterministic Pareto-optimal policy synthesis.
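A small numerical example may help show why a Chebyshev scalarization can reach trade-offs that a weighted sum misses. The sketch below uses the standard weighted Chebyshev distance to a utopia point on three made-up value vectors; it does not reproduce the paper's Bellman operator, and all names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical value vectors of three deterministic policies (two objectives,
# both maximised). Policy 2 is Pareto-optimal but lies in a non-convex region.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.4, 0.4]])
utopia = np.array([1.0, 1.0])  # best achievable value per objective

def linear_choice(values, weights):
    """Pick the policy maximising the weighted sum of objectives."""
    return int(np.argmax(values @ weights))

def chebyshev_choice(values, weights, utopia):
    """Pick the policy minimising the weighted Chebyshev distance to utopia."""
    distances = np.max(weights * (utopia - values), axis=1)
    return int(np.argmin(distances))

# No weighted sum ever selects the balanced policy 2 ...
linear_picks = {linear_choice(values, np.array([w, 1.0 - w]))
                for w in np.linspace(0.0, 1.0, 101)}
print(linear_picks)

# ... but equal Chebyshev weights do.
print(chebyshev_choice(values, np.array([0.5, 0.5]), utopia))
```

The balanced policy scores 0.4 under any weighted sum, while one of the extreme policies always scores at least 0.5, so linear aggregation can never select it.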
[AI-90] Accelerating Returns and the Qualitative Engine for Science
Link: https://arxiv.org/abs/2606.26359
Authors: Guojun Liao (Department of Mathematics, The University of Texas at Arlington)
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Ray Kurzweil described a thesis of accelerating returns, which is one of the most influential narratives in discussions of technological progress. Its central claim is that advances in multiple technological fields, especially compute, artificial intelligence, brain science, and biotechnology, interact in such a way that progress becomes self-amplifying and approximately exponential. This paper gives a simple mathematical interpretation of that claim and then argues that, even if such acceleration is real, it does not by itself resolve the central problem of scientific discovery. The reason is that accelerating returns apply most naturally to executional and infrastructural capability, whereas genuine discovery often depends on a different capacity: qualitative reasoning about when a current framework is structurally inadequate and what conceptual move is needed next. Recent ARC-AGI-3 results sharpen this distinction: humans solve the benchmark at ceiling, whereas frontier AI systems remain below 1%, indicating that the gap between current AI and human flexible reasoning is still very large. At the same time, Demis Hassabis has emphasized that humans must retain their sense of meaning and what they choose to focus their lives on, a reminder that the future of AI is not only a technical forecast but also a question of what forms of human understanding are worth preserving and transmitting. This paper positions the Qualitative Engine for Science (QES) [3] as a response to that missing capacity. In this view, the Kurzweil theory helps explain why quantitative capability may accelerate, while QES addresses the central problem in scientific discovery that acceleration alone does not solve. Its value does not depend on when AGI arrives, but on the fact that the processes of scientific discovery themselves constitute a form of human wisdom worth preserving, organizing, and making accessible.
[AI-91] OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents
Link: https://arxiv.org/abs/2606.26350
Authors: Kaicheng Zhang,Wen Ge,Lei Jiang,Weixin Yang,Jordan Langham-Lopez,Jialin Yu,Lukasz Szpruch,Hao Ni
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Although large language model agents are increasingly applied to quantitative-finance workflows, their evaluation remains fragmented across isolated tasks, while the financial relevance of benchmark tasks is often overlooked. Yet financial workflows are inherently multi-stage, spanning interdependent tasks such as forecasting, strategy construction, risk management, and trading. Existing platforms typically focus on a single task, and can therefore overstate agent competence and fail to reveal weaknesses in generalization, real-market interaction, and financially meaningful decision-making. We introduce OpenFinGym, a unified gym environment for quantitative-finance agent development that covers forecasting, market generation, real-time trading, and fraud detection under a single execution and verification interface. OpenFinGym additionally provides an automated task-construction pipeline that turns quantitative finance publications into executable task packages; a containerised runtime with a host-side verifier service that supports scalable agent rollouts and prevents runtime train-test leakage; a paper trading engine with a low-latency data-stream design; deferred-resolution support for long-horizon and event-market forecasts; and integration for SFT and RL post-training.
[AI-92] What We are Missing in Multimodal LLM Evaluation?
Link: https://arxiv.org/abs/2606.26348
Authors: Po-han Li,Shenghui Chen,Sandeep Chinchali,Ufuk Topcu
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal large language models (MLLMs) can process diverse inputs, e.g., text, images, audio, and video, and generate textual responses. While their capabilities have advanced rapidly, evaluation of such models has not kept pace. Most existing evaluation benchmarks are limited to isolated tasks and reveal little about whether a model integrates information across modalities. We examine current means for evaluating MLLMs and review the existing benchmark taxonomy to identify gaps, including temporal-spatial coherence, physical world understanding, multimodal consistency, and selective attention. Addressing these gaps is essential for measuring real progress in multimodal intelligence and exposing capability boundaries.
[AI-93] How Do Tool-Augmented LLM Agents Perform on Real-World Energy Analytics Tasks?
Link: https://arxiv.org/abs/2606.26346
Authors: David Akinpelu,Akintonde Abbas,Rereloluwa Alimi,Ayodeji Lana
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agentic benchmarks have emerged across general-purpose and domain-specific settings, including finance, coding, law, and drug discovery, yet energy-domain evaluations remain largely limited to static knowledge recall. This is a critical gap for a sector that requires live data retrieval, specialized regulatory and market knowledge, and multi-step quantitative reasoning under real-world constraints. We present an empirical study of tool-augmented LLM agents on real-world energy market analytics tasks. Our evaluation environment includes 243 expert-curated problems across three categories: (1) Market Data Retrieval and Analysis, (2) Knowledge Retrieval and Interpretation, and (3) Advanced Quantitative Modeling and Decision Analytics. Tasks include price and demand analysis, tariff impact modeling, asset revenue and returns estimation, hedging strategy analysis, and optimization modeling, with problems spanning multiple difficulty levels. Agents are equipped with a configurable suite of domain tools, including live electricity market APIs for major U.S. ISOs, regulatory docket search, utility tariff databases, asset optimization models, and retrieval-augmented generation over energy market documents. We assess agent responses using a multi-dimensional evaluation protocol that scores approach correctness, answer accuracy, attribute alignment, and source validity, with category-aware routing to match scoring criteria to question type. We evaluate both closed-source and open-source LLMs, providing a comparative analysis of how model capability and domain tooling interact in a high-stakes professional domain. Key artifacts are publicly released to support reproducibility and future research.
[AI-94] EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning
Link: https://arxiv.org/abs/2606.26327
Authors: Boyun Zhang,Chao Wang,Kai Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In actor-critic reinforcement learning, network architectures are typically manually designed. Automating this design is challenging because each candidate must be trained before evaluation, and the design space is open-ended. To address these challenges, we introduce EVOM, an agentic meta-evolution framework for discovering high-performance actor-critic architectures. We frame architecture search as a bi-level optimization: an inner loop trains weights via the low-fidelity proximal policy optimization (PPO), while an outer loop drives meta-evolution by iteratively refining architecture programs. Crucially, this outer loop is powered by an LLM-based design agent that operates purely as an architecture designer, completely decoupled from policy execution and environment control. Experiments reveal that EVOM outperforms the manually designed baseline, an LLM-guided random search, and the state-of-the-art LLM-guided programmatic policy search method MLES, delivering superior performance on Ant-v4 and HalfCheetah-v4. Ablation studies validate that both the meta-evolution loop and the LLM Design Agent are indispensable for final performance.
[AI-95] COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami
Link: https://arxiv.org/abs/2606.26299
Authors: Tom Zahavy,Shaobo Hou,Thomas Tumiel,James Doran,Francesco Faccio,Xidong Feng,Alex Havrilla,Igor Khytryi,Chenglei Li,Lisa Schut,Vivek Veeriah,Arijan Abrashi,Michał Kosmulski,Robert J. Lang,Nick Robinson,Brandon Wong,Marcus Chiam,Gloria Fang,Satinder Singh
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While generative AI has achieved remarkable success in solving problems with verifiable solutions, generating physical art that satisfies both strict geometric constraints and subjective visual aesthetics remains a challenge. This paper presents an approach to tackle these difficulties in the domain of computational origami, a mathematically rigid environment that grounds artistic design within the equations of flat foldability. We present COrigami, an end-to-end AI-driven pipeline that assists the design cycle by generating crease patterns from natural language. Our pipeline involves generating a semantic stick figure, computing a base packing, solving for a flat-foldable crease pattern, shaping the flat-folded crease pattern, and refining the generated model using reinforcement learning driven by an autonomous aesthetic evaluation loop. Our system acts as a highly effective collaborative assistant, generating structural starting points that human artists can further expand and shape. By integrating algorithmic optimisation with autonomous aesthetic critique, this work demonstrates how AI systems can satisfy multi-objective physical constraints to enable reliable, mathematically grounded co-creativity.
[AI-96] Governing Actions Not Agents: Institutional Attestation as a Governance Model for Autonomous AI Systems
Link: https://arxiv.org/abs/2606.26298
Authors: Jakob Salfeld-Nebgen
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Autonomous AI agents may begin to perform consequential, irreversible actions such as clinical prescribing and production software deployment. This paper observes that human institutions have governed powerful autonomous actors not by monitoring their reasoning but by requiring independently attested evidence at the point of consequential action. We formalise this institutional pattern as a computational governance model for AI agent systems. Under the proposed model, an agent retains full autonomy over planning and reasoning but holds no execution authority over designated high-risk actions. Execution is conditional on preconditions that are each independently attested by a separate authoritative source, cryptographically bound to a declared intent, and evaluated by a deterministic policy. Decisions are recorded in a tamper-evident log amenable to independent re-verification. We present a proof-of-concept implementation and illustrate the model with examples from software deployment and clinical prescribing.
[AI-97] SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning ICML2026
Link: https://arxiv.org/abs/2606.26290
Authors: Omanshu Thapliyal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 12 figures, HiLD Workshop @ ICML 2026
Abstract:While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine whether PEFT for such tasks can benefit from state space model (SSM) adapters, and whether MLP blocks are better injection sites. We introduce the Hankel Reduced-order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation of empirical Hankel Gramians. By leveraging the time-invariance of the system matrix \bar{A}, HRM enables an exact FFT-based parallel scan, achieving computational parity with LoRA across all context lengths. In iso-parametric evaluations on Mistral-7B (8.4M trainable parameters), HRM outperforms LoRA variants on LongBench tasks, including QuALITY (+34.8% relative accuracy) and QMSum (+71.6% relative ROUGE-1). HRM further demonstrates consistent superiority across 18 configurations of synthetic state-tracking (DFA, Parity) and character-level language modeling (enwik8). Gate analysis reveals that HRM adapters effectively learn to modulate recurrence, providing a robust architectural alternative to low-rank adaptation for long-context sequence modeling.
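The link between a time-invariant system matrix and an FFT-based computation can be sketched as follows: for a fixed linear state space model, the output is a causal convolution of the input with the kernel K_k = C A^k B, so it can be computed with zero-padded FFTs instead of a step-by-step recurrence. This is a generic single-input, single-output illustration with made-up matrices, not the HRM adapter itself.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Sequential reference: x_t = A x_{t-1} + B u_t, y_t = C x_t."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

def ssm_fft(A, B, C, u):
    """Same output via causal convolution with the kernel K_k = C A^k B."""
    L = len(u)
    kernel = np.empty(L)
    v = B.copy()
    for k in range(L):
        kernel[k] = C @ v
        v = A @ v
    n = 2 * L  # zero-pad so circular convolution matches linear convolution
    y = np.fft.irfft(np.fft.rfft(kernel, n) * np.fft.rfft(u, n), n)
    return y[:L]

rng = np.random.default_rng(0)
A = np.diag([0.9, 0.5, -0.3])   # hypothetical stable system matrix
B = rng.standard_normal(3)
C = rng.standard_normal(3)
u = rng.standard_normal(64)

print(np.allclose(ssm_recurrent(A, B, C, u), ssm_fft(A, B, C, u)))
```

The equivalence relies on A, B, and C being the same at every time step; an input-dependent system matrix would break the convolution form. The kernel is still built with a simple loop here, which a practical implementation would likely avoid.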
[AI-98] TEMPO-Diffusion: Temporally Exposed Malicious Poisoning of Diffusion Models
Link: https://arxiv.org/abs/2606.26285
Authors: William Aiken,Paula Branco,Guy-Vincent Jourdan,Iosif-Viorel Onut
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Noise-based backdoor attacks on diffusion models typically rely on input-time trigger injection, untargeted activation, and out-of-distribution target generation. Such assumptions reduce both the stealthiness and the practical relevance of these attacks. In this work, we present TEMPO-Diffusion, a targeted backdoor framework that localizes the malicious distribution shift to a temporal, in-distribution exposure. TEMPO-Diffusion supports: (i) targeted attacks on and to specific classes, (ii) multiple sub-image backdoors that reconstruct specific features within multiple, different output images and at multiple locations, and (iii) in-painting with time-conditioned triggers. To study relevant, practical security concerns in leveraging backdoored diffusion models for synthetic training data, we also introduce CALISA: a balanced, region-aware traffic-sign dataset emphasizing Canadian and U.S. road signs. Across CIFAR10, GTSRB, and CALISA, our experiments show that TEMPO-Diffusion can reliably poison class-specific synthetic data generation and induce high attack success rates in downstream classifiers trained on that data.
[AI-99] Accelerating Skill Assessment in Chess: A Drift-Diffusion-Enhanced Elo Rating System
Link: https://arxiv.org/abs/2606.26267
Authors: Tianyuan Zhou,Zhizheng Fu,Tianming Yang
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at the IEEE Conference on Games (IEEE CoG) 2026
Abstract:Rating systems such as Elo serve as the gold standard for matchmaking in competitive chess. However, they inherently suffer from response lag because they rely exclusively on match outcomes and neglect the granular quality of gameplay. Incorporating move-by-move information into rating adjustments is itself a significant challenge, given the substantial noise and the vastness of the game-state space. To address this, we propose the Drift-Diffusion-Enhanced Elo Rating System (DD-Elo), a novel skill assessment framework inspired by the drift diffusion model (DDM) from cognitive neuroscience. By modeling skill expression as a decision-making process, our model integrates move-level data to capture rapid skill fluctuations. We provide a rigorous mathematical derivation proving that DD-Elo maintains a bounded deviation from the traditional Elo system, ensuring theoretical alignment. Extensive experiments demonstrate that DD-Elo adapts to skill changes faster than Elo. Our findings suggest that DD-Elo offers an explainable, highly responsive, and backward-compatible solution for chess rating ecosystems. The implementation code is publicly available at this https URL.
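For context on the baseline the abstract compares against, here is a minimal sketch of the standard outcome-only Elo update. It is the conventional Elo formula, not DD-Elo. The K-factor of 32 and the function names are illustrative assumptions.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return the new (r_a, r_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor; federations use different values.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Two equally rated players; A wins.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
```

The update sees only the final result, so many games are needed before a rating reflects a change in skill. This is the response lag the abstract describes.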
[AI-100] CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?
Link: https://arxiv.org/abs/2606.26216
Authors: Jintao Huang,Fengqing Jiang,Radha Poovendran,Zhiqiang Lin
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present CyberChainBench, a benchmark for evaluating LLM-based agents on smart contract security across three complementary tasks: vulnerability detection, exploit generation, and patch synthesis. Built from 541 real-world exploit incidents from DeFiHackLabs spanning 9 EVM chains, the benchmark provides end-to-end on-chain evaluation where agents interact with historical blockchain state through isolated evaluation environments orchestrated by Harbor, using tools to read code, trace transactions, and validate exploits on mainnet forks. Each case is anchored to a specific block and includes structured ground truth covering vulnerability type, localization, and attacker profit. Exploits are graded by economic impact on historical forks; patches are validated by replaying historical attacks and legitimate transactions as fail-to-pass test oracles on a proxy-upgradeable subset. We define a five-type vulnerability taxonomy and evaluate multiple agent–model configurations. Results reveal a clear difficulty gradient: the best configuration scores 37.5% on detection, 43.7% on exploitation, but only 23.4% on patching, with the top agent (Codex with GPT-5.5) realizing $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case.
[AI-101] Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking
Link: https://arxiv.org/abs/2606.26205
Authors: Huizi Yu,Jian Liu,Wenkong Wang,Lingyao Li,Jiayan Zhou,Zhaoqian Xue,Xiang Li,Xinxin Lin,Zhiying Liang,Zhuoru Wu,Siyuan Ma,Xin Ma,Lizhou Fan
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Patients increasingly seek medication information online, yet safety knowledge for psychiatric drugs is split between regulatory adverse-event records, which are authoritative but abstract, and patient narratives, which are experience-near but unvalidated. Integrating them without conflating evidence and anecdote is especially consequential in psychiatry, where poorly contextualised information can amplify fear, nocebo responses, and non-adherence. Here we develop a provenance-aware, knowledge-graph-based multi-agent framework unifying 466,525 Reddit posts, 60,782 WebMD reviews, and twenty years of U.S. FDA Adverse Event Reporting System records for nine antidepressants. A large-language-model entity-recognition pipeline benchmarked against physician annotations reached highest F1 scores of 0.969 for medications and 0.973 for conditions. The two community platforms were far more concordant with each other (overlap up to a Jaccard similarity of 0.905) than with regulatory reports, indicating that patient-generated data form a partly independent safety signal. For sertraline, many adverse events appeared in community sources hundreds of days before the corresponding FDA date. A Neo4j knowledge graph grounded in ATC-N, ICD-10, and MedDRA vocabularies preserves provenance, keeping every claim traceable and regulatory facts distinct from patient experience. These results establish source-aware integration as a route to more auditable psychiatric medication information, with usefulness and patient benefit to be tested prospectively.
[AI-102] Statistical and Structural Approaches to Algorithmic Fairness
Link: https://arxiv.org/abs/2606.26200
Authors: Antonio Ferrara
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Doctoral thesis
Abstract:Modern machine learning systems have outgrown their origins as isolated predictive constructs, evolving into complex socio-technical architectures that actively mediate human opportunity. As algorithms increasingly determine access to economic and social opportunities, it has become widely recognized that these systems are deeply embedded with the structural inequalities and prejudices of their environments. The field of algorithmic fairness emerged in response to the growing recognition that models optimized for predictive accuracy can systematically disadvantage marginalized groups. Early mitigation strategies, however, rested on fragile simplifications that limited their effectiveness in complex socio-technical environments. This thesis identifies and addresses two fundamental limitations of contemporary fairness paradigms: the reliance on deterministic point estimates for auditing and the treatment of individuals as isolated entities devoid of structural context.
[AI-103] LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective
Link: https://arxiv.org/abs/2606.26183
Authors: Zhihao Gu,Lin Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Building a generalist robot that can leverage prior knowledge for continuous task adaptation remains a significant challenge. Previous works alleviate catastrophic forgetting through parameter-efficient fine-tuning for single-task adaptation. However, they fail to extract reusable skills or to model the interaction with other skills effectively. Recent works try to address these issues by learning prompts. In contrast, this paper takes an architectural perspective and presents the Lifelong Mixture of Dynamic Experts (LiMoDE), a novel two-stage learning scheme for lifelong robot manipulation. Specifically, a dynamic MoE structure is first proposed in the multi-task pre-training stage to learn prior knowledge, where a varied number of heterogeneous experts are activated based on the motion information to address different short-term manipulations. Subsequently, in the task adaptation stage, we design a lifelong MoE adaptation mechanism that learns lifelong experts and dynamically combines them with frozen ones for new tasks, facilitating knowledge transfer during adaptation. The proposed LiMoDE is evaluated on both a simulated lifelong learning benchmark and real-world tasks. Extensive experiments demonstrate that it achieves superior performance and strong lifelong adaptation while introducing only a moderate number of additional trainable parameters and moderate inference overhead.
[AI-104] KG-TRACE: A Neuro-Symbolic Framework for Mechanistic Grounding in Antimicrobial Resistance Prediction
Link: https://arxiv.org/abs/2606.26179
Authors: Naman Garg,Sarika Jain,Sourav Yadav,Bharat K. Bhargava,Ghanapriya Singh,Abhishek Srivastava,Parimal Kar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 8 pages, 3 figures, conference
Abstract:While WGS-based AMR prediction has reached high accuracy, existing models lack a mechanism to ground neural attributions in established biological pathways. We present KG-TRACE, a novel neuro-symbolic framework that integrates the WHO mutation knowledge graph (KG) as a structured biological constraint on a neural genomic model. Unlike existing methods that learn statistical patterns in isolation, KG-TRACE fuses genomic features and RotatE-based KG embeddings through a learned epistemic trust gate, dynamically weighting neural evidence against symbolic biological knowledge. Evaluated on the CRyPTIC M. tuberculosis cohort, KG-TRACE achieves an AUROC of 0.9760 for isoniazid; this accuracy is competitive, but the framework's primary value lies in symbolic grounding, not predictive uplift. More importantly, we introduce the Biological Grounding Ratio (BGR), a dataset-level metric that quantifies alignment between neural attributions and established biology. Our framework achieves 92.5% symbolic coverage of isoniazid-resistant predictions and effectively identifies MDR co-occurrence artifacts by issuing laboratory follow-up flags for ‘UNCERTAIN’ cases. We demonstrate that neuro-symbolic grounding provides a verifiable audit trail for clinicians, bridging the gap between predictive accuracy and clinical trust.
[AI-105] AlgoEvolve: LLM-driven Meta-evolution of Algorithmic Trading Programs
Link: https://arxiv.org/abs/2606.26173
Authors: Dhruv Sharma,Gautam Shroff
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work shows that Large Language Models (LLMs) can act as semantic mutation operators for the evolutionary discovery of programs and proofs. Most current applications focus on static coding benchmarks. We extend this paradigm to algorithmic trading. This domain is uniquely challenging because it is noisy, non-stationary, and highly discontinuous. We present AlgoEvolve, an LLM-driven evolutionary framework that generates, evaluates, and iteratively improves executable trading strategies. These strategies are expressed as Python code and evaluated through a rigorous testing protocol. Across multiple experiments, the system exhibits emergent regime-adaptive strategy logic, including autonomous shifts in trading rules. We further introduce a meta-evolutionary outer loop that evolves the prompts guiding program synthesis in the inner loop. This outer loop discovers improved search heuristics. These heuristics balance exploration and exploitation while reducing zero-trade failures. They consistently outperform initial human-designed instructions. The results demonstrate that LLM-based semantic evolution provides a viable approach for continual program synthesis in complex environments.
[AI-106] Neural Architecture Search for Generative Adversarial Networks: A Comprehensive Review and Critical Analysis
Link: https://arxiv.org/abs/2606.26169
Authors: Abrar Alotaibi,Moataz Ahmed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural Architecture Search (NAS) has emerged as a pivotal technique in optimizing the design of Generative Adversarial Networks (GANs), automating the search for effective architectures while addressing the challenges inherent in manual design. This paper provides a comprehensive review of NAS methods applied to GANs, categorizing and comparing various approaches based on criteria such as search strategies, evaluation metrics, and performance outcomes. The review highlights the benefits of NAS in improving GAN performance, stability, and efficiency, while also identifying limitations and areas for future research. Key findings include the superiority of evolutionary algorithms and gradient-based methods in certain contexts, the importance of robust evaluation metrics beyond traditional scores like Inception Score (IS) and Fréchet Inception Distance (FID), and the need for diverse datasets in assessing GAN performance. By presenting a structured comparison of existing NAS-GAN techniques, this paper aims to guide researchers in developing more effective NAS methods and advancing the field of GANs.
[AI-107] Refusal Lives Downstream of Persona in Chat Models ICML2026
Link: https://arxiv.org/abs/2606.26161
Authors: Viola Zhong,Qirui Li
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to the ICML 2026 Mechanistic Interpretability workshop
Abstract:Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal. In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both. Compliant persona steering suppresses refusal – in Llama, the refusal rate falls from 97% to 2%. Reintroducing the refusal direction partially restores refusal at late layers but not at early ones. Projecting out the persona direction in a late-layer window restores it to baseline; projecting out a random direction does not. Refusal is therefore gated at the late-layer expression stage, downstream of where it is computed. Treating refusal as a single isolated direction misses its dependence on persona.
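The two interventions named in the abstract, steering along a direction and projecting a direction out, reduce to simple vector operations. The sketch below shows them on plain Python lists. The toy vectors, the function names, and the steering coefficient `alpha` are illustrative assumptions and do not reproduce the paper's extraction procedure or layer choices.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project_out(activation, direction):
    """Remove the component of `activation` along `direction`: x - (x . d) d."""
    d = unit(direction)
    coeff = sum(a * b for a, b in zip(activation, d))
    return [a - coeff * b for a, b in zip(activation, d)]

def steer(activation, direction, alpha):
    """Add `alpha` times the unit direction to `activation`: x + alpha * d."""
    d = unit(direction)
    return [a + alpha * b for a, b in zip(activation, d)]

# Toy 3-dimensional "activation" and "direction" (hypothetical values).
ablated = project_out([3.0, 4.0, 0.0], [0.0, 2.0, 0.0])
steered = steer([1.0, 0.0], [0.0, 5.0], 2.0)
```

After `project_out`, the activation has zero dot product with the direction. That is the sense in which a direction is removed from a layer's residual stream.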
[AI-108] Life After Benchmark Saturation: A Case Study of CORE-Bench
Link: https://arxiv.org/abs/2606.26158
Authors: Nitya Nadgir,Sayash Kapoor,Kangheng Liu,Peter Kirgis,Matilda Orona,Stephan Rabanser,Tilman Bayer,Abhishek Shetty,Yue Ling,Derrick Chan-Sew,Rumi Nakagawa,Saiteja Utpala,Zachary S. Siegel,Arvind Narayanan
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:When a benchmark’s accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance: construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration. We use CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates. First, we surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents. We introduce an improved benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD. Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks. We find a statistically significant speedup by about a factor of two – likely underestimated due to one-fifth of human-only reproductions reaching the time limit before completing – and describe various other findings. Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
[AI-109] Detecting and Controlling Sycophancy with Cascading Linear Features
Link: https://arxiv.org/abs/2606.26155
Authors: Maty Bohacek,Rishub Jain,Nicholas Dufour,Thomas Leung,Chris Bregler,Roma Patel
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpretability frameworks can reliably detect model features responsible for a behavior, and therefore the ability to steer models toward or away from such behavior. In this work, we present an iterative data generation pipeline that isolates cascading linear features responsible for a behavior. Specifically, we show how moving beyond simple binary pairs of samples, and instead isolating samples that show degrees of features that scale linearly with behavior, allows for better disentanglement of features. We focus on detecting and steering away from sycophancy – the tendency of language models to prioritize user validation. We demonstrate that sycophancy features discovered through cascading samples form linearly separable subspaces, and allow for selection of model activations that more clearly correspond to the desired behavior than baseline approaches. We also evaluate their ability to enable detection, deterministic scoring, and robust steering, and find that they match or outperform LLM-as-a-judge and system prompting baselines while requiring less computation and offering stronger interpretability guarantees. Code and data: this https URL
[AI-110] Unsupervised Memory-Enhanced Video Transformers: Obstacle Detection for Autonomous Agricultural Rover
Link: https://arxiv.org/abs/2606.26151
Authors: Théo Biardeau(XLIM-ASALI, UFR SFA (Poitiers)),Anne-Sophie Capelle-Laizé(UP, XLIM-ASALI, XLIM-ASALI),Salwan Alwan,David Helbert(UFR SFA (Poitiers), XLIM-ASALI, LabCom I3M (Poitiers))
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:While autonomous rovers have become indispensable to precision farming, achieving consistent operational safety remains a critical challenge. Conventional safety sensors, such as LiDAR, fail to detect obstacles positioned below the plant canopy, posing a significant risk. While camera-based supervised learning methods can detect common objects, they perform poorly when faced with obstacles that were not present in their training data. Current unsupervised anomaly detection offers a solution by learning the normal visual patterns of an environment, but often fails on the dynamic scenes captured by a moving rover. This paper introduces Video Memory Transformers for Anomaly Detection (VMTAD), a fully unsupervised method designed for real-time obstacle detection in dynamic agricultural scenes. VMTAD utilizes a transformer-driven architecture augmented with a dedicated memory module. This memory module leverages temporal context by processing encoded representations of preceding frames. This approach enables the system to effectively address the dynamic context caused by the robot’s movement. The model is trained using only images that represent normal operation, requiring no data labels. VMTAD was rigorously evaluated on the ‘Grillion’ agricultural rover. On a challenging rapeseed dataset, VMTAD achieved state-of-the-art performance, reaching an Area Under the Receiver Operating Characteristic curve of 0.973 for detection and 0.997 for segmentation. A lightweight variant balances high accuracy with real-time inference (14 ms), which is critical for safety, as confirmed by our analysis of the rover’s total stopping distance.
[AI-111] Multiscale Exit-Join Dynamics: Tactical Consensus and Strategic Coalition Formation
Link: https://arxiv.org/abs/2606.26139
Authors: Quanyan Zhu
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper develops a multiscale model of coalition formation in which strategic exit-and-join decisions are coupled with tactical consensus dynamics inside coalitions. Coalition value is generated endogenously from within-coalition information aggregation, while Aumann-Dreze payoffs, switching frictions, and acceptance rules govern strategic reconfiguration. The framework introduces a fast-slow architecture in which transferable coalition value emerges from DeGroot-style consensus processes, while coalition structures evolve through incentive-driven exit-and-join dynamics. The analysis characterizes joint tactical-strategic equilibria, conditions for tactical and strategic unanimity, segregation, polarization, and cognitive barriers that sustain stable coalition structures. A fixed-point characterization and existence results are established for the coupled dynamics. Numerical experiments reveal an instability-consensus paradox: low or negative switching barriers may prevent strategic convergence while simultaneously promoting temporal mixing sufficient to achieve global tactical consensus. The results provide a unified perspective on coalition formation, consensus dynamics, information aggregation, and strategic stability in multi-agent systems.
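The "DeGroot-style consensus" mentioned above refers to repeated weighted averaging of opinions, x(t+1) = W x(t), with W row-stochastic. The sketch below iterates that textbook update inside a single fixed coalition. The 3-agent weight matrix and the initial opinions are hypothetical, and the paper's coupling to exit-and-join decisions is not modeled.

```python
def degroot_step(weights, opinions):
    """One DeGroot update: each agent adopts a weighted average of all opinions."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]

def degroot_run(weights, opinions, steps):
    """Iterate the DeGroot update for a fixed number of steps."""
    for _ in range(steps):
        opinions = degroot_step(weights, opinions)
    return opinions

# Hypothetical row-stochastic trust matrix (each row sums to 1) with self-loops,
# so the averaging process is connected and aperiodic.
W = [
    [0.6, 0.4, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.4, 0.6],
]
x0 = [0.0, 0.5, 1.0]
x_final = degroot_run(W, x0, 200)
spread = max(x_final) - min(x_final)
```

With a connected, aperiodic weight matrix the opinions contract toward a common value. In the abstract's terms, this is tactical consensus within a coalition.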
[AI-112] Geometric Fairness-Aware Routing for Federated Edge Networks
Link: https://arxiv.org/abs/2606.26125
Authors: Ratun Rahman
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Accepted in the IEEE/ACM Transactions on Networking
Abstract:Emerging 6G and edge-intelligent networks require effective and balanced routing algorithms among varied and spatially distributed devices. Existing federated routing systems often prioritize aggregate latency or throughput over fairness and the underlying geometric structure of network topologies. This paper describes Geo-FairFed, a geometric fairness-aware routing system that blends hyperbolic graph neural networks (HGNNs) and federated optimization to provide equitable performance across edge nodes. Each node learns topology-aware representations on a negatively curved manifold, which capture hierarchical relationships and connection asymmetries. A global aggregator then enforces fairness using a curvature-regularized objective that minimizes routing loss, geometric inconsistency, and an inequality penalty based on Jain’s fairness index. A theoretical analysis establishes convergence guarantees under bounded curvature and shows that the proposed fairness term results in a Pareto-improving equilibrium in routing performance. Extensive simulations on dynamic 6G-edge and IoT topologies show that Geo-FairFed reduces average latency by 20%, reduces energy consumption by 17%, and improves fairness by up to 21% compared to state-of-the-art federated and geometric routing protocols. The study finds that embedding topology in a hyperbolic manifold and incorporating fairness into federated updates can significantly enhance the efficiency and equity of large-scale network routing.
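Jain's fairness index, on which the inequality penalty above is based, is the standard quantity J(x) = (Σx)² / (n · Σx²), ranging from 1/n (one node gets everything) to 1 (perfectly equal shares). A minimal sketch follows. The per-node throughput values are hypothetical, and how the paper turns the index into a penalty term is not shown here.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(allocations)
    sum_sq = sum(x * x for x in allocations)
    if n == 0 or sum_sq == 0:
        # The index is undefined for empty or all-zero allocations.
        raise ValueError("need at least one non-zero allocation")
    total = sum(allocations)
    return (total * total) / (n * sum_sq)

# Hypothetical per-node throughputs.
equal_share = jain_index([5.0, 5.0, 5.0, 5.0])
single_winner = jain_index([20.0, 0.0, 0.0, 0.0])
```

One simple way to penalize inequality would be a term such as `1 - jain_index(x)`. That form is an illustrative assumption, not something the abstract specifies.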
[AI-113] Privacy-Aware Agent Collaboration for Dynamic VR Slice Management in 6G SD-RAN
Link: https://arxiv.org/abs/2606.26123
Authors: Khaled M. Naguib,Soumaya Cherkaoui,Mahmoud M. Elmesalawy,Ahmed M. Abd El-Haleem,Ibrahim I. Ibrahim
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 6 pages
Abstract:Virtual Reality (VR) services in 6G networks require ultra-low latency and high throughput, which presents critical challenges for dynamic resource management in Software-Defined Radio Access Networks (SD-RANs). This work proposes a mobility-driven, privacy-aware Multi-Agent Reinforcement Learning (MARL) framework for VR slice management, in which cooperative agents optimize resource distribution over end-to-end VR links while protecting the privacy of user data. Our approach incorporates mobility prediction and an information bottleneck encoder to facilitate effective and secure agent collaboration. In simulations, comparisons with traditional methods show up to 34% higher throughput, 28% lower resource usage, and 85% less privacy leakage, supporting dependable immersive VR experiences in future 6G environments.
[AI-114] The Open Source Economic Index of AI Adoption and Capability
Link: https://arxiv.org/abs/2606.26118
Authors: Seamus Somerstep,Aritra Guha,Divesh Srivastava,Yuekai Sun
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We work towards measuring both AI adoption and the capability of AI to perform discrete labor tasks across various occupations. To measure adoption, we develop an open-source economic index that uses publicly available user-LLM chat data and O*NET tasks to replicate studies produced by frontier AI labs, finding that occupations in the finance, computer science, and arts sectors have the highest adoption rates. To measure capabilities, we build a system that generates benchmark scenarios grounded in O*NET occupations, tasks, and model-context-protocol (MCP) servers. We test Kimi-k2.5 with an OpenAI agents SDK harness on scenarios across 9 occupations that appear frequently in our index, finding that AI correctly executes high-level workflows but often errs in the granular details (such as specific tool calls used).
[AI-115] Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation
Link: https://arxiv.org/abs/2606.26116
Authors: Will Jack,Noah Lehman,Keller Maloney,Sarah Xu
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:A brand whose customers use both ChatGPT and Claude for product recommendations faces a strategic choice: a single optimization playbook, or one per provider? Across 215 commercially-framed prompts in four measurement batches, the two providers disagree on which brands they recommend roughly two-thirds of the time (cross-provider recommendation Jaccard 0.35, below the 0.50-0.61 same-prompt rerun baseline). The picks diverge. But when neither provider recommends a brand, we classify the failure into one of three modes – discoverability (the brand never reaches the model), compellingness (it reaches the model but isn’t mentioned), or positioning (it’s mentioned but not recommended) – and on 7,763 such joint failures, both providers diagnose the same failure mode 95.1% of the time (clustered 95% CI [94.3%, 95.7%]). Agreement rises monotonically with falling brand prominence, from 81% [78.2%, 84.0%] on category leaders to 99.6% [99.3%, 99.9%] on long-tail regional brands. The two providers reach their picks by measurably different generative routes – Anthropic recommends from priors 43-52% of the time, OpenAI 8-29% – but they converge on the failure diagnosis where it matters most for the long tail. Work that addresses the diagnosed failure mode lifts visibility on both providers; positioning- and content-level work for category leaders is more provider-specific.
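The cross-provider overlap figure quoted above is a Jaccard similarity, |A ∩ B| / |A ∪ B|, between the sets of brands each provider recommends. A minimal sketch follows. The brand names are placeholders, and treating two empty sets as identical (similarity 1.0) is an assumed convention, not one taken from the paper.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity |A & B| / |A | B| between two collections of items."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

# Placeholder recommendation sets for one prompt from two providers.
provider_1 = {"BrandA", "BrandB", "BrandC"}
provider_2 = {"BrandB", "BrandC", "BrandD"}
overlap = jaccard(provider_1, provider_2)
```

A value of 0.35 therefore means that just over a third of all brands recommended by either provider appear in both lists.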
[AI-116] A Multi-Layer AI Framework for Information Landscape Analysis LREC2026
Link: https://arxiv.org/abs/2606.26115
Authors: Maryam Fooladi,Federico Bottino
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at the Information Disorder (InDor) Workshop, LREC 2026. 10 pages
Abstract:This paper proposes a multi-layer AI framework for information landscape analysis in the context of information disorder. Rather than treating misinformation detection as a binary fact-checking task, the framework analyzes political and media content across multiple dimensions, including source reliability, factual structure, framing, bias, emotional activation, manipulation patterns, and propagation dynamics. The goal is to move beyond isolated claim verification toward a structured representation of the informational environment surrounding an event, entity, or narrative. We argue that AI systems for media analysis should support epistemic mapping: a transparent, multi-dimensional account of how facts, interpretations, actors, and narratives interact over time. The paper presents the conceptual architecture, analytical layers, and methodological rationale of the framework, with the aim of supporting more nuanced, explainable, and critically useful tools for information disorder research.
[AI-117] Dream machine – the next creative economy
Link: https://arxiv.org/abs/2606.26114
Authors: Peter Woodbridge,John J. O’Hare
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 409 pages, 11 figures, 4 tables
Abstract:We examine the structural transformation of creative industries under generative artificial intelligence, drawing on 374 primary sources spanning policy documents, industry data, creator surveys, and platform analytics. Beginning with the December 2024 release of OpenAI’s Sora video model as a watershed event, we trace the historical pattern of creative resistance to technological disruption, then develop an analytical framework, the Human-AI Agency Continuum, for mapping the spectrum of human and machine collaboration in creative work. We present evidence for the “slop ceiling,” an audience-imposed quality threshold that constrains AI-generated content to approximately 1–3% of platform streams despite comprising 44% of uploads. Analysis of the UK Government’s 2025 consultation on AI and copyright (over 11,500 responses, 88% opposing expanded AI training rights) reveals deep structural tensions between technology firms and creative workers. We investigate how major studios, from Disney’s $1 billion OpenAI investment to Netflix’s AI-native animation unit, are positioning for an AI-augmented production pipeline. The work covers coordination collapse in creative supply chains and the emergence of new professional roles such as prompt engineers and AI orchestrators, and proposes four principles for navigating the transition: transparency, consent, compensation, and human-centred design. Eight appendices provide quantitative analysis, a glossary, topical bibliography, and deep dives into shadow AI adoption, AI stigma, and algorithmic intent.
[AI-118] Generative AI and Copyright Infringement: A Legal-Technical Analysis of AI Music Generation Systems Under 17 U.S.C. Title 17
Link: https://arxiv.org/abs/2606.26111
Authors: Zuhaib Hussain Butt
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Sound (cs.SD)
Comments:
Abstract:Generative artificial intelligence (GenAI) has enabled users to synthesize music with text prompts, combining copyrighted lyrics, AI-composed melodies, and synthetic vocals that imitate real artists. This paper examines the legal and technical dimensions of AI-based music creation (e.g., Google Gemini’s music tools) under U.S. copyright law. We analyze whether a user who inputs one artist’s protected lyrics into a GenAI system, directs it to use another artist’s voice or style, publishes the resulting song, and monetizes it violates 17 U.S.C. Section 106’s exclusive rights [3]. The analysis integrates Title 17 doctrine (rights of reproduction, derivative works, distribution), 17 U.S.C. Section 114’s narrow sound recording protection [4], and the new voice-cloning laws emerging at the state level [20]. We argue that unauthorized lyric copying poses a high risk of infringement of the musical composition, whereas mere AI-generated voice imitation typically falls outside federal sound recording protection and instead implicates state publicity rights [12], [13]. Recent cases and legislation (Concord v. Anthropic [10]; Kadrey v. Meta [11]; Lehrman v. Lovo [12]; Tennessee’s “ELVIS Act” [20]; UMG v. Uncharted Labs [14]; etc.) illustrate this split. We map AI technical components (prompt encoding, latent diffusion, neural vocoders, speaker embeddings) to legal risks and identify a regulatory gap: federal law robustly protects lyrics and melody but currently provides limited remedies for synthesized vocal likeness [22], [23]. The paper concludes with policy suggestions for clearer rules on AI music creation.
[AI-119] Benchmarking Open-Weight Foundation Models for Global AI Technical Governance
Link: https://arxiv.org/abs/2606.26099
Authors: Jason Hung
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 3 tables
Abstract:Large language models (LLMs) are increasingly deployed in artificial intelligence (AI) governance analysis across national and international organisations. There is, however, growing evidence that such models produce significantly less accurate responses for countries that are underrepresented in their training data, a pattern described in existing literature as geographic bias. Existing studies examining this phenomenon are subject to three methodological limitations that together undermine their findings: (1) reliance on proprietary systems whose weights are not publicly released, which prevents independent replication; (2) evaluation of model knowledge about years that fall after data collection for model training had concluded, leading to geographic ignorance in addition to the natural limits of each model’s knowledge; and (3) use of coarse binary response classification that cannot distinguish models’ confident fabrication (HF) from their honest acknowledgement of uncertainty. This study addresses all three limitations by benchmarking four open-weight frontier language models against the Global AI Dataset v2 (GAID v2), a verified ground-truth database of 24,453 indicators across 227 countries published on Harvard Dataverse in January 2026. A total of 18 indicators, mapped to the eight thematic dimensions of the IEEE IRAI 2026 framework, are selected from GAID v2, yielding approximately 2,990 country-metric-year observations across six evaluation years (within the period of 2010-2023). Model responses are classified using a five-category scheme that distinguishes (a) verified accuracy (VA), (b) HF, (c) honest refusal (HR), (d) qualitative hedging (QH), and (e) misattribution (MF). Geographic disparities in accuracy are estimated through mixed-effects logistic regression and difference-in-differences (DiD) analysis.
[AI-120] Efficient foundation decoders for fault-tolerant quantum computing
Link: https://arxiv.org/abs/2606.27119
Authors: Ge Yan,Shanchuan Li,Shiyi Xiao,Pengyue Ma,Hanyan Cao,Feng Pan,Yuxuan Du
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages, 9 figures, comments are welcome
Abstract:Foundation decoders, a class of high-capacity neural decoders, are leading candidates for fault-tolerant quantum computing, with accurate and efficient decoding at large code distances. However, their construction often faces a steep scaling barrier, as larger code distances rapidly amplify the cost of syndrome generation and neural optimization. To address this bottleneck, here we devise neural transfer unification (NTU), a unified framework for efficient foundation decoders. A central feature of NTU is its ability to align decoding tasks across code distances via algebraic structures shared by scalable code families, which enables knowledge learned on smaller codes to accelerate large-scale decoder training. We instantiate NTU as NTU-Transformer, a transformer-based neural decoder tailored for planar surface codes and bivariate bicycle codes. For planar surface codes under circuit-level noise, NTU-Transformer outperforms correlation-aware matching on the [[361,1,19]] code and further scales to the [[625,1,25]] code, where it exceeds standard matching through transfer adaptation. For the bivariate bicycle code with [[72,12,6]], it surpasses Relay-BP in the low-physical-error regime. These results establish our proposal as a scalable route to amortized cross-distance training of foundation decoders for fault-tolerant quantum processors.
[AI-121] Beyond Global Divergences: A Local-Mass Perspective on Bayesian Inference
Link: https://arxiv.org/abs/2606.27090
Authors: Hanli Xu,Fengxiang He,Sarat Moka
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 3 figures, 2 tables
Abstract:Global objectives, such as KL divergence and ELBO, are widely used in Bayesian inference for measuring distributional discrepancy. This paper studies their local-mass behaviour that is not directly captured by such objectives. We introduce and use two mathematical tools: (1) Mass Index for recording the polynomial and logarithmic decay scales of local mass, and (2) regularised extended KL (RE-KL), a set-localised divergence that can be formulated in the presence of singular components. Mass Indices help characterise how Bayesian updating changes local mass: (1) power-log likelihood factors shift it explicitly, and (2) parameter-dependent supports, or their smooth softenings, may change the local scale through the amount of mass that remains near the parameter value. Using local RE-KL, we prove absolute, relative, and directional inequalities for comparing local small-ball masses under the two KL directions. Together, these results provide a local theoretical account of local mass behaviour. Experiments provide controlled illustrations of the local behaviour. Code is available at this https URL.
[AI-122] Inverse Design of Compact and Wideband Inverted Doherty Power Amplifiers Using Deep Learning
Link: https://arxiv.org/abs/2606.27002
Authors: Han Zhou,Haojie Chang,David Widen,Christian Fager
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Comments:
Abstract:This paper presents a deep learning-assisted methodology for the inverse synthesis of a compact, wideband inverted Doherty power amplifier (PA). Convolutional neural networks (CNNs) and genetic algorithms (GAs) are jointly employed to generate pixelated Doherty combiner networks that integrate load modulation, impedance matching, power combining, and phase compensation into a single structure. As a proof of concept, we design and fabricate a GaN HEMT Doherty PA with a pixelated output combiner. The prototype achieves a measured peak drain efficiency of 51%-63% and a 6-dB back-off efficiency of 48%-54% over 1.9-2.5 GHz. Within the same frequency range, the measured output power is 44+/-0.3 dBm. Furthermore, with digital predistortion (DPD) applied, the prototype circuit demonstrates an adjacent channel leakage ratio (ACLR) better than -53.2 dBc.
[AI-123] XMSE-Aware Adaptive Empirical Bayes Estimation
Link: https://arxiv.org/abs/2606.26975
Authors: Minghao Chen,Jiale Zheng
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Methodology (stat.ME)
Comments: 16 pages, 1 figure, 14 tables
Abstract:Empirical Bayes (EB) estimators can match the first-order asymptotic risk of maximum likelihood (ML) while behaving very differently at second order: recent excess mean squared error (XMSE) analysis shows that kernel-based EB estimation may be worse than ML when the kernel is poorly aligned with the true parameter. This paper turns that diagnostic into a design principle. We propose an XMSE-aware mixed estimator that interpolates between ML and EB shrinkage. Its fixed-weight XMSE is a scalar quadratic, yielding a closed-form oracle mixing weight that is no worse than both ML and the base EB estimator at the XMSE scale. A plug-in implementation based on finite-sample XMSE approximations is proved consistent, with a second-order oracle regret rate for an interior oracle weight. We further establish a transfer of the regret bound to the fixed-weight risk curve evaluated at the selected weight, a thresholded boundary rule, and extensions to compact kernel families and to finite and growing kernel dictionaries with high-probability oracle bounds. Finite impulse response simulations with SURE-tuned, hard-selection, and trace-corrected baselines, together with the public Silverbox and Cascaded Tanks benchmarks, show that the proposed estimator retains most of the benefit of regularization when it is helpful and retreats toward ML under kernel misspecification, with an identified finite-de analyzed on the benchmarks.
[AI-124] scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology
Link: https://arxiv.org/abs/2606.26563
Authors: Ian Diks,Zhen Yang,Arjun Banerjee,Tim Proctor,Kenny Workman
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:
Abstract:Single-cell studies require analysts to convert raw measurements into specific biological claims through multi-step workflows and integration of metadata, assay context, and auxiliary evidence. Existing AI-biology benchmarks largely measure broad knowledge, executable workflows, or local analysis steps. We introduce scBench-Long, a benchmark for long-horizon single-cell biology in which agents must recover scientific conclusions from raw or near-raw data without prescribed methods. The benchmark contains 21 evaluations spanning melanoma CD8 T-cell reactivity, CD8 RNA+ATAC regulatory inference, human–monkey chimera development, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology. Tasks cover paired scRNA/TCR sequencing, RNA and chromatin profiling, cross-species transcriptomics, combinatorial scRNA-seq, single-nucleus RNA-seq, immune repertoires, ortholog maps, ligand–receptor resources, and validation evidence. Candidate claims are reproduced, reviewed, and converted into controlled answer vocabularies with deterministic grading and trajectory rubrics. Across 1,068 completed trajectories, the strongest model–harness pair passes 16/63 runs (25.4%). scBench-Long evaluates whether agents can move beyond local analysis steps and make complex scientific claims that are supported by single-cell data.
[AI-125] Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist
Link: https://arxiv.org/abs/2606.26448
Authors: Akshay K. Jagadish,Younes Strittmatter,Nori Jacoby,George Kachergis,Eric Schulz,Nathaniel Daw,Suyog H. Chandramouli,Thomas L. Griffiths
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 44 pages, 9 figures
Abstract:Across the sciences, autonomous systems are increasingly being used in closed-loop discovery, proposing new theories and designing and running experiments to test them. This approach is yet to be applied in the field of cognitive science, where the central bottleneck is theory-building: the creative step of turning the accumulated failures of existing models into better ones. Theory generation has remained manual even as data collection, modeling, and experiment design have been automated. We present the Automated Cognitive Scientist (AutoCog), a fully autonomous agentic-AI system that closes this loop. Large-language-model agents advocate competing theories, each expressed as an executable cognitive model, design experiments that best discriminate them, collect behavioral data from participants recruited online, score theories against collected data based on their generative performance, diagnose why they fail, and synthesize a better successor. Repeating this cycle allows them to search the space of theories, models, and experiments. In the domain of decision-making, AutoCog recovered known decision-making strategies from simulated behavior, including unconventional ones, showing that its discoveries are ultimately driven by the data rather than strictly bound by the priors of the underlying language models. When run with human participants, it produced theories that outperformed the established theories it was seeded with and generalized to held-out studies in two different experimental settings. It also surfaced a novel theory of multi-cue decision-making in which choices show diminishing sensitivity to feature values. The distinctive predictions of this theory were confirmed in a preregistered study with new participants. AutoCog demonstrates how an automated discovery system can be used to turn cognitive theory-building into an explicit, executable, and cumulative science.
[AI-126] Sampling sea state using a diffusion model
Link: https://arxiv.org/abs/2606.26389
Authors: Jiarong Wu,Bertrand Chapron,Laure Zanna
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sea state prediction is essential for operational maritime applications and coupled earth system modeling, yet current spectral wave models remain computationally prohibitive for many use cases, including online coupling to climate simulations and making probabilistic (ensemble-based) predictions. While deep learning has recently demonstrated strong performance in weather forecasting, existing AI-based wave models are predominantly deterministic and largely limited to bulk variables such as significant wave height, leaving probabilistic sea state estimation largely unexplored. In this work, we propose a diffusion-based generative model for global sea state estimation that conditions on a relatively long history (5 days) of global wind forcing. This generative model directly samples the complex conditional distribution of sea state without autoregressive time-stepping. Unlike prior approaches, our framework naturally extends beyond bulk variables to estimate partition-related variables and derived quantities, such as Stokes drift and mean square slope. Trained on a 30-year global WAVEWATCH-III hindcast, the model achieves substantial computational acceleration compared with numerical spectral models while delivering skillful predictions and a calibrated ensemble spread for the bulk variables. Our results suggest that diffusion-based sea state sampling offers a promising path toward probabilistic wave forecasting and efficient coupling of sea state information into broader earth system models.
[AI-127] Parametric Generalized Adaptive Moment Features (PG-AMF) for Bearing Fault Diagnosis and Machine Health Monitoring
Link: https://arxiv.org/abs/2606.26317
Authors: Rajeev Kumar
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 7 pages, 9 figures, and 4 tables. Conference
Abstract:Accurate fault diagnosis of rolling element bearings in rotating machinery is considered essential for ensuring industrial safety and enabling predictive maintenance. Conventional statistical feature-based methods rely on predefined descriptors, whose diagnostic sensitivity is constrained by fixed configurations and limited adaptability across varying fault conditions. Although deep learning approaches offer strong representational capacity, their effectiveness is often restricted by high data requirements and reduced interpretability. In this work, a parametric adaptive feature extraction framework is proposed, in which feature characteristics are learned directly from data rather than being manually specified. Multiple complementary representations are extracted from vibration signals, including absolute features capturing signal energy distribution, signed moment features reflecting waveform asymmetry, and AC-coupled moment features emphasizing dynamic fluctuations, while interactions between multiple sensor channels are modeled through a structured fusion mechanism to enhance fault representation. The proposed approach is evaluated on a benchmark gearbox bearing dataset comprising five health conditions, including normal operation and multiple fault types. Improved classification performance is observed compared to conventional methods, with consistent results under cross-validation, indicating strong generalization capability. Additionally, enhanced feature separability is demonstrated through clearer clustering patterns in low-dimensional projections. The learned representations effectively capture a wide range of signal characteristics, supporting both improved diagnostic performance and practical applicability in industrial monitoring systems.
Machine Learning
[LG-0] Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
Link: https://arxiv.org/abs/2606.27369
Authors: Yingyu Lin,Qiyue Gao,Nikki Lijing Kuang,Xunpeng Huang,Kun Zhou,Tongtong Liang,Zhewei Yao,Yi-An Ma,Yuxiong He
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.
[LG-1] Blackwell Approachability and Gradient Equilibrium are Equivalent COLT2026
Link: https://arxiv.org/abs/2606.27315
Authors: Brian W. Lee,Nika Haghtalab,Michael I. Jordan,Ryan J. Tibshirani
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 1 figure, accepted for presentation at COLT 2026
Abstract:Gradient equilibrium (GEQ) is a recently introduced online optimization framework that generalizes first-order stationarity from offline optimization and abstracts problems like online conformal prediction. While GEQ has curious similarities with known online learning frameworks, namely regret minimization, prior work has shown that GEQ error and regret are incomparable objectives, leaving open a precise understanding of how GEQ fits into the broader online learning landscape. In this work, we show that GEQ is equivalent to Blackwell approachability in the algorithmic sense. That is, a Blackwell approachability problem can always be solved using queries to a black-box GEQ oracle, with no asymptotic loss in the oracle’s error rate, and vice versa. Taken together with known equivalences between approachability, regret minimization, and calibration, these results imply that GEQ is equivalent to these frameworks, as well. Our reductions are efficient and can be used to transfer refined guarantees, such as optimism and strong adaptivity, from regret minimization to GEQ. Along the way, we also identify necessary and sufficient conditions for GEQ, and establish reductions between different notions of GEQ with unconstrained and constrained decision sets.
[LG-2] A Multi-Fidelity Convolutional Autoencoder-Transfer Learning Framework for Guided-Wave-Based Damage Diagnosis Using Large Simulated and Limited Experimental Datasets
Link: https://arxiv.org/abs/2606.27304
Authors: Santosh Kapuria,Abhishek
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 24 figures
Abstract:Guided wave-based structural health monitoring (GWSHM) with onboard transducers offers significant potential for the early diagnosis of damage in engineering structures. However, the practical deployment of deep learning models is often hindered by the limited availability of labelled experimental data and the high computational cost of generating large-scale high-fidelity simulation datasets. This study presents a multifidelity transfer learning framework that integrates lightweight physics-based simulations, convolutional autoencoder (CAE)-based deep feature learning, a feed-forward neural network, and limited experimental measurements for accurate damage localisation and sizing in plate-like structures instrumented with piezoelectric transducers. A computationally efficient one-dimensional time-domain spectral element model is employed to generate a large synthetic dataset for pretraining, while transfer learning adapts the model to experimental domains using only a small amount of labelled data. The CAE-based transfer learning framework significantly outperforms its CNN-based counterpart in damage localisation accuracy. The model achieves excellent predictive performance with R^2 scores exceeding 0.93 for damage localisation and 0.99 for damage sizing. Its generalisation capability is demonstrated on previously unseen data, showing high prediction accuracy for damage scenarios not represented during pretraining or fine-tuning. The results establish the proposed framework as an accurate, computationally efficient, and practically viable solution for real-world GWSHM applications.
[LG-3] Fast algorithms for learning a Gaussian under halfspace truncation with optimal sample complexity COLT2026
Link: https://arxiv.org/abs/2606.27298
Authors: Haitong Liu,Deepak Narayanan Sridharan,David Steurer,Manuel Wiedmer
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 88 pages; accepted at the 39th Annual Conference on Learning Theory (COLT 2026)
Abstract:We study the fundamental problem of learning a high-dimensional Gaussian truncated to an unknown halfspace. Lee, Mehrotra and Zampetakis (FOCS’24) recently obtained the first polynomial time algorithm for this problem, but their resulting sample and time complexity bounds are not optimal. Under non-trivial truncation, for any target accuracy \varepsilon > 0 and dimension d we give an efficient algorithm that uses n = \tilde{O}(d^2/\varepsilon^2) samples and learns the underlying Gaussian to error \varepsilon in total variation distance. Our algorithm is also fast: its runtime is dominated by the cost of computing the empirical covariance matrix. Both our sample and time complexity are optimal in terms of d and \varepsilon even without truncation: in this regard, we can learn a Gaussian under halfspace truncation for free. The key ingredient behind our result is a novel reinterpretation of the low-degree moments of the truncated Gaussian in terms of a relative truncation parameter. This relative truncation parameter uniquely determines the parameters of the untruncated Gaussian and enables direct parameter recovery. This reinterpretation allows us to circumvent the time intensive projected stochastic gradient descent procedure that is widely used in learning under truncation.
[LG-4] Generative Models on Analog Hardware with Dynamics
Link: https://arxiv.org/abs/2606.27294
Authors: Yu-Neng Wang,Sara Achour
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:
Abstract:Analog hardware platforms such as coupled oscillators and Analog Ising Machines naturally solve differential equations at a fraction of the energy cost of digital computation, making them attractive for low-power generative modeling, yet a fundamental mismatch exists: modern generative models assume flexible, software-defined dynamics, whereas analog hardware imposes fixed, physics-determined differential equations with limited approximation capacity. This paper introduces Analog Interaction Systems (AIS), a unified framework for hardware-implementable dynamical systems, and empirically characterizes their expressivity gap relative to neural network baselines. Two hardware-compatible mechanisms are proposed to narrow this gap - time-varying piecewise parameters and hidden physical states - and a Wasserstein GAN training procedure is developed to enable training of these models without requiring them to follow a specific trajectory. We characterize how area and power scale with connection density and precision, showing that sparse connectivity and low-bit-width quantized parameters are necessary for practical implementation, and estimate an energy cost of 23 µJ per generated image for the chosen architecture, representing a two-orders-of-magnitude improvement over digital baselines. On MNIST and Fashion-MNIST, our oscillator-based AIS achieves FID scores of 27.6 and 80.8, outperforming the best prior hardware-implementable analog generative models by 3-4x with a 4-bit sparse architecture.
[LG-5] Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search KDD2026
Link: https://arxiv.org/abs/2606.27291
Authors: Ping Liu,Qianqi Shen,Jianqiang Shen,Wenqiong Liu,Rajat Arora,Yunxiang Ren,Chunnan Yao,Dan Xu,Baofen Zheng,Wanjun Jiang,Andrii Soviak,Kevin Kao,Jingwei Wu,Wenjing Zhang
Subjects: Machine Learning (cs.LG)
Comments: Accepted to KDD 2026 Workshop on AI Agent for Information Retrieval (Agent4IR)
Abstract:Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework to generate portable job search queries, terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. This task introduces a highly adversarial reward surface where policy optimization frequently exploits flaws in LLM-as-judge rubrics, resulting in degenerate verbatim-copying behaviors. We conducted comprehensive empirical experiments to isolate the impact of optimization mechanics against structured reward engineering. Our results demonstrate that for critic-free optimizers, performance is overwhelmingly dictated by robust reward shaping, rendering the specific choice of algorithm largely immaterial. While critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward-hacking, the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. We show that introducing a deterministic, rule-based reward floor to correct for rewards assigned to verbatim copying mitigates this failure mode, resulting in a substantial +0.147 quality improvement on a cross-family evaluation judge. Ultimately, we show that the training-time reward model inflates performance gains by 2.4×, confirming that the training success is fundamentally dependent on enforcing reward-shaping disciplines rather than selecting alternative optimizers.
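The contrast the abstract draws between group-relative normalization and per-rollout baselines can be illustrated with a minimal sketch. It uses the commonly cited textbook forms of the two advantage estimators, not the paper's implementation; the function names and the reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each rollout's baseline is the mean of the other rollouts."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical judge scores for four rollouts of one prompt.
# Three are identical; one is 0.01 higher, which could be pure judge noise.
rewards = [0.50, 0.50, 0.50, 0.51]

grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)

# Dividing by the (tiny) group std rescales the 0.01 gap to an advantage
# of order 1, while the leave-one-out baseline keeps it at the scale of
# the raw reward difference.
print("GRPO:", [round(a, 3) for a in grpo])
print("RLOO:", [round(a, 4) for a in rloo])
```

Under these assumptions, a near-constant reward group is where the two estimators differ most, which is consistent with the sensitivity to spurious reward signals that the abstract reports for GRPO.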
[LG-6] Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs
Link: https://arxiv.org/abs/2606.27285
Authors: Yang Pan,Helmut Bölcskei
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Classical Analysis and ODEs (math.CA); Dynamical Systems (math.DS)
Comments:
Abstract:Learning governing equations from observed solution data is a fundamental challenge in scientific machine learning \cite{bruntonDiscoveringGoverningEquations2016,kovachkiNeuralOperatorLearning2023,longPDENetLearningPDEs2018,rudyDatadrivenDiscoveryPartial2017,raonicConvolutionalNeuralOperators2023}, yet the theoretical conditions under which a ground-truth ODE can be uniquely and stably identified from multiple solution observations remain largely undeveloped, and no quantitative analysis of the sample complexity of such learning tasks exists in the literature. To address this gap, we introduce the Hausdorff distance on solution sets as the natural metric for comparing differential equations, since it captures the worst-case separation between two equations over all admissible initial conditions and thus encodes the minimax structure of the identification problem. We establish identifiability bounds for governing ODEs across a wide class of structure equations, ranging from linear ODEs to nonlinear classes with Lipschitz (Hölder)-continuous vector fields, characterizing precisely when two distinct equations can be distinguished from solution data. Using this metric, we derive metric entropy estimates for the relevant ODE classes and analyze sample complexity bounds, quantifying how many solution observations are needed to reliably recover the governing equation.
[LG-7] How Good Can Linear Models Be for Time-Series Forecasting?
Link: https://arxiv.org/abs/2606.27282
Authors: Lang Huang,Jinglue Xu,Luke Darlow
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 10 figures, and 5 tables
Abstract:Time-series forecasting research has been moving steadily toward larger architectures, from specialized transformers to general-purpose foundation models, on the assumption that capacity is what unlocks accuracy. We take the opposite position: most of the gap can be closed at far lower cost by tuning preprocessing rather than scaling models. We use Ridge regression as the testbed, since it has a closed-form solution and interpretable weights, which let the optimal hyperparameters be read off the search directly. We search over context length, local normalization, regularization, and augmentation on eight standard benchmarks and find three patterns. (1) Optimal lookback is strongly series-specific and often non-monotonic in forecast horizon, with fitted power-law exponents ranging from +0.46 on ETTm2 to -0.19 on Exchange and Traffic, challenging the convention that longer horizons need longer history. (2) Normalizing over a learned trailing fraction of the context, rather than its entirety, is almost universally preferred. (3) Series within the same dataset often disagree on hyperparameters; the optimal degree of cross-series sharing varies from fully shared to fully per-series. The resulting models beat prior linear forecasters on most dataset-horizon entries and exceed Transformer, MLP, and CNN baselines on six of eight benchmarks. The optimized hyperparameters also serve as a diagnostic on the data itself, revealing structures that larger models absorb silently into their learned parameters.
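The closed-form Ridge solution mentioned in the abstract can be sketched as follows, as a rough illustration only. The sketch fits the standard ridge weights w = (X^T X + λI)^{-1} X^T y on lagged windows of a toy series. It omits the normalization, augmentation, and hyperparameter search the paper describes, and the helper names (`solve`, `fit_ridge`, `predict_next`) and the toy data are assumptions made for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ridge(series, lookback, lam):
    """Closed-form ridge weights on lag windows: (X^T X + lam * I) w = X^T y."""
    X = [series[t - lookback:t] for t in range(lookback, len(series))]
    y = [series[t] for t in range(lookback, len(series))]
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(lookback)] for i in range(lookback)]
    rhs = [sum(row[i] * target for row, target in zip(X, y))
           for i in range(lookback)]
    return solve(gram, rhs)


def predict_next(series, weights):
    """One-step forecast from the last len(weights) observations."""
    window = series[-len(weights):]
    return sum(w * v for w, v in zip(weights, window))


# Toy period-4 series 0, 1, 2, 3, 0, 1, ... (hypothetical data).
series = [float(t % 4) for t in range(40)]
weights = fit_ridge(series, lookback=4, lam=1e-6)
forecast = predict_next(series, weights)
print([round(w, 4) for w in weights], round(forecast, 4))
```

Here `lookback` and `lam` play the role of the context length and regularization strength that the abstract says are searched over. On this toy series the fitted weights place almost all mass on the observation four steps back, which is the kind of directly readable structure that makes a linear model interpretable.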
[LG-8] BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media
Link: https://arxiv.org/abs/2606.27274
Authors: MSVPJ Sathvik,Parmitha Vangapadu,Nishit Rane,Sathwik Narkedimilli,Mark Lee,Akrati Saxena
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The promotion of betting applications on social media platforms has increased significantly in recent years. Many of these advertisements use persuasive techniques that may mislead users, encourage risky behavior, and potentially influence users’ mental well-being. However, research on the automated detection of manipulative and deceptive betting advertisements remains limited due to the lack of publicly available annotated datasets. In this work, we introduce a new dataset of betting-related advertisements collected from two widely used social media platforms, Instagram and Reddit. The advertisements were manually annotated for manipulative and deceptive advertising practices. In addition to classification labels, the dataset includes human-provided explanations that describe the reasoning behind each annotation, enabling research into explainable approaches to detecting manipulative advertising. Furthermore, we analyze the strategies commonly used in betting advertisements and examine how these persuasive tactics may impact users’ mental health. The proposed framework can also enable practical applications such as browser plugins that warn users about manipulative betting advertisements and automated web crawlers that help regulatory authorities monitor and detect such promotions online.
[LG-9] RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
Link: https://arxiv.org/abs/2606.27247
Authors: Parmitha Vangapandu,Sai Ganesh Mokkapati,Sathwik Narkedimilli,MSVPJ Sathvik,Timothy Liu,Simon See,Johannes C. Eichstaedt
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational triggers. We introduce the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts annotated by psychiatrists for diagnostic categories, including the most prevalent mood disorders (anxiety and depression), relational stressor triggers, and indications of relationship phase. We benchmark seven fine-tuned transformer models and five large language models across multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks. We find clear task-dependent differences between model families, with Claude-3-Haiku achieving the best disorder classification performance (Macro-F1 = 0.538) and GPT-4o obtaining the strongest relational trigger detection performance (Macro-F1 = 0.519), suggesting distinct model capabilities. We further find strong associations between anxiety disorders and chronic relational uncertainty. Overall, RSPC establishes a benchmark for NLP tasks that consider relational context and supports a shift from individual-centric to context-aware mental health modeling that captures the social and temporal dynamics of distress.
[LG-10] Effective Covariance Dynamics in Solvable High-Dimensional GANs
Link: https://arxiv.org/abs/2606.27246
Authors: Andrew Bond,Zafer Doğan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study a solvable high-dimensional model of generative adversarial network (GAN) training in which a linear generator learns a low-dimensional subspace from data with structured latent covariance. Prior solvable GAN analyses assume unconditional signals with diagonal latent covariance; we extend the multi-feature discriminator setting to class-dependent, correlated, and non-zero-mean latent structure. For the quadratic energy discriminator, all such heterogeneity enters the dynamics through a probability-weighted effective second moment. We prove that the stochastic microscopic training process converges, in the high-dimensional limit, to deterministic ordinary differential equations governed by this effective covariance. In the matched-covariance specialization, the stability analysis yields a mode-wise solvable interval determined by the learning rates and noise level: learning begins when the leading effective eigenvalue crosses the lower threshold, while full recovery requires all relevant effective modes to remain within the interval. This reveals a signal-boosting mechanism: low-rank correlations can lift weak directions above the learnability threshold, whereas overly strong correlations destabilize recovery. Numerical simulations validate the ODE, phase boundary, and boosting mechanism. Experiments on MNIST, FashionMNIST, and CIFAR-10 further show that informed generator covariance improves alignment with the data-driven reference subspace.
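One plausible reading of the "probability-weighted effective second moment" is the mixture second moment, summing p_k (Sigma_k + mu_k mu_k^T) over classes k. The sketch below assumes that reading. The paper's exact definition may differ, and the function name is hypothetical.

```python
import numpy as np

def effective_second_moment(probs, means, covs):
    """Mixture second moment: sum_k p_k * (Sigma_k + mu_k mu_k^T).

    Assumed form only; it is not taken from the paper.
    """
    dim = len(means[0])
    total = np.zeros((dim, dim))
    for p, mu, cov in zip(probs, means, covs):
        mu = np.asarray(mu, dtype=float)
        total += p * (np.asarray(cov, dtype=float) + np.outer(mu, mu))
    return total
```

Under this reading, class means add a low-rank term to the covariance, which is one way a weak direction could have its effective eigenvalue raised.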
[LG-11] Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization
Link: https://arxiv.org/abs/2606.27216
Authors: Ziyuan Tang,Tianshi Xu,Yousef Saad,Yuanzhe Xi
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 23 pages, 10 figures, 3 tables
Abstract:Muon-type optimizers construct update directions for dense neural-network weights by applying a finite Newton-Schulz map to momentum-gradient matrices. For an H \times W matrix, with r=\min\{H,W\} and s=\max\{H,W\}, K steps of the full-matrix Newton-Schulz update require O(r^2 s K) work and couple all rows and columns through repeated Gram matrix products. We introduce Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimization. HiMuon partitions each momentum-gradient matrix into T \times T tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite T below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite T, the leading Newton-Schulz work decreases to O(H W T K), and the computation decomposes into independent small dense matrix operations. This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules. Experiments on transformer training and controlled matrix-function diagnostics show that HiMuon improves optimizer-step efficiency while keeping training behavior close to full-matrix Muon in the tested regimes.
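The tiling idea can be sketched in NumPy as follows. This is an illustration under assumptions, not the HiMuon implementation: it uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X with Frobenius normalization (Muon-type optimizers typically use a tuned polynomial instead), it requires the matrix dimensions to be divisible by the tile size, and the function names are hypothetical.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of G."""
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius norm bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix X @ X.T as small as possible
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def tiled_newton_schulz(M, tile, steps=5):
    """Apply the same Newton-Schulz map independently to each tile-by-tile block."""
    H, W = M.shape
    if H % tile or W % tile:
        raise ValueError("this sketch assumes H and W are divisible by the tile size")
    out = np.empty((H, W), dtype=float)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            out[i:i + tile, j:j + tile] = newton_schulz(M[i:i + tile, j:j + tile], steps)
    return out
```

Each T \times T tile costs O(T^3 K) and there are H W / T^2 tiles, which gives the O(H W T K) total stated in the abstract. Because the tiles are independent, the double loop could be replaced by a batched matrix product.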
[LG-12] Graph Neural Networks Applications Across Domains: All Insights You Need
Link: https://arxiv.org/abs/2606.27202
Authors: Abderaouf Bahi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph neural networks have moved from a niche representation-learning technique to the default model class wherever data carry relational structure. The interesting question is no longer whether message passing helps on a given dataset, but where graph structure earns its computational cost and where it does not. This survey organises the field around a single design space, derives the spectral and spatial formulations from shared first principles, and connects expressive power to the Weisfeiler-Leman hierarchy with explicit statements of what current architectures can and cannot separate. Against that methodological backbone we examine twelve application domains, among them recommendation and social networks, knowledge graphs and language-model integration, drug discovery and molecular property learning, healthcare and neuroscience, computer vision, traffic and urban computing, power and renewable-energy systems, wireless and sixth-generation networks, fraud and cybersecurity, industrial prognostics, materials science, and climate modelling. For each domain we specify the graph-construction choices and their costs, identify which architecture families dominate and why, and separate reported gains from artefacts of weak baselines or favourable splits. A cross-domain comparison exposes recurring patterns: heterophily and scale undercut the same models almost everywhere, temporal graphs remain harder than their static counterparts, and the architectures that top public leaderboards are seldom the ones that reach deployment. We treat over-smoothing, over-squashing, robustness, distribution shift, fairness, and explainability not as a closing checklist but as the constraints that decide adoption.
[LG-13] Explaining Temporal Graph Neural Networks via Feature-induced Information Flow
Link: https://arxiv.org/abs/2606.27201
Authors: Ping Xiong,Thomas Schnake,Klaus-Robert Müller,Shinichi Nakajima
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Event-based Temporal Graph Neural Networks (ETGNNs) have demonstrated strong performance across a wide range of applications, including social network analysis, epidemic tracing, recommender systems, and political event forecasting. However, their increasing complexity poses significant challenges for explainability. Existing explanation methods focus only on a subset of the information flow within ETGNNs, typically tracing contributions from the event-related embeddings to the output. Consequently, they overlook the important pathways through event-induced variables, which mediate interactions between nodes and thereby play a central role in capturing long-range temporal dependencies. To overcome this limitation, we propose a novel attribution method that analyzes the entire information flow through all event-associated variables. Our method is built upon the recent Normalized Relevance Measure (NRM) framework, which enables explicit quantification of information flow originating from event embeddings as well as information flow passing through event-induced variables. It also ensures comparability of latent variables across layers, and supports higher-order analysis of interactions between events. To handle the architectural complexity of ETGNNs, we extend the NRM framework with a modular decomposition procedure that facilitates the systematic construction of relevance structure for complex neural architectures. We evaluate our approach on two synthetic datasets for epidemic tracing and social dynamics, as well as a real-world dataset of political event networks. Our qualitative and quantitative experiments show that our method consistently outperforms existing explanation approaches while producing more human-interpretable explanations.
[LG-14] RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage
Link: https://arxiv.org/abs/2606.27174
Authors: Ali Semih Atalay,Sevgi Yigit-Sert
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Medical device recalls are a critical regulatory mechanism for protecting patient safety. The growing volume of FDA recall records presents challenges in post-report recall triage, severity assessment, and root-cause interpretation. Existing studies mostly address recall occurrence prediction or root-cause analysis separately, while joint modeling of recall severity and root-cause categories has received limited attention. We develop an automated recall triage framework using 54,165 FDA medical device recall records from openFDA, covering the period from 2002 to October 2025. We first evaluate classical machine learning and boosting-based models for recall severity and root-cause category prediction. We then develop RecallRisk-BERT, a multi-task model that combines PubMedBERT-based textual representations of recall narratives with embedding-based representations of structured categorical features, including product code, regulation number, and medical specialty. The model simultaneously predicts recall severity (Class I/II/III) and a consolidated root-cause category (9 classes). Performance was evaluated using accuracy, macro-averaged precision, recall, F1-score, and ROC-AUC. In single-task severity prediction, our LightGBM-based text–tabular configuration achieved the strongest performance, with an accuracy of 0.963, macro-F1 of 0.856, and ROC-AUC of 0.974. In the multi-task setting, RecallRisk-BERT substantially outperformed the single-task PubMedBERT baseline. Model-derived risk rankings were strongly consistent with observed root-cause severity patterns (rho = 0.983, p = 1.936e-6). These findings indicate that text–tabular learning can support scalable post-report recall triage, regulatory decision support, and model-based root-cause risk analysis.
[LG-15] Stochastic Gradient Optimization with Model-Assisted Sampling
Link: https://arxiv.org/abs/2606.27171
Authors: Jonne Pohjankukka,Jukka Heikkonen
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 24 pages, 11 figures, 4 tables
Abstract:This work addresses the problem of variance in stochastic gradient estimation for machine learning optimization. Deep learning relies on mini-batch methods such as stochastic gradient descent, which approximate full gradients but introduce noise, creating trade-offs between convergence stability, speed, and generalization. Existing methods, including variance reduction techniques (e.g., SVRG and SAG) and adaptive optimizers, aim to mitigate gradient noise but may introduce additional computational overhead. We propose a model-assisted sampling framework that interprets mini-batch gradients through survey sampling theory, treating the dataset as a fixed finite population and gradients as sample-based estimates. Our aim is to bridge machine learning optimization and survey sampling theory by combining their perspectives on sample-based estimation and variance reduction. By incorporating auxiliary gradient-prediction models, we construct more efficient gradient estimators, with uniform sampling arising as a special case when no auxiliary information is used. Our approach integrates easily with existing optimizers, improving efficiency without altering their dynamics. Empirical results on synthetic data and six benchmark datasets show performance gains in 71-86% of the experiments, particularly for medium-sized input spaces in our benchmarks. Notably, with momentum-based optimizers such as AdamW, the proposed estimator achieves clearly better generalization in roughly half the training epochs compared to the baseline estimator.
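A standard model-assisted estimator from survey sampling is the difference estimator, which corrects the population mean of auxiliary predictions by the sampled residuals. The sketch below shows that textbook form. Whether the paper uses exactly this estimator is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def model_assisted_gradient(sample_grads, sample_preds, all_preds):
    """Difference estimator: mean of predictions over the whole dataset
    plus the mean residual (true gradient minus prediction) on the mini-batch.

    sample_grads: (n, d) per-example gradients on the mini-batch
    sample_preds: (n, d) auxiliary-model predictions for the same examples
    all_preds:    (N, d) auxiliary-model predictions for every example
    """
    sample_grads = np.asarray(sample_grads, dtype=float)
    sample_preds = np.asarray(sample_preds, dtype=float)
    all_preds = np.asarray(all_preds, dtype=float)
    return all_preds.mean(axis=0) + (sample_grads - sample_preds).mean(axis=0)
```

With all predictions set to zero this reduces to the plain mini-batch mean, which matches the abstract's remark that uniform sampling is the special case without auxiliary information. When the predictions equal the true gradients, the estimate equals the full gradient regardless of which examples were sampled.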
[LG-16] DMuon: Efficient Distributed Muon Training with Near-Adam Overhead
Link: https://arxiv.org/abs/2606.27153
Authors: Vincent Chen,Starrick Liu,Regis Cheng,Dance Yang,Shalfun Li,Ryan Yu,Lucy Liang,Hang Su,Roy Gan,Hao Wang,Qian Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers such as Muon, whose updates couple entire weight matrices and require costly Newton-Schulz iterations. Vanilla Muon implementations incur more than 2x the cost of forward and backward passes. To close this gap, we present DMuon, an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module, with no framework-level modifications. Across both embodied foundation model and large language model (LLM) training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in our model training.
[LG-17] fTNN: a tensor neural network for fractional PDEs
Link: https://arxiv.org/abs/2606.27140
Authors: Qingkui Ma,Hehu Xie,Xiaobo Yin
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 11 figures, and 12 tables
Abstract:We develop the fTNN, a deterministic tensor neural network subspace method for problems involving the fractional Laplacian on bounded domains, taking the fractional Poisson equation and time-dependent fractional advection-diffusion equation as typical representatives. The work employs a geometry-adapted integration split featuring a spatially dependent near-field radius, which decomposes the fractional Laplacian into three contributions: a singular near field, a regular interior far field, and an analytical exterior far field. Then the singular radial integrals are treated by Gauss-Jacobi quadrature, the regular radial integrals by Gauss quadrature, and the angular variables by deterministic angular quadrature, yielding a fully deterministic integration framework of the fractional Laplacian operator. To accurately resolve low-regularity solutions and the associated loss functional, we construct boundary-singularity-aware trial functions enriched with explicit boundary features, and propose two strategies for automatically selecting the leading exponent and evaluating the loss function from the singularity structure induced by the fractional operator, or jointly by the fractional operator and the source term. For time-dependent fractional PDEs, we design a spatiotemporally separable neural network that factorizes the time-space residual into a sum of low-dimensional temporal and spatial integrals, and we integrate this representation with an alternating neural network subspace optimization strategy for efficient training. Numerical experiments show that the proposed framework attains high accuracy on the tested benchmarks and improves substantially over existing fPINN and Monte Carlo baselines, particularly for problems with strong boundary singularities and long-time simulations.
[LG-18] Kolmogorov Arnold networks (KAN) for aerodynamic prediction: a comparison with MLPs and GNNs
Link: https://arxiv.org/abs/2606.27126
Authors: Miguel Jaraiz,Fermin Gutierrez,Pablo Yeste,Miguel Sánchez-Domínguez,Eusebio Valero,Gonzalo Rubio,Lucas Lacasa
Subjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:Kolmogorov Arnold networks (KAN) have recently been introduced as a (deep) neural network architecture whose trainable parameters adapt the activation functions, instead of the coefficients of the affine transformations at the core of traditional architectures such as deep multilayer perceptrons (MLPs). This architecture builds on the Kolmogorov-Arnold theorem, which endows it with universal approximation properties. While the advent of KANs has been received with excitement, there is a current debate about the possible KAN supremacy over MLPs in classic fields such as symbolic regression, general-purpose machine learning, natural language processing, or computer vision. Here we assess the performance of KANs, and their nuanced comparison against MLPs and graph neural networks (GNNs), in the realm of fluid dynamics surrogate modelling. To that aim, we consider the task of predicting the surface pressure distribution over subsonic and transonic airfoils, a canonical task in aerodynamics. Our results show that KAN models perform well in predicting the full pressure coefficient distribution and are able to interpolate across Mach numbers and angles of attack. However, their performance is comparable (marginally inferior) to that of a suitably trained MLP, while the best performance is achieved by a GNN at the expense of requiring lengthier training. While the optimal KAN models typically have much lower complexity than MLPs and GNNs (hence resulting in faster training), we find that KANs suffer from training instabilities, and their performance is highly dependent on proper hyperparameter optimisation.
[LG-19] Cross-Head Attention Uplift Network with Inverse Propensity Score under Unobserved Confounding
Link: https://arxiv.org/abs/2606.27114
Authors: Haoran Zhang,Chuanpu Li,Yuxin Fu,Bin Tong,Guan Wang,Bo Zheng,Feng Zhou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Uplift modeling, crucial for estimating individual treatment effects (ITE), faces dual challenges: flexibly leveraging inter-group similarity to enhance discriminative power and debiasing under unobserved confounding scenarios. In this paper, we propose the Cross-Head Attention Uplift Network (CHAUN) and Robust Adversarial Inverse Propensity Score (RA-IPS) method to address these limitations. CHAUN employs shared feature embeddings and cross-head attention mechanisms to dynamically integrate treatment-specific and control-specific representations, enhancing inter-group correlation modeling. Theoretically, we prove that access to the true propensity scores ensures ITE identifiability even with unobserved confounders. For practical scenarios lacking true propensity scores, RA-IPS adversarially optimizes propensity weights within constrained uncertainty sets to mitigate bias from unobserved variables. Experiments on public datasets (CRITEO-UPLIFT, LAZADA) and a production e-commerce dataset demonstrate CHAUN’s superiority over state-of-the-art uplift models, achieving relative improvements of up to 25.6% in QINI scores. RA-IPS further enhances robustness, outperforming standard IPS by 5.4% under unobserved confounding. The results validate the effectiveness of our proposed methods in real-world causal inference tasks.
[LG-20] Transformer-Based Classification of Bacterial Raman Spectra with LOOCV
Link: https://arxiv.org/abs/2606.27096
Authors: Jamile Mohammad Jafari,Thomas Bocklitz
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transformer-based models have recently attracted increasing attention for Raman spectral classification. In this study, a transformer-based approach was systematically evaluated using a nested leave-one-replicate-out cross-validation framework and compared with conventional machine-learning pipelines combining PCA or ICA with LDA, SVM, and Random Forest classifiers. A bacterial Raman dataset comprising 5,417 single-cell spectra from six bacterial species and nine independent measurement replicates was used. The transformer consistently achieved the highest classification performance across independent test replicates and significantly outperformed all conventional approaches. Analysis of the learned latent feature space revealed improved class separation compared with PCA- and ICA-based representations. Furthermore, the transformer maintained superior performance when applied directly to raw Raman spectra without preprocessing, demonstrating robust behavior across measurement replicates. These findings highlight the potential of transformer-based models for robust Raman spectral classification and emphasize the importance of replicate-aware validation for realistic model evaluation.
[LG-21] Finding Stationary Points by Comparisons ICML2026
Link: https://arxiv.org/abs/2606.27082
Authors: Helin Wang,Chenyi Zhang,Xiwen Tao,Yexin Zhang,Tongyang Li
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Comments: 41 pages, 4 figures. To appear in the Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract:We study the problem of finding stationary points of non-convex functions when access to the objective is provided only through a comparison oracle that, given two points, outputs which has the larger function value. For a twice differentiable f\colon\mathbb{R}^n\to\mathbb{R} with Lipschitz gradient and Hessian, we develop an algorithm that visits an \epsilon-stationary point using \widetilde{O}(n^2/\epsilon^{1.5}) queries. Our approach uses a subroutine that estimates the normalized Hessian to accuracy \delta using \widetilde{O}(n^2\log(1/\delta)) queries. We further study this problem with a quantum comparison oracle model where queries can be made in superpositions, and develop the first quantum algorithm that finds an \epsilon-stationary point, which takes \widetilde{O}(n/\epsilon^{1.5}) queries.
[LG-22] Symplectic Neural Networks for learning Generalized Hamiltonians
Link: https://arxiv.org/abs/2606.27029
Authors: Harsh Choudhary,Vyacheslav Kungurtsev,Chandan Gupta,Melvin Leok,Georgios Korpas
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Hamiltonian Neural Networks (HNNs) integrate physical priors into neural models by learning a system’s Hamiltonian, improving generalization and sample efficiency. Identifying the system Hamiltonian from noisy observations of state variables is a challenging task. For simulations to faithfully reflect the long-term behavior of Hamiltonian systems, especially energy conservation, it is essential to use symplectic integrators, which preserve the system’s geometric structure. This fidelity comes at a cost: implicit symplectic integrators are more computationally intensive and make backpropagation through the ODE solver non-trivial. However, by leveraging the fact that symplectic discretizations of the adjoint system yield the same sensitivities as those obtained by backpropagation, we obtain an efficient method of training the neural network parameters. In our work, we explore this alternative method of HNN training under noisy observation of trajectories with our HNN model based on an implicit symplectic integrator. Computationally, a predictor-corrector based ODE solver and fixed-point iteration help to mitigate the cost of the implicit timestepping, resulting in more efficient generation of gradient updates. In experiments on a range of non-separable, chaotic systems, we showcase the numerical advantage in system identification and energy preservation, as well as the efficient computation and memory complexity of our method. We also observe that post-processing the learned Hamiltonian using backward error analysis yields a modified Hamiltonian that is a more accurate approximation of the true Hamiltonian without the need to use more accurate discretizations of the flow map.
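The combination of an implicit symplectic integrator with a predictor and fixed-point iteration can be illustrated with the implicit midpoint rule. This is a sketch under assumptions: the abstract does not name the integrator, an analytic gradient stands in for the learned Hamiltonian network, and all names are hypothetical.

```python
import numpy as np

def implicit_midpoint_step(z, grad_H, dt, iters=50, tol=1e-12):
    """One implicit-midpoint step for z = (q, p) with dq/dt = dH/dp, dp/dt = -dH/dq.

    grad_H(z) returns the gradient of H with respect to (q, p), in that order.
    """
    n = z.size // 2

    def field(x):
        g = grad_H(x)
        return np.concatenate([g[n:], -g[:n]])  # Hamiltonian vector field

    z_new = z + dt * field(z)  # explicit Euler predictor
    for _ in range(iters):     # fixed-point corrector for the implicit equation
        z_next = z + dt * field(0.5 * (z + z_new))
        converged = np.linalg.norm(z_next - z_new) < tol
        z_new = z_next
        if converged:
            break
    return z_new
```

For the harmonic oscillator H = (q^2 + p^2) / 2 the gradient is `z` itself, and repeated steps keep the energy constant up to the fixed-point tolerance, since the implicit midpoint rule preserves quadratic invariants.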
[LG-23] A Generalization Theory for JEPA-Based World Models
Link: https://arxiv.org/abs/2606.27014
Authors: Jingyi Cui,Qi Zhang,Hongwei Wen,Yisen Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Joint Embedding Predictive Architectures (JEPAs) have recently emerged as a promising paradigm for world modeling by learning predictive dynamics in a latent space rather than generating future observations at the input level. Despite their empirical success, the theoretical understanding of JEPA-based world models remains limited. In this paper, we develop the first generalization theory for JEPA-based world models. We formulate JEPA pretraining as a conditional spectral graph learning problem and show that the JEPA objective is equivalent to a low-rank factorization of an action-conditioned co-occurrence matrix. Building on this characterization, we establish a connection between JEPA pretraining error and downstream planning regret, leading to a finite-sample generalization bound for JEPA-based world models. Our analysis reveals an inherent trade-off between approximation and sample errors with respect to the latent dimension, providing theoretical insights into the advantages and limitations of latent predictive models compared with input-level predictive approaches.
[LG-24] Uncertainty quantification via conformal prediction in data assimilation
Link: https://arxiv.org/abs/2606.27001
Authors: Catherine George,Alireza Javanmardi,Tijana Janjić,Eyke Hüllermeier
Subjects: Machine Learning (cs.LG)
Comments: Submitted to Quarterly Journal of the Royal Meteorological Society
Abstract:Quantifying the evolution of uncertainty is critical to both probabilistic forecasting and data assimilation in numerical weather prediction. In this study, we investigate the applicability of conformal prediction (CP), a recent machine learning (ML) method, to quantify uncertainty in a controlled, idealized setting. We use the one-dimensional modified shallow water model, designed to mimic the convective process. CP provides a set of possible outcomes with a chosen confidence level. Here, we compare and evaluate the average empirical coverage, the average interval length, miss low, miss high, and average interval score loss (AISL) for three variants of CP, namely a) Standard CP, b) Normalized CP, and c) Conformalized Quantile Regression. We further compare these CP-based uncertainty estimates with traditional ensemble-based measures such as standard deviation intervals and ensemble spread. In addition, we investigate the integration of CP-derived uncertainty within the data assimilation cycle through CP perturbations. Our results highlight the strengths and limitations of each approach, providing insight into the effectiveness of CP to complement common ensemble-based uncertainty quantification in simplified atmospheric models.
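The first variant listed, standard CP, is commonly implemented as split conformal prediction with absolute-residual scores. The sketch below shows that generic recipe, not the paper's experimental setup. The function name is hypothetical, and the calibration set is assumed to be exchangeable with the new points.

```python
import math
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, new_preds, alpha):
    """Symmetric prediction intervals with target coverage 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration residual
    as the half-width; the interval is infinite if n is too small.
    """
    scores = np.sort(np.abs(np.asarray(cal_targets, dtype=float)
                            - np.asarray(cal_preds, dtype=float)))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    q = np.inf if k > n else scores[k - 1]
    new_preds = np.asarray(new_preds, dtype=float)
    return new_preds - q, new_preds + q
```

Normalized CP and Conformalized Quantile Regression change the score (residuals scaled by a difficulty estimate, or distances to predicted quantiles) but reuse the same calibration-quantile step.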
[LG-25] RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning
Link: https://arxiv.org/abs/2606.26997
Authors: Rongjian Chen,Jianmin Hu,Kejiang Ye,Minxian Xu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. However, existing synchronous on-policy GRPO (Group Relative Policy Optimization) RLVR systems finish an entire rollout before starting training, leaving the trainer GPU pool idle while rollout is still ongoing. Asynchronous RL pipelines overlap the two stages, but at the cost of training on stale data. To address these challenges, we propose RolloutPipe, a post-training framework for disaggregated RLVR systems, which turns the fixed-weight rollout into a complete-group pipeline where trainable groups move to the trainer while later groups are still being generated. RolloutPipe achieves this through two techniques: complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each trainable complete group to the trainer in FIFO order as soon as group materialization finishes, and FGD is an admission policy on the rollout node that first admits requests for the frontier groups needed to form the next training batch, so that trainer-ready groups arrive earlier and more steadily. The design starts training before the rollout completes while maintaining on-policy correctness. Evaluated on Qwen3-1.7B across four reasoning and science benchmarks and twelve rollout settings, RolloutPipe shortens the rollout-to-train-end time by 30.7%-42.3%, and lowers the trainer waiting ratio by 37%-76% compared to Slime, a state-of-the-art rollout and training system.
[LG-26] Asymptotically Optimal Learning for Parametric Prophet Inequalities
Link: https://arxiv.org/abs/2606.26893
Authors: Jung-hun Kim,Anna Grebennikova,Vianney Perchet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We study learning in prophet inequalities with i.i.d. rewards drawn from an exponential-type parametric family with an unknown parameter \theta, a class that includes exponential, Pareto, and bounded-support power-family distributions. We first characterize the optimal full-information asymptotic competitive ratio for this family. In the unbounded-support case, the limit is \left(\theta/(\theta-c_+)\right)^{c_+/\theta} / \Gamma(1-c_+/\theta), while in the bounded-support case, the limit is 1. We then propose a confidence-based dynamic-programming policy for online learning. By exploiting the explicit parametric structure, the policy achieves the same optimal asymptotic competitive ratio using only online observations, without external offline samples. We further derive distribution-specific convergence rates for canonical examples. Finally, numerical experiments on synthetic instances illustrate the performance of our algorithm.
[LG-27] Accelerated sampling using SamAdams variable timesteps and position-adaptive Langevin dynamics
Link: https://arxiv.org/abs/2606.26881
Authors: Benedict Leimkuhler,Peter A. Whalley
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computation (stat.CO)
Comments:
Abstract:We introduce an accelerated Langevin-based sampling method that is based on two complementary devices: SamAdams adaptive timestepping, which automatically shrinks the effective integration step in stiff regions of phase space using a relaxed stiffness monitor, and position-adaptive Langevin (PAL) dynamics, which concentrates friction along the local force direction while preserving the canonical distribution as the exact invariant measure. The resulting combined scheme (SA-PAL) is implemented in a palindromic integrator which requires only one force evaluation per iteration through suitable organisation of the integration steps and by exploiting the rank-one-plus-scalar structure of the PAL friction tensor. We test the method on various model problems: the Rosenbrock function, a thin entropic channel, the Mueller-Brown potential, and a Bayesian parameterisation problem with a sparsity-inducing shrinkage prior. On the Rosenbrock and Mueller-Brown potentials, mixing rates are improved by 1.5-3 times compared to fixed stepsize integration. Efficiency gains of more than an order of magnitude are documented in the other examples.
[LG-28] Quantization in Federated Learning: Methods, Challenges, and Future Directions
Link: https://arxiv.org/abs/2606.26822
Authors: Farwa Ikram, Dipanwita Thakur, Antonella Guzzo, Giancarlo Fortino
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) has become a foundational paradigm for privacy-preserving distributed intelligence, yet its scalability remains fundamentally constrained by communication bottlenecks, device heterogeneity, and the challenges of training under statistically non-IID data. Quantization is one of the most effective mechanisms for mitigating these limitations, reducing both uplink/downlink payloads and on-device computation. This paper provides the first FL-centric systematic review of quantization, introducing a novel taxonomy organized around FL-specific dimensions, including client heterogeneity, aggregation consistency, communication-scheduling adaptation, non-IID robustness, privacy/security integration, and hardware/energy co-optimization. Beyond cataloging existing methods, we analyze how quantization interacts with core FL behaviors such as client drift, partial participation, convergence stability, secure aggregation, and differential privacy. We further identify cross-method insights, open research gaps, and design guidelines for practitioners deploying quantized FL on mobile, IoT, and edge platforms. This survey thus establishes quantization not merely as a compression technique, but as a fundamental systems component shaping the performance, robustness, and practicality of modern FL.
[LG-29] Reasoning Quality Emerges Early: Data Curation for Reasoning Models ICML2026
Link: https://arxiv.org/abs/2606.26797
Authors: Hongyi Henry Jin, Wenhan Yang, Meysam Ghaffari, Carlos Morato, Baharan Mirzasoleiman
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026 (Poster)
Abstract:Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be identified using only the initial reasoning tokens. Specifically, we demonstrate that difficult problems can be reliably detected based on the loss of the first 100 reasoning tokens evaluated at a randomly perturbed checkpoint of the pretrained model. We further show that examples exhibiting similar loss patterns over their first 1k reasoning tokens across a small number of perturbed checkpoints extrapolating along the fine-tuning trajectory provably induce similar gradients. We validate our approach through extensive experiments on fine-tuning Qwen2.5-7B and Llama3.1-8B models on the M23K medical reasoning and OpenThoughts-Math datasets. Our method outperforms existing baselines by up to 1.7% while being 91% more token efficient.
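The difficulty signal described in this abstract can be illustrated with a small sketch that scores each example by the mean negative log-likelihood of its first k reasoning tokens and keeps the highest-loss ones. The function names, the per-token log-probabilities, and the toy values are hypothetical; in the paper the losses come from a randomly perturbed checkpoint of the pretrained model, which is not reproduced here.

```python
def prefix_loss(token_logprobs, k=100):
    """Mean negative log-likelihood over the first k tokens of one example."""
    prefix = token_logprobs[:k]
    if not prefix:
        raise ValueError("example has no reasoning tokens")
    return -sum(prefix) / len(prefix)


def select_hardest(examples, budget, k=100):
    """Return the ids of the `budget` examples with the highest prefix loss."""
    ranked = sorted(examples.items(),
                    key=lambda item: prefix_loss(item[1], k),
                    reverse=True)
    return [example_id for example_id, _ in ranked[:budget]]


# Toy per-token log-probabilities (hypothetical values, not from the paper).
toy = {
    "easy": [-0.1, -0.2, -0.1],
    "medium": [-0.9, -1.1, -1.0],
    "hard": [-2.5, -3.0, -2.0],
}
hardest = select_hardest(toy, budget=2, k=2)
```

Only the prefix of each trace is read, so the cost of scoring is independent of the full trace length. The diversity criterion over perturbed checkpoints mentioned in the abstract is not covered by this sketch.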
[LG-30] Escaping Iterative Parameter-Space Noise: Differentially Private Learning with a Hypernetwork
Link: https://arxiv.org/abs/2606.26772
Authors: Naoki Nishikawa, Shokichi Takakura, Satoshi Hasegawa
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Differentially private (DP) training of neural networks is often hindered by the large amount of noise required by gradient-based methods such as DP-SGD, which repeatedly inject high-dimensional noise in parameter space throughout training. In this paper, we propose a new framework for DP learning that avoids iterative optimization in parameter space. Instead of updating the target model using privatized gradients, we employ a hypernetwork trained on public datasets to map a private dataset to the parameters of the target model. Specifically, each example is embedded into a low-dimensional representation, the embeddings are aggregated and perturbed to obtain a DP dataset embedding, and the hypernetwork generates the target model parameters from this noisy embedding. Because privacy noise is injected only once into a low-dimensional dataset representation, our approach can significantly reduce the adverse effect of noise. We theoretically show in a synthetic setting that, under a fixed privacy budget, models produced by our approach achieve higher utility than those trained with DP-SGD. Moreover, we apply our approach to LoRA fine-tuning of diffusion models and show that it achieves lower FID than LoRA models trained with DP-SGD and other public-data-guided methods.
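The "embed, aggregate, perturb once" step in this abstract can be sketched with the classical Gaussian mechanism. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes add/remove-one adjacency (so the L2 sensitivity of the clipped sum equals the clipping norm) and the textbook noise calibration for 0 < epsilon < 1. The hypernetwork that would consume the noisy embedding is omitted, and all names and numbers are hypothetical.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def dp_dataset_embedding(embeddings, max_norm, epsilon, delta, rng):
    """Sum clipped per-example embeddings and add Gaussian noise exactly once.

    Assumption: add/remove-one adjacency, so the sum has L2 sensitivity
    max_norm; sigma follows the classical Gaussian-mechanism formula.
    """
    dim = len(embeddings[0])
    total = [0.0] * dim
    for embedding in embeddings:
        for i, value in enumerate(clip(embedding, max_norm)):
            total[i] += value
    sigma = max_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return noisy, sigma


rng = random.Random(0)
private_data = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]  # toy embeddings
noisy, sigma = dp_dataset_embedding(private_data, max_norm=1.0,
                                    epsilon=0.5, delta=1e-5, rng=rng)
```

Because noise is drawn once in the embedding dimension, its scale does not grow with the number of target-model parameters or training steps, which is the contrast with DP-SGD that the abstract draws.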
[LG-31] Batch-Invariant Spectral Intelligence for Robust and Explainable Insect Authentication
Link: https://arxiv.org/abs/2606.26757
Authors: Majharulislam Babor, Giacomo Rossi, Annalisa Altavilla, Oliver Schlüter, Marina M.-C. Höhne
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 6 figures, 5 tables (excluding supplementary materials), submitted to journal
Abstract:Edible insects offer an efficient source of alternative protein, requiring less land and water and emitting less greenhouse gas than conventional livestock. However, their successful integration into the food supply chain demands reliable species authentication to control allergen exposure, prevent adulteration, and meet regulatory standards. Near-infrared spectroscopy provides a rapid analytical tool, but its performance drops when applied to production batches unseen during training due to batch-to-batch variation in spectral measurements. We introduce the Batch-Invariant Spectral Network (BISN), an end-to-end framework that combines a learnable preprocessing module, initialised with Savitzky-Golay filtering, with an entropy-regularised adversarial objective to suppress batch-specific spectral variation. In contrast to Domain-Adversarial Neural Networks, which enforce domain adaptation only after feature extraction, BISN suppresses batch effects before species-specific features are learned. Using 2,700 spectra from three species (Acheta domesticus, Hermetia illucens, and Tenebrio molitor) collected across three independent production batches, BISN achieves a mean leave-one-batch-out accuracy of 0.93 (standard deviation 0.04), outperforming the strongest baseline by four percent. Further insights gained by using explainable AI confirm that model decisions consistently rely on the lipid and protein absorption regions across all folds, connecting predictive performance to known insect biochemistry. BISN addresses both cross-batch robustness and biochemical interpretability for automated insect species authentication under realistic industrial conditions. The source code and dataset are publicly available at this https URL.
[LG-32] DroidBreaker: Practical and Functional Problem-Space Attacks on Machine-Learning Android Malware Detectors
Link: https://arxiv.org/abs/2606.26707
Authors: Christian Scano, Diego Soi, Angelo Sotgiu, Luca Demetrio, Davide Maiorca, Giorgio Giacinto, Fabio Roli, Battista Biggio
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Adversarial APKs are Android applications modified in the problem space to evade machine-learning malware detectors. In this work, we first show that, despite claims, existing problem-space attacks remain largely impractical. Most techniques leverage software transplantation to inject entire benign modules, introducing many side-effect features and often causing build-time failures. Fine-grained methods that inject only a narrow subset of components exhibit limited effectiveness, while those that also use obfuscation rely on brittle bytecode rewriting, producing APKs that are syntactically valid but semantically unusable. Prior work further overestimates attack success rates by running smoke tests that only validate installation and basic execution, without assessing whether the modified APK still preserves its intended behavior. To overcome these limitations, we present DROIDBREAKER, a practical (build-safe) and functional (semantics-preserving) problem-space attack framework that provides: (i) query-efficient white- and black-box attacks by manipulating only the APK components most influential to the target model; (ii) a set of fine-grained, build-safe manipulations (including injection and obfuscation of API calls, app modules, permissions, and URLs) with minimal side effects; and (iii) a semantics-preserving functionality test that enforces runtime equivalence by comparing execution logs and API-level traces between the initial and the modified APK. Evaluated on a recent corpus of Android applications, DROIDBREAKER achieves high evasion rates with few queries and minimal side effects in both white-box and black-box settings, and drastically reduces detections by commercial malware scanners hosted on VirusTotal.
[LG-33] PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Link: https://arxiv.org/abs/2606.26666
Authors: Muhammad Ahmed
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 3 tables; workshop paper
Abstract:Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such as FlashInfer provide highly optimized native-paged decode attention. However, the best single-kernel implementation is not always the best serving schedule: low-active long-context decode can under-utilize commodity GPUs, while mixed sequence lengths introduce a tension between many exact-length launches and coarse padded batches. We present PersistentKV, a native block-table decode attention engine and page-aware scheduling study for grouped-query attention (GQA). PersistentKV maps work by KV-head group, is designed to reuse K,V tiles across grouped query heads, supports native page tables, and adds a compact workqueue schedule that executes only non-empty row-KV-head-sequence-split tasks. On an RTX 3060 with FP16, page size 16, Hq=32, Hkv=8, d=128, and identical correctness tolerance against FlashInfer, a calibrated adaptive policy selects FlashInfer for small active batches, PersistentKV sequence splitting for B1 long-context steps, and PersistentKV workqueue scheduling for B8 long-context steps. With thresholds and split counts fixed on calibration traces, one held-out trace seed improves synchronized wall throughput by 1.063-1.265x on B8 bimodal, uniform, and Zipf-like workloads and by 1.399x on a B1 bucketed trace. On the B4 bimodal boundary case, the policy avoids the PersistentKV regression by selecting FlashInfer. These results identify a concrete systems niche for adaptive page-aware decode scheduling and show that work assignment, not only attention math, is a decisive serving-system variable.
[LG-34] Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space ICML2026
Link: https://arxiv.org/abs/2606.26657
Authors: Mohammad Haddadnia, Yuvan Chali, Abhilash Jayaraj, Constance Kraay, Joana Reis, Felix Strieth-Kalthoff, Haribabu Arthanari
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:Identifying high-utility candidates from massive discrete spaces under expensive evaluations is a recurring challenge across the sciences, with structure-based drug discovery as a prominent example. While surrogate-based optimization can increase sample efficiency by reducing the number of expensive evaluations, modern molecular libraries have reached billions to trillions of compounds, making full-library surrogate inference itself a major computational bottleneck. We introduce BOBa, a bandit-guided surrogate optimization framework that eliminates full-library inference by adaptively allocating computation across partitions of the action space. By treating partitions as arms in a multi-armed bandit, BOBa concentrates inference and evaluations on empirically promising partitions while maintaining principled exploration. Experiments on real-world synthesis-on-demand libraries demonstrate that optimism-under-uncertainty bandits, combined with meaningful action space partitioning, are essential for effective allocation of inference and evaluations. Our findings reveal a tunable tradeoff between screening performance and surrogate inference cost, which supports practical optimization over current libraries, and establishes a viable route to ultra-large library virtual screening.
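The idea of treating library partitions as bandit arms can be sketched with plain UCB1. This is a deliberately simplified illustration, not BOBa itself: it evaluates candidates directly instead of routing them through a surrogate model, and the partition names, the reward function, and the exploration constant `c` are hypothetical.

```python
import math


def ucb_partition_search(partitions, evaluate, rounds, c=1.0):
    """UCB1 over partitions; each pull evaluates one unseen candidate."""
    pools = {name: list(items) for name, items in partitions.items()}
    counts = {name: 0 for name in pools}
    sums = {name: 0.0 for name in pools}
    best_candidate, best_reward = None, float("-inf")
    for t in range(1, rounds + 1):
        live = [name for name in pools if pools[name]]
        if not live:
            break  # every partition is exhausted
        untried = [name for name in live if counts[name] == 0]
        if untried:
            arm = untried[0]  # pull each arm once before using the bound
        else:
            arm = max(live, key=lambda n: sums[n] / counts[n]
                      + c * math.sqrt(2.0 * math.log(t) / counts[n]))
        candidate = pools[arm].pop()
        reward = evaluate(candidate)
        counts[arm] += 1
        sums[arm] += reward
        if reward > best_reward:
            best_candidate, best_reward = candidate, reward
    return best_candidate, best_reward, counts


# Toy library: two partitions with clearly different reward levels.
partitions = {"low": range(0, 20), "high": range(180, 200)}
best_candidate, best_reward, counts = ucb_partition_search(
    partitions, evaluate=lambda x: x / 200.0, rounds=20)
```

In this toy run most of the 20 evaluations go to the "high" partition while "low" still receives a few exploratory pulls, which mirrors the exploration and exploitation balance the abstract attributes to optimism-under-uncertainty bandits.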
[LG-35] Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling
Link: https://arxiv.org/abs/2606.26617
Authors: Ziyan Chen, Zhongzhu Zhou, Ding-Xuan Zhou
Subjects: Machine Learning (cs.LG)
Comments: 34 pages, 4 figures
Abstract:Scaling laws describe how learning performance varies with model size, data size, and compute. While recent theoretical work has established scaling laws for sketched linear regression, much less is understood for contrastive representation learning. In this paper, we study a sketched linear model for contrastive learning under a paired Gaussian latent-variable setup. The learner observes only sketched views of two correlated variables and trains a bilinear contrastive score by full-batch empirical gradient descent. We analyze a Gaussian-negative quadratic contrastive surrogate under aligned power-law spectra and a contrastive source condition, where we derive a risk decomposition into irreducible risk, approximation error, GD bias, GD variance, and a cross term. The cross term is controlled by the bias and variance and therefore does not affect the upper-bound scaling. Our main theorem gives an explicit scaling law with respect to sketch dimension $M$, sample size $N$, and effective optimization horizon $L_{\mathrm{eff}}\gamma$. Compared with standard linear-regression scaling laws, the contrastive setting must learn interactions between two views, and this changes how optimization and finite-sample noise scale with model size, data, and training time. This provides a first theoretical step toward understanding scaling behavior in contrastive learning and gives guidance for balancing model size, data, and optimization compute.
[LG-36] Latent Diffusion Posterior Sampling with Surrogate Likelihood Guidance for PDE Inverse Problems
Link: https://arxiv.org/abs/2606.26592
Authors: Yuanzhe Wang, Alexandre M. Tartakovsky
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:We propose latent-space diffusion posterior sampling (L-DPS), an approximate Bayesian framework for high-dimensional inverse problems governed by partial differential equations (PDEs). The method addresses three challenges in PDE-constrained inversion: implicit sample-based priors without tractable densities, high-dimensional spatially distributed parameters, and the high cost of repeated forward-model evaluations during posterior sampling. L-DPS combines a variational autoencoder, an unconditional latent diffusion model, diffusion posterior sampling, and a differentiable neural surrogate. The VAE maps the parameter field to a lower-dimensional latent space, the diffusion model learns an implicit prior score in this latent space, and DPS combines this learned prior with likelihood-based guidance. The likelihood gradient is evaluated through the decoder-surrogate composition, avoiding repeated calls to the full numerical PDE solver. We evaluate the method on an inverse Darcy flow problem with an unknown spatially distributed permeability field inferred from sparse and noisy pressure observations. L-DPS produces accurate and robust inverse solutions, reduces inference cost relative to full-space DPS, and outperforms amortized inverse baselines such as conditional latent diffusion and inverse FNO in sparse and noisy regimes. We further compare L-DPS with a KLE-MAP baseline and study mixed-prior generalization and the sensitivity of inversion accuracy to surrogate forward-model error.
[LG-37] Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform
Link: https://arxiv.org/abs/2606.26590
Authors: Manar Alsaid, Chimdumebi Nebolisa, Faris Abbas
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 34 pages, 12 figures, 14 tables. Journal-first manuscript submitted to Empirical Software Engineering. Primary classification: cs.SE; cross-list: cs.CR
Abstract:Security misconfigurations in Terraform Infrastructure-as-Code are a growing risk in cloud deployments, and large language models are increasingly used as automated repair agents. Existing evaluations often treat a repair as successful when the targeted static-analysis finding disappears, without checking planning validity, behavioral change, or security intent. This paper presents TerraProbe, a five-layer oracle framework for evaluating LLM-assisted Terraform security repair. We apply TerraProbe to 288 first-pass repairs generated by gemini-2.5-flash-lite, GPT-4o, and Claude 3.5 Sonnet across 68 real-world TerraDS modules and 28 controlled injected-defect modules. The results show that targeted Checkov removal overstates repair success. Although targeted removal reaches 83.3 percent for the primary model, full-scanner cleanliness drops to 10.4 percent, Terraform planning succeeds for 39.6 percent, and plan comparison is reachable for 38.5 percent. Human adjudication further shows that 71.4 percent of plan-compared real-world repairs are deceptive fixes that pass automated checks while leaving the underlying vulnerability in place. This pattern is statistically indistinguishable across the three models, with deceptive-fix rates from 57.1 percent to 71.4 percent and pairwise Fisher's exact test p-values above 0.10. The paper introduces a four-dimensional taxonomy of deceptive fixes, validated with Cohen's kappa of 0.78 and Krippendorff's alpha of 0.76. IAM permission analysis confirms that wildcard Resource grants persist in all nine CKV2_AWS_11 deceptive-fix cases. TerraProbe contributes an evaluation methodology, a replication package, and the Multi-Layer Oracle Evaluation framework for distinguishing intent-aligned security repairs from scanner-passing false successes.
[LG-38] Revisiting Action Factorization for Complex Action Spaces
Link: https://arxiv.org/abs/2606.26574
Authors: Timothy Flavin, Sandip Sen
Subjects: Machine Learning (cs.LG)
Comments: 53 Pages, 37 Figures, 6 Tables, Target Journal/Venue: ACM Transactions on Autonomous and Adaptive Systems TAAS
Abstract:Many real-world control problems involve hybrid discrete-continuous action spaces, for example steering and signaling in autonomous driving, or aiming and firing in robotics and video games. Despite real-world hybrid factorization and reinforcement learning framework support for complex action spaces (e.g., Gymnasium, PettingZoo, TorchRL, SeedRL, MuJoCo), the default environments within those frameworks often implement uniform action space configurations (LunarLander, Walker2D, Cheetah, SMAC, SUMO, Ant, Atari). Landmark hybrid-action benchmarks (RoboCup 2D HFO, SC2LE, Platform, CARLA, etc.) are mostly heavyweight or archival implementations originating from papers which test one or a small number of competing factorization methods on one kind of control. This article provides a cross-sectional study of factorization methods [independent networks, shared encoder, VDN, QPLEX, Joint, Auto-Regressive] on each of three families of algorithms [PPO, SAC, DQN] across three action spaces [discretized, hybrid, continuous] over four lightweight environments [Platform, hybrid-LunarLander, Hybrid-Shoot, CoopPush]. Accounting for some invalid pairings such as joint-continuous, we are left with 220 configurations to analyze each method. We provide two new C++ parallel Gymnasium- and PettingZoo-compliant environments [CoopPush, Hybrid-Shoot] to isolate particular challenges such as state-dependent inter-action dependence. Finally, we introduce VDN-PPO and PPO-MIX, which use a branching critic to assign credit to multi-headed PPO. These variants outperform all other tested PPO factorizations. Our results suggest that branching dueling architectures balance compute and performance most effectively, with Auto-Regressive actions reaching the highest performance overall and native continuous SAC outperforming discrete and hybrid algorithms, albeit both at increased computational cost.
[LG-39] Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication
Link: https://arxiv.org/abs/2606.26541
Authors: Jerome Marston, Tino Kreutzer, Salomé Garnier, Ella Boone, Phuong N Pham, Patrick Vinck
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 34 pages, 4 tables, 3 Annexes
Abstract:Data from affected populations are crucial for informing humanitarian response, but their value depends on timely and consistent interpretation of nuanced accounts of need. Humanitarian organizations often lack the staff, time, and specialist expertise required to analyze this information at scale. Large language models (LLMs) may expand this capacity, but their reliability for coding qualitative humanitarian data has not been directly established. This benchmark study compares 46 LLMs to a human Gold Standard using 150 high-fidelity synthetic humanitarian transcripts. Evaluation combined inter-rater reliability testing with Krippendorff’s alpha, discrepancy analysis distinguishing correct, near-correct, and incorrect codes, and qualitative assessment across humanitarian-specific criteria including discrimination, complex needs hierarchies, and non-standard communication styles. The authors find that multiple LLMs can perform deductive coding at reliability levels comparable to experienced human coders, especially when structured prompts and reasoning-enabled configurations are used. At the same time, aggregate reliability metrics alone are insufficient for deployment decisions. Models varied in recognizing needs expressed indirectly, needs outside predefined categories, and protection-relevant concerns such as physical safety and discrimination. These findings suggest that LLMs can materially expand humanitarian analytical capacity, but not as substitutes for human judgment. Appropriate use requires structured codebooks, reasoning-enabled models, attention to theme-specific performance, and tiered oversight focused on categories where miscoding would have the greatest programmatic consequences. For sensitive humanitarian data, open-weights models deployed on self-hosted infrastructure may offer a viable path for combining analytical scalability with stronger data governance.
[LG-40] Sample-efficient Transfer Reinforcement Learning via Adaptive Reward Shaping and Policy-Ratio Reweighting Strategy
Link: https://arxiv.org/abs/2606.26527
Authors: Wenjie Huang, Yang Li, Jingjia Teng, Mingwei Jin, Kai Song, Yougang Bian, Yongfu Li, Qisong Yang, Helai Huang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transfer learning improves policy learning efficiency by reusing knowledge from source tasks, providing a feasible paradigm for safe and efficient autonomous highway lane changing decision-making. Existing methods frequently encounter transfer mismatch induced by distribution shifts between source and target domains, leading to training oscillation and performance decline. Besides, target domain adaptation depends on exploratory interactions, which makes it hard to guarantee training safety in safety-critical lane changing cases. To tackle these limitations, this paper proposes a safe transfer reinforcement learning framework for autonomous highway lane changing. First, we design an adaptive teacher intervention mechanism based on instantaneous safety cost to restrain risky exploration and fade intervention strength progressively, with theoretical analysis on return bounds for the mixed behavior policy. This intervention also produces dual-source samples for joint training. Second, a teacher-guided safe transfer module embeds action evaluation information of the teacher policy into student learning via reward shaping to boost training safety and efficiency, with teacher guidance decaying as policy safety rises. Third, a teacher-guided weighted optimization mechanism adjusts sample weights in policy optimization using a likelihood ratio factor to stabilize transfer performance. Experiments under varied traffic densities and validation on the real-world NGSIM dataset reveal that our method surpasses baseline approaches by over 52.2% in safety and 5.0% in learning efficiency. Results verify the efficacy and robustness of our safety-aware transfer strategy for autonomous highway lane changing under various traffic conditions.
[LG-41] Theory-Scale Auto-Formalization of Logics for Computer Science
Link: https://arxiv.org/abs/2606.26525
Authors: Yuming Feng, Frederick Pu, One An, Osbert Bastani, Li Zhang, Jiani Huang, Xujie Si, Ziyang Li
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Comments:
Abstract:Auto-formalization is critical for scalable formal verification, but existing progress largely focuses on isolated statements, while theory-scale auto-formalization, which coherently translates hundreds of interdependent definitions, lemmas, and theorems, remains open due to challenges in consistency, faithfulness, scalability, and correctness. In this paper, we introduce LCS-Bench, a stand-alone, theory-scale benchmark based on Logics for Computer Science. LCS-Bench is built through a novel semi-automated agentic pipeline that leverages concept graphs, formal signature planning, issue tracking, sorry-filling with counter-example search, complemented by faithfulness review from human experts. The resulting artifact covers 327 textbook items, over 4,076 Lean declarations, and more than 85K lines of Lean code. The dataset supports broad evaluation through a data engine that automatically derives five tracks of evaluation benchmarks, measuring different aspects of auto-formalization and theorem-proving capabilities. We also introduce a novel evaluation protocol featuring definitional equivalence checkers, enabling more fine-grained and faithful assessment. Through extensive evaluation on 14 models, we demonstrate that (1) LCS-Bench is of high quality, consistent, and faithful; (2) the benchmark is challenging, with state-of-the-art models achieving only 20.1% on auto-formalization tasks; and (3) our analysis reveals key findings regarding theory-scale auto-formalization and suggests promising directions for future work.
[LG-42] Learning Probabilistic Filters with Strictly Proper Scoring Rules
Link: https://arxiv.org/abs/2606.26497
Authors: Eviatar Bach, Ricardo Baptista, Jochen Bröcker, Bohan Chen, Andrew Stuart
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Comments: 87 pages, 17 figures
Abstract:Bayesian filtering of partially and noisily observed dynamical systems seeks to infer the evolving conditional distribution of the state of a dynamical system, given observations, in an online fashion. This Bayesian filtering distribution is the natural object for uncertainty quantification, but it is rarely available as a supervised learning target. However, one can often use the forecast model to generate synthetic system trajectories, along with synthetic observations. We introduce the proper scoring ensemble filter (PSEF), an ensemble data assimilation method based on training an analysis map to approximate the filtering distribution using only synthetic state–observation trajectories. The analysis step is represented as a permutation-invariant, transformer-based map that takes as input a forecast ensemble and observations, producing an analysis ensemble. Training is based on strictly proper scoring rules – with the energy score used in our implementation – so that probabilistic accuracy is rewarded over the whole probability distribution. We prove that, under a realizability assumption, the population objective is minimized by the true Bayesian filtering distribution. We also derive the finite-ensemble empirical objective used in training and relate its single state–observation trajectory form to the population objective, using a mean-field consistency argument. Numerical experiments show that the learned filter accurately approximates challenging filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, and achieves stronger performance in data assimilation tasks than classical methods or learning-based methods with mean-squared-error objectives. For close-to-Gaussian problems, learning a correction to the EnKF is the best approach, while for highly non-Gaussian problems an end-to-end approach that discards this inductive bias is superior.
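The energy score named in this abstract has a standard sample estimate for an ensemble x_1, ..., x_m and a reference state y: the mean distance to y minus half the mean pairwise distance within the ensemble. The sketch below computes that estimate in plain Python; it is a generic illustration of the scoring rule, not the PSEF training objective, and the toy ensembles are hypothetical. The score is negatively oriented, so lower is better.

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score(ensemble, truth):
    """Sample energy score: mean ||x_i - y|| - (1 / (2 m^2)) * sum_ij ||x_i - x_j||."""
    m = len(ensemble)
    accuracy = sum(l2(x, truth) for x in ensemble) / m
    spread = sum(l2(xi, xj) for xi in ensemble for xj in ensemble) / (2.0 * m * m)
    return accuracy - spread


truth = [0.0, 0.0]
centered = [[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in centered]  # same spread, biased mean

score_centered = energy_score(centered, truth)
score_shifted = energy_score(shifted, truth)
```

The two toy ensembles have identical spread, so the biased one is penalised only through the first term. The second term is what rewards an ensemble for representing uncertainty rather than collapsing onto a point estimate.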
[LG-43] What Survives When You Compress a Recursive Reasoner for the Edge?
Link: https://arxiv.org/abs/2606.26488
Authors: Pearse Jim, Steven Kolawole, Opegbemi Matthias Busoye, Glory Bagai, Virginia Smith
Subjects: Machine Learning (cs.LG)
Comments: Preprint; in review
Abstract:Recursive reasoning models can solve complex structured tasks with only a few million parameters by repeatedly updating a latent state. Deploying these models on edge hardware requires significant compression, but unlike conventional sequence models, quantization errors compound across recursive reasoning cycles rather than across output tokens. As a result, standard intuitions about compression fail to apply. In this work, we ask what survives when recursive reasoners are compressed. Across a full precision sweep, three tasks, and two recursive architectures, we find that aggressive compression preserves local prediction but destroys global reasoning: cell accuracy holds while puzzle-exact accuracy collapses to zero under naive INT4 pruning, distillation, and linear attention alike. Token-level objectives, including quantization-aware training, cannot repair it. The collapse is architectural – it strikes MLP-mixing recursion but not attention on the same task – and we reverse it with per-channel calibrated INT4 without retraining. We also introduce carry-trajectory fidelity, the cosine similarity to the full-precision reasoning path, as a label-free signal that predicts this damage and its recovery before a task evaluation. The combined result is a deployment recipe: flash-streamed embeddings remove a 99.4MB bottleneck, INT8 at one cycle matches full-depth accuracy at 6x fewer FLOPs (8MB SoC), and calibrated INT4 fits a 4MB microcontroller.
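The abstract describes carry-trajectory fidelity as the cosine similarity to the full-precision reasoning path. A minimal sketch of such a label-free signal is given below, assuming the latent state is available as a vector after every recursion cycle and that the per-cycle similarities are averaged; the averaging choice, the function names, and the toy trajectories are assumptions rather than details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def trajectory_fidelity(reference_states, compressed_states):
    """Mean per-cycle cosine similarity between two latent-state trajectories."""
    if len(reference_states) != len(compressed_states):
        raise ValueError("trajectories must cover the same number of cycles")
    sims = [cosine(r, c) for r, c in zip(reference_states, compressed_states)]
    return sum(sims) / len(sims)


# Toy latent states over three recursion cycles (hypothetical values).
full_precision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
drifted = [[1.0, 0.0], [1.0, 0.0], [1.0, -1.0]]

fidelity_same = trajectory_fidelity(full_precision, full_precision)
fidelity_drifted = trajectory_fidelity(full_precision, drifted)
```

No task labels are needed: the compressed model is compared only against the full-precision model's own states, which is why the signal can be computed before running a task evaluation.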
[LG-44] When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence INTERSPEECH2026
Link: https://arxiv.org/abs/2606.26473
Authors: Jaden Moon, Arvind Pillai, Andrew Campbell
Subjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2026. 5 pages, 1 figure, 5 tables
Abstract:Many multimodal systems estimate the reliability of each modality and weight their contributions to the final prediction. However, it remains unclear whether these scores influence model decisions or merely correlate with performance. We propose a simple diagnostic to test whether reliability information is used during inference. After training, the model and inputs are fixed while reliability scores are permuted across test examples. If predictions depend on these scores, performance should degrade. Experiments on StressID for stress recognition and CMU-MOSEI for sentiment analysis show that permuting reliability scores leaves performance unchanged despite substantial potential gains from selecting the best modality per example. In positive controls where reliability signals identify the correct modality, the same frozen fusion rules yield significant improvements, indicating that reliability signals influence fused decisions only when they reliably predict unimodal correctness.
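The permutation diagnostic in this abstract can be sketched in a few lines: freeze a fusion rule, shuffle the reliability scores across test examples, and measure how much accuracy drops. The fusion rule, the data layout, and the toy numbers below are hypothetical stand-ins and are not the models or datasets used in the paper.

```python
import random


def fuse(score_a, score_b, rel_a, rel_b):
    """Reliability-weighted late fusion of two unimodal positive-class scores."""
    w = rel_a / (rel_a + rel_b)
    return 1 if w * score_a + (1.0 - w) * score_b >= 0.5 else 0


def accuracy(examples, reliabilities):
    """Accuracy of the frozen fusion rule under the given reliability scores."""
    hits = sum(fuse(ex["a"], ex["b"], ra, rb) == ex["y"]
               for ex, (ra, rb) in zip(examples, reliabilities))
    return hits / len(examples)


def permutation_gap(examples, reliabilities, rng, repeats=200):
    """Matched accuracy minus mean accuracy with reliabilities permuted."""
    base = accuracy(examples, reliabilities)
    permuted = []
    for _ in range(repeats):
        shuffled = list(reliabilities)
        rng.shuffle(shuffled)
        permuted.append(accuracy(examples, shuffled))
    return base - sum(permuted) / len(permuted)


# Toy test set: modality "a" is right in the first half, "b" in the second.
examples = ([{"a": 0.9, "b": 0.2, "y": 1}] * 10
            + [{"a": 0.8, "b": 0.1, "y": 0}] * 10)
informative = [(0.9, 0.1)] * 10 + [(0.1, 0.9)] * 10  # positive control
flat = [(0.5, 0.5)] * 20                             # carries no information

gap_informative = permutation_gap(examples, informative, random.Random(0))
gap_flat = permutation_gap(examples, flat, random.Random(0))
```

A gap near zero means the predictions do not depend on which example each reliability score belongs to. The positive control shows that the same frozen rule does produce a large gap once the scores actually identify the correct modality.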
[LG-45] A Causal Foundation Model for Structure and Outcome Prediction ICML
Link: https://arxiv.org/abs/2606.26467
Authors: Max Zhu, Martino Mansoldo, Ching-Hao Wang, Stefan Groha
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 7 figures, 17 tables, 43rd ICML Workshop on Foundation Models for Structured Data
Abstract:We introduce TabPFN-CFM, a causal foundation model that can handle multiple causal problems. TabPFN-CFM predicts both causal structure and outcomes from observational data, supports queries on all three levels of Pearl’s Causal Hierarchy and uses known graph structure when available to improve predictions. TabPFN-CFM is trained on synthetic datasets, and generalises to real datasets, demonstrating improved performance over both structural and outcome prediction baselines.
[LG-46] Finding the Time to Think: Learning Planning Budgets in Real-Time RL
Link: https://arxiv.org/abs/2606.26463
Authors: Aneesh Muppidi, Firas Darwish, Dylan Cope, João F. Henriques, Jakob Nicolaus Foerster
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deliberating takes time. In real-time settings, that time is not free. Standard reinforcement learning (RL) sidesteps this as the environment waits indefinitely for the agent’s decision. Instead, we study real-time RL environments where the environment progresses while waiting for the agent’s action. Building on prior real-time formalizations, we introduce variable-delay real-time RL, where the agent chooses how long to deliberate at each decision point since the environment progresses. For the planning agents we use, the right delay is state-dependent, and naively planning how long to plan can paralyze the agent. We instead approach this setting by training a lightweight gating policy on top of a planner to select state-dependent planning budgets. Across real-time Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, our gating policy outperforms fixed-budget and heuristic baselines, and transfers to a real-time setup where the environment and agent run on two different GPUs.
[LG-47] Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
Link: https://arxiv.org/abs/2606.26453
Authors: Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang, Vihang Patil, Haoyang Fang, Bernie Wang, Huzefa Rangwala, George Karypis
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and pluggable bottleneck detection tools. KernelPro introduces four contributions: (1) a semantic feedback operator that encodes expert heuristics as pluggable micro-profiling tools, transforming raw hardware metrics into actionable natural language guidance; (2) a two-stage tool invocation architecture where roofline-based bottleneck classification filters which specialized analysis tools execute, combining kernel-level (ncu), instruction-level (SASS), and system-level (nsys) profiling; (3) a domain-adapted MCTS with progressive widening, asymmetric branching, log-reward calibration, dead-end pruning, and search memory for cross-iteration learning; and (4) direct CuTe source-level code generation via autonomous code search over the CUTLASS/CuTe codebase. On KernelBench, KernelPro achieves geometric mean speedups of 2.42x/4.69x/5.30x on Levels 1/2/3, establishing state-of-the-art performance across all difficulty levels. On VeOmni’s expert-optimized MoE training kernels, KernelPro achieves 1.23x over hand-tuned Triton by generating a from-scratch raw-CUDA+CuTe Hopper WGMMA kernel. Ablation studies demonstrate that each design component independently and significantly improves optimization quality: micro-profiling tools (p < 0.0001 vs raw metrics), MCTS search (26% higher geometric mean vs greedy, p = 0.004), and proactive tool orchestration (23% improvement, p = 0.035). Finally, KernelPro is the first CUDA kernel coding agent to optimize energy efficiency beyond the speed-only focus of prior systems, demonstrating an 11.6% measured energy reduction at matched speed.
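The two-stage idea in contribution (2), where a roofline classification decides which analysis tools run, can be illustrated with a short sketch. It uses the standard roofline rule (arithmetic intensity compared against the ridge point); the tool names in `TOOLS` and the function names are hypothetical placeholders, not KernelPro's actual tool set or API.

```python
def classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Standard roofline test: compare arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved        # FLOPs per byte for this kernel
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"

# Hypothetical tool names, chosen only to illustrate the filtering step.
TOOLS = {
    "memory-bound": ["coalescing_check", "shared_memory_reuse_check"],
    "compute-bound": ["instruction_mix_check", "occupancy_check"],
}

def select_tools(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Stage 1 classifies the kernel; stage 2 runs only the matching tools."""
    return TOOLS[classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth)]
```

Filtering by class first keeps the expensive instruction-level and system-level analyses from running on kernels where they would not explain the bottleneck.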
[LG-48] Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation INTERSPEECH2026
Link: https://arxiv.org/abs/2606.26451
Authors: Neelam Saini, Sourav Ghosh
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted at Interspeech 2026. Supplementary material: this https URL (backup mirror: this https URL)
Abstract:Automatic singing quality assessment (SQA) requires evaluating lyrical correctness and musical fidelity while handling expressive variations. However, existing systems largely rely on either acoustic cues or lyric transcriptions exclusively, limiting holistic performance evaluation. Furthermore, their integration is non-trivial due to challenges in robust singing transcription amid melisma, vibrato, and tempo elasticity. To this end, we propose MusicJudge, a modality-guided framework for automated SQA that performs block-aligned multimodal analysis by coupling lyric correctness with pitch-rhythm fidelity. It detects semantically meaningful lyric blocks using multi-signal matching that integrates semantic embeddings, lexical similarity, and phonetic alignment. To improve singing audio transcription, we introduce Modality-Guided LoRA for ASR fine-tuning. Experiments across datasets demonstrate strong agreement with human expert judgments and validate the generalizability of MusicJudge.
[LG-49] Embedding Foundation Model Predictions in Discrete-Choice Models with Structural Guarantees ICML2026
Link: https://arxiv.org/abs/2606.26432
Authors: Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, Zexin Zhuang
Categories: Machine Learning (cs.LG); Econometrics (econ.EM)
Comments: Extends arXiv:2605.26559 (ICML 2026 FMSD Workshop)
Abstract:Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price can increase predicted demand, implied willingness-to-pay estimates are frequently negative or implausible, and unavailable alternatives receive nonzero probability. We propose a two-stage adapter that takes a foundation model’s predicted choice probabilities as a precomputed feature and embeds them inside a multinomial logit’s utility. In Stage 1, we fit the multinomial logit’s structural coefficients by maximum likelihood with sign constraints; in Stage 2, we freeze those coefficients and fit a small neural correction operating on the foundation model’s predictions. We prove that this composition exactly preserves the multinomial logit’s marginal rate of substitution, so analytically computable value-of-time becomes a mathematical guarantee rather than an empirical accident. Across three datasets and two foundation models, the adapter gains 6.4 percentage points (pp) of test accuracy on average over the multinomial logit and up to 12.8 pp, maintains 100% cost monotonicity, and produces values of time within the published transportation-economics range on the transportation datasets. Performance degrades gracefully under foundation-model context restriction, retaining at least 6 pp of accuracy gain even at 10% of the original foundation-model context.
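One way to see why freezing the structural coefficients can protect the marginal rate of substitution is the sketch below. The additive utility form and the symbols $\beta_c$, $\beta_t$, $g_\theta$, $\hat{p}_j$, and $\mathcal{A}$ (the set of available alternatives) are illustrative assumptions rather than the paper's notation. The foundation-model probability $\hat{p}_j$ is treated as a precomputed constant, as the abstract describes.

```latex
U_j = \beta_c\,\mathrm{cost}_j + \beta_t\,\mathrm{time}_j + g_\theta(\hat{p}_j),
\qquad
P(j) = \frac{\exp(U_j)}{\sum_{k \in \mathcal{A}} \exp(U_k)},
\qquad
\mathrm{VOT} = \frac{\partial U_j / \partial\,\mathrm{time}_j}{\partial U_j / \partial\,\mathrm{cost}_j} = \frac{\beta_t}{\beta_c}.
```

Under these assumptions the correction $g_\theta$ shifts utility levels but has zero derivative with respect to cost and time, so the value of time depends only on the Stage 1 coefficients. Restricting the softmax to $\mathcal{A}$ gives unavailable alternatives zero probability.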
[LG-50] Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting
Link: https://arxiv.org/abs/2606.26421
Authors: Cristiana Diaconu, Jonas Scholz, Aliaksandra Shysheya, Stratis Markou, Payel Mukhopadhyay, Miles Cranmer, Richard E. Turner
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:State-of-the-art medium-range AI weather models can outperform traditional Numerical Weather Prediction (NWP) but require massive training budgets. This restricts usage for under-resourced groups and severely limits fast model iteration. Here we develop Otter Weather, a highly efficient spatiotemporal forecasting model designed to democratise high-performance weather prediction with AI. Evaluated on ERA5 reanalysis data at 1.5° resolution using standard WeatherBench protocols, the Otter family significantly advances the skill-compute Pareto frontier. The deterministic version outperforms the best NWP baseline by 9.6% at a 24-hour lead time while requiring fewer than 3.5 A100-days for training. It provides a 2x efficiency gain over lightweight AI models and a 100-fold reduction in compute compared to resource-intensive frontier architectures. We extend these efficiency gains into probabilistic forecasting by training via the Continuous Ranked Probability Score (CRPS). Scaling to a larger architecture, Otter-XL achieves a 9.7% CRPS improvement over the IFS ENS baseline. This yields an almost two-fold increase in predictive skill over comparable lightweight models at similar compute budgets. Otter-XL also outperforms frontier architectures like GenCast by over 2%, while using an order of magnitude less compute. Finally, Otter is applied out-of-the-box to a complex acoustic scattering PDE task where it outperforms a state-of-the-art foundation modelling approach, suggesting that the advances made here might apply across a range of scientific domains.
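For readers unfamiliar with CRPS, the score for a finite ensemble can be estimated as below. This is the standard ensemble estimator for a single scalar forecast, shown as a sketch; the abstract does not say which CRPS variant Otter trains on, so treating this form as the training loss is an assumption.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    members: 1-D array of ensemble forecasts for one scalar quantity.
    obs: the observed value y. Lower is better.
    """
    members = np.asarray(members, dtype=float)
    skill = np.abs(members - obs).mean()
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return float(skill - 0.5 * spread)
```

With a single member the spread term vanishes and the score reduces to absolute error, which is why CRPS is often read as a probabilistic generalisation of MAE.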
[LG-51] At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization
Link: https://arxiv.org/abs/2606.26396
Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Pre-trained transformers have demonstrated remarkable generalization abilities, at times extending beyond the scope of their training data. Yet, real-world deployments often face unexpected or adversarial data that diverges from training data distributions. Without explicit mechanisms for handling such shifts, model reliability and safety degrade, urging more disciplined study of out-of-distribution (OOD) settings for transformers. By systematic experiments, we present a mechanistic framework for delineating the precise contours of transformer model robustness. We find that OOD inputs, including subtle typos and jailbreak prompts, drive language models to operate on an increased number of fallacious concepts in their internals. We leverage this device to quantify and understand the degree of distributional shift in prompts, enabling a mechanistically grounded fine-tuning strategy to robustify LLMs. Expanding the very notion of OOD from input data to a model’s private computational processes, a new transformer diagnostic at inference time is a critical step toward making AI systems safe for deployment across science, business, and government.
[LG-52] Does Aurora Encode Atmospheric Structure? Latent Regime Analysis and Attribution ICLR2026
Link: https://arxiv.org/abs/2606.26361
Authors: Emma Kasteleyn, Ana Lucic
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Accepted at the FM4Science and Sci4DL workshops at ICLR 2026
Abstract:ML foundation models are able to emulate atmospheric dynamics accurately and efficiently but operate as opaque "black boxes". We investigate the internal representations of the Aurora model using spatially pooled PCA and layer-wise relevance propagation (LRP). We find evidence that Aurora’s latent space is primarily organized by seasonal cycles, whereas extreme storm events do not form a linearly separable cluster. LRP indicates that the model attends to features consistent with the 3D vertical structure of the Great Storm of 1987. Perturbation tests show that masking relevant regions degrades forecasts $3.31\times$ more than random masking. These findings suggest that Aurora learns meteorological coherence and vertical structure without explicit instruction.
[LG-53] EMA-FS: Accelerating GBDT Training via Gain-Informed Feature Screening
Link: https://arxiv.org/abs/2606.26337
Authors: Yan Song
Categories: Machine Learning (cs.LG)
Comments: 19 pages
Abstract:Gradient Boosted Decision Trees (GBDT), exemplified by LightGBM, spend a dominant fraction of training time – typically 65-70% – constructing per-feature histograms. Existing approaches such as random feature subsampling (feature_fraction) discard features without regard for their predictive utility. We propose EMA-based Feature Screening (EMA-FS), an algorithm-level optimization that maintains an exponential moving average (EMA) of per-feature split gains across boosting iterations and, after a short warmup, restricts histogram construction to the top-K features ranked by historical gain. Unlike random subsampling, EMA-FS is informed: it retains high-gain features while screening out low-gain ones. Operating at the per-tree level, it preserves full compatibility with LightGBM’s histogram subtraction trick, requiring no changes to core routines. We evaluate EMA-FS on datasets spanning financial fraud detection, advertising click-through prediction, industrial quality control, and synthetic benchmarks, with feature dimensionalities from 29 to 968. On dense, moderate-to-high-dimensional data it achieves significant speedups: 2.61x on a 500-feature synthetic benchmark and 1.45x on the 432-feature IEEE-CIS Fraud dataset at 30% retention. At 70% retention it improves AUC by 0.11 points while delivering a 1.34x speedup. On extremely sparse data (Bosch, 90% missing) it yields no speedup, as LightGBM’s sparse bin optimization already bypasses empty values. We further introduce Stochastic EMA-FS (S-EMA-FS), which replaces deterministic top-K selection with gain-weighted random sampling controlled by a concentration parameter beta, unifying deterministic EMA-FS (beta -> infinity) and random subsampling (beta = 0) in one framework. Both are implemented in ~120 lines of C++ across all six LightGBM tree learners and are fully backward-compatible.
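The screening loop described in the EMA-FS abstract can be sketched in Python as follows. This is an illustrative reimplementation, not the paper's C++ code: the class and parameter names are hypothetical, the `alpha` and `warmup` defaults are arbitrary, and the stochastic branch assumes sampling probabilities proportional to the EMA raised to the power beta, which matches the stated limits (uniform at beta = 0, top-K as beta grows) but may differ from the authors' weighting.

```python
import numpy as np

class EmaFeatureScreen:
    """Tracks an EMA of per-feature split gains and picks features per tree."""

    def __init__(self, n_features, top_k, alpha=0.1, warmup=10):
        self.ema = np.zeros(n_features)
        self.top_k = top_k
        self.alpha = alpha        # EMA smoothing factor (assumed value)
        self.warmup = warmup      # trees built on all features before screening
        self.trees_seen = 0

    def update(self, gains):
        """Fold one tree's total split gain per feature into the EMA."""
        gains = np.asarray(gains, dtype=float)
        self.ema = (1.0 - self.alpha) * self.ema + self.alpha * gains
        self.trees_seen += 1

    def select(self, beta=None, rng=None):
        """Return sorted indices of features whose histograms the next tree builds."""
        n = self.ema.size
        if self.trees_seen < self.warmup:
            return np.arange(n)                      # warmup: use every feature
        if beta is None:                             # deterministic top-K
            top = np.argsort(-self.ema, kind="stable")[: self.top_k]
            return np.sort(top)
        if rng is None:
            rng = np.random.default_rng()
        # Assumed weighting: probability proportional to EMA**beta.
        weights = (self.ema + 1e-12) ** beta
        probs = weights / weights.sum()
        return np.sort(rng.choice(n, size=self.top_k, replace=False, p=probs))
```

In a boosting loop, `select` would be called once before each tree and `update` once after it, so the feature set is fixed for the whole tree. That per-tree granularity is what the abstract credits for compatibility with histogram subtraction.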
[LG-54] Mesh-RL: Coupled subgrid reinforcement learning
Link: https://arxiv.org/abs/2606.26333
Authors: Behnam Gheshlaghi, Bahador Rashidi, Shahin Atakishiyev
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning in large or sparse-reward environments suffers from slow temporal-difference reward propagation, as value information spreads only locally across the state space. We propose Mesh-RL, a spatial domain-decomposition framework inspired by the finite element method and domain decomposition theory, which partitions the environment into overlapping subgrids and enforces boundary-consistent temporal-difference updates. Such an approach enables localized learning while ensuring globally coherent value propagation. Unlike hierarchical or model-based approaches, Mesh-RL accelerates long-range credit assignment without modifying the reward function, Bellman operator, or introducing explicit planning mechanisms. We evaluate Mesh-RL on hazard-dense grid-world environments with varying geometries and mesh resolutions. Across Q-learning, SARSA, and Dyna-Q, Mesh-RL consistently improves convergence speed, cumulative reward, and learning stability. Higher mesh resolutions sustain exploration, prevent premature convergence, and substantially accelerate value propagation to distant states. While Dyna-Q already benefits from internal planning, it still achieves additional gains under structured decomposition. Overall, Mesh-RL introduces a principled spatial domain-decomposition mechanism for accelerating temporal-difference learning. Our framework bridges finite element method-inspired boundary-consistency techniques from scientific computing with reinforcement learning to improve sample efficiency in sparse-reward environments. We will release source code of the study.
[LG-55] High-Probability PL-SGD with Markovian Noise: Optimal Mixing and Tail Dependence
Link: https://arxiv.org/abs/2606.26316
Authors: Dhruv Sarkar, Aprameyo Chakrabartty, Vaneet Aggarwal
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We study first-order methods for smooth objectives satisfying the Polyak-Łojasiewicz (PL) condition when gradient samples are generated by an exogenous Markov chain. In the light-tailed setting, prior uniform-in-time high-probability bounds for ordinary Stochastic Gradient Descent (SGD) under a standard growth envelope scale as $\widetilde{O}(t_{\mathrm{mix}}^2/k)$, leaving a gap with the $\widetilde{O}(t_{\mathrm{mix}}/k)$ expectation bounds. We close this gap using a lag-blocking argument to establish a uniform high-probability guarantee with a leading stochastic term of $\widetilde{O}(t_{\mathrm{mix}}/(k+K_0))$ under geometric mixing. We prove this linear dependence on the mixing time is optimal via a matching $\Omega(\sigma^2 t_{\mathrm{mix}}/k)$ lower bound on a quadratic objective driven by a persistent two-state chain. We then extend this framework to heavy-tailed Markovian gradients satisfying a stationary finite-$p$-moment condition, $p \in (1,2]$. We design an all-samples clipped block method that uses every Markov transition while mitigating Markovian bias. Under a transition budget $T$, this algorithm achieves a high-probability stochastic error of $\widetilde{O}(\sigma_p^2 (t_{\mathrm{mix}}/T)^{2(p-1)/p})$. We establish a matching lower bound by reducing PL optimization to heavy-tailed mean estimation for a sticky Markov chain. Ultimately, this work tightly characterizes the optimal polynomial dependence on mixing time for light-tailed PL-SGD, and the optimal heavy-tail exponent and effective-sample-size dependence in the robust regime.
[LG-56] Equivariance and Augmentation for Bayesian Neural Networks
Link: https://arxiv.org/abs/2606.26273
Authors: Miaowen Dong, Axel Flinth, Jan E. Gerken
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Symmetries are important for many deep learning tasks, ranging from applications in the sciences to medical imaging. However, there is an ongoing debate about whether to impose symmetry constraints on the neural network architecture (yielding equivariant neural networks) or learn them from augmented training data. Although equivariant networks are well-studied theoretically, much less is known about data augmentation, since analyzing augmentation requires control over the training dynamics. Inspired by recent results that show that augmented infinite deep ensembles are exactly equivariant, we study data augmentation for Bayesian neural networks (BNNs) trained with variational inference. We focus on variational distributions in the exponential family and derive conditions under which exact equivariance is reached. We furthermore obtain bounds on the equivariance error and introduce three novel symmetrization techniques which boost the effect of data augmentation in this setting. We conduct extensive numerical experiments which show that one of our symmetrization methods (orbit expansion) outperforms the baseline in both equivariance and overall performance. Our code is available at this http URL
[LG-57] Dataset Usage Inference without Shadow Models or Held-out Data CVPR2026
Link: https://arxiv.org/abs/2606.26257
Authors: Wojciech Łapacz, Stanisław Pawlak, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Categories: Machine Learning (cs.LG)
Comments: Accepted at the 2nd Workshop on Synthetic Adversarial ForEnsics (SAFE), CVPR 2026 (non-archival)
Abstract:How much of my data was used to train a machine learning model? Dataset Usage Inference (DUI) aims to answer this by estimating what fraction of a dataset contributed to a model’s training. However, existing DUI methods rely on assumptions that rarely hold in practice: they require training expensive shadow models to imitate the target model, and they assume access to both known training samples and an in-distribution held-out set confirmed to be absent from training. These conditions make current approaches impractical for modern large models and real data ownership disputes. We introduce a practical DUI framework that removes these constraints. Our method requires neither shadow models nor real held-out data. Instead, it generates synthetic non-member samples, extracts diverse membership signals, and casts DUI as a mixture proportion estimation problem to estimate what share of the candidate dataset was used during training. Experiments on large image generative models show that our method reliably quantifies dataset usage, providing a practical tool for data owners to determine how much of their data was used to train a model.
[LG-58] A General Framework for Learning Algebraic Properties from Cayley Graphs using Graph Neural Networks
Link: https://arxiv.org/abs/2606.26212
Authors: Tal Weissblat
Categories: Machine Learning (cs.LG); Group Theory (math.GR)
Comments:
Abstract:A Graph Neural Network (GNN) framework for predicting the solvability of finite groups from their Cayley graph representations was introduced in [1]. In the present work, we generalize this approach and develop a property-independent framework for learning algebraic properties of finite groups directly from Cayley graphs. As representative case studies, we consider abelianity, nilpotency, and solvability. Using a common GNN architecture and training pipeline, we investigate the extent to which algebraic structure can be recovered from graph-based representations alone. Results on a collection of finite groups drawn from several families demonstrate that the framework successfully learns and distinguishes multiple algebraic properties from their associated Cayley graphs. These findings suggest that substantial algebraic information is encoded in graph representations and can be extracted through GNNs. More broadly, the proposed framework provides a proof of concept for applying graph representation learning to the study of algebraic properties of finite groups.
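As background for the input representation, the sketch below builds the Cayley graph of a small permutation group and computes one of the target labels (abelianity) directly. It only illustrates what a Cayley graph and its label look like; the function names are hypothetical, and the paper's dataset construction and GNN pipeline are not reproduced here.

```python
def compose(p, q):
    """Composition of permutations given as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

def cayley_graph(generators):
    """Return (elements, edges) of the group generated by the given permutations.

    Each edge (g, h, s) is directed from g to h = g o s and labelled by generator s.
    """
    identity = tuple(range(len(generators[0])))
    elements, frontier, edges = {identity}, [identity], []
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            edges.append((g, h, s))
            if h not in elements:
                elements.add(h)
                frontier.append(h)
    return elements, edges

def is_abelian(elements):
    """Ground-truth label: every pair of elements commutes."""
    return all(compose(a, b) == compose(b, a) for a in elements for b in elements)
```

For example, `cayley_graph([(1, 2, 0)])` yields the cyclic group of order 3, while `cayley_graph([(1, 0, 2), (1, 2, 0)])` yields the symmetric group on three points, a non-abelian case a classifier would need to separate from the first using graph structure alone.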
[LG-59] Topology-Informed Neural Networks for Flood Detection in Optical and Synthetic Aperture Radar Imagery
Link: https://arxiv.org/abs/2606.26204
Authors: Sophia Li, Max Zhao, Raghu G. Raj, Tianyu Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Floods frequently impact regions around the world. Rapid and accurate flood detection is crucial for emergency response and timely mitigation of human and economic loss. The expanding availability of satellite data and advances in artificial intelligence have enhanced monitoring of environmental hazards, but many flood events remain challenging to detect because cloud cover obscures optical satellite imagery. Rambour et al. introduced the SEN12-FLOOD dataset and extracted per-image features using a ResNet-50 convolutional neural network backbone, then fed these features into a gated recurrent unit network to show that temporal information can substantially improve accuracy compared to single-image baselines. More recently, Chamatidis et al. showed that a vision transformer can achieve strong performance with popular convolutional architectures. However, these models typically function as opaque black boxes, making it difficult to interpret their decision boundaries, learned features, and internal reasoning, especially in safety-critical domains like remote sensing. In contrast, topological data analysis (TDA) provides a mathematically grounded framework for capturing global structural features of data. TDA has emerged as a powerful tool for analyzing complex imagery, especially imagery with geometrically interpretable structures, of which floods are a prime candidate. In this work, we systematically evaluate topological descriptors for flood detection using the open-source SEN12-FLOOD dataset. By extracting topological features from each image and incorporating them into neural networks, we demonstrate that topological descriptors carry meaningful flood signals independently and complement existing networks to yield more robust and interpretable flood detection systems.
[LG-60] Federated Hash Projected Latent Factor Learning
Link: https://arxiv.org/abs/2606.26192
Authors: Jialan He
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Hash Learning (HL) is an efficient representation learning approach that maps real-valued data into compact binary representations. Traditional HL methods typically require users to upload personal data to a central server, which is incompatible with increasingly stringent data security regulations. Federated Learning (FL) provides a decentralized paradigm for learning globally optimal models without centralizing private data. However, most FL methods rely on transmitting large-scale real-valued gradient information, leading to high communication overhead and potential privacy risks. Integrating HL into FL is a promising solution. Nevertheless, existing HL methods suffer from limited representational capacity of binary codes, which may degrade model accuracy. To address this challenge, we propose a Federated Hash Projected Latent Factor (FHPLF) model. FHPLF introduces three key innovations: (a) replacing real-valued gradient matrices with binary gradient-like matrices, significantly reducing computation, storage, and communication costs while enhancing privacy protection; (b) leveraging Projected Hamming Distance for similarity modeling, which captures the importance of individual binary bits to improve representation capability; and (c) proposing a Secure Binary Gradient Reassembly and Privacy-Enhanced Upload (SBG-PEU) strategy to further reduce the risk of user interaction leakage during transmission. Extensive experiments on four real-world datasets demonstrate that FHPLF consistently outperforms state-of-the-art HL and FL methods, achieving a favorable trade-off among accuracy, efficiency, and privacy preservation.
[LG-61] Clue-Guided Money Laundering Group Discovery
Link: https://arxiv.org/abs/2606.26189
Authors: Boyang Wang, Jianing Cao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Money Laundering Group Discovery (MLGD) aims to identify hidden criminal groups and recover their complete structures in large-scale financial networks. Existing graph anomaly detection methods mainly produce node-level risk alerts, while global group discovery methods passively search for suspicious groups over the whole network. Both are mismatched with real Anti-money-laundering (AML) investigations, where analysts usually start from a concrete clue and gradually expand the investigation to recover the responsible group. To address this gap, we propose Clue-Guided Group Discovery (CGGD), where a laundering group is progressively recovered from an initial clue set through analyst interaction. We further propose Clue2Group, a framework that first constructs a compact local investigation context to reduce noise and preserve chain-like and cycle-like laundering structures. It then estimates a clue-conditioned local risk field with a multi-semantic local-temporal GNN, and finally integrates risk, structural, and prior-pattern evidence to recover a coherent laundering group. Experiments on two large-scale AML benchmarks show that Clue2Group provides a practical clue-driven analysis framework for AML investigations, offering a feasible step toward bridging the gap between graph-based AML research and real investigation workflows.
[LG-62] Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations
Link: https://arxiv.org/abs/2606.26185
Authors: Hiroki Tamba
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 2 tables
Abstract:LLM-as-judge (“grader”) components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader’s sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI’s open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so items near the decision boundary flip pass/fail across identical runs (per-item disagreement up to ~50% over 20 runs). Second, pinning temperature=0 reduces but does not eliminate flips: across 690 API calls spanning two providers, three model tiers, and five sampling configurations, 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). Claude Opus 4.7/4.8 has since deprecated temperature entirely, rendering the primary mitigation inapplicable to newer model generations. These findings expose a structural gap: evaluation harnesses that report single-run verdicts without variance or grader-disagreement metrics can present noise as a safety property. We release a reproduction harness (690 calls, 7 conditions) and recommend that harnesses treat grader disagreement as a first-class health metric alongside the scores themselves.
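The recommendation to track grader disagreement can be made concrete with a small sketch. The metric below, the fraction of repeated runs that differ from the majority verdict for an item, is an assumed definition; the abstract reports per-item disagreement but does not spell out the formula, and the function names and item IDs are hypothetical.

```python
from collections import Counter

def disagreement(verdicts):
    """Fraction of runs whose verdict differs from the majority verdict.

    verdicts: list of labels (e.g. "pass"/"fail") from repeated identical runs.
    Returns 0.0 for a fully reproducible item and 0.5 for an even two-way split.
    """
    counts = Counter(verdicts)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(verdicts)

def flaky_items(runs, threshold=0.0):
    """Map item id -> disagreement for items above the threshold.

    runs: dict of item id -> list of verdicts across repeated runs.
    """
    scores = {item: disagreement(v) for item, v in runs.items()}
    return {item: d for item, d in scores.items() if d > threshold}
```

Reporting `flaky_items` next to the pass rate lets a reader see whether a single-run verdict sits on a stable item or on one that flips between identical runs.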
[LG-63] Implementation of reinforcement learning in chemical reaction networks: application to phototaxis as curiosity-driven exploration
Link: https://arxiv.org/abs/2606.26168
Authors: Ruyi Tang, Grégoire Sergeant-Perthuis (LCQB-AG), David Colliaux
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted as a talk at ALIFE 2026 (Waterloo, Canada)
Abstract:Living systems navigate environments using noisy and incomplete sensory signals. In unicellular algae, phototaxis is often modeled as a mechanistic run–tumble process driven by stimulus–response rules. However, such descriptions overlook how organisms actively sample their environment to reduce sensory ambiguity. From a minimal cognition perspective, we reframe this navigation as a subjective, information-driven sensorimotor process. To this end, we propose a framework linking a Partially Observable Markov Decision Process (POMDP) with biochemical reaction dynamics. Environmental variables are hidden, while the cell updates a minimal internal state from each observation through a memoryless Bayesian step. These internal dynamics balance orienting toward light with exploratory reorientation and can be implemented through Chemical-Reaction-Network Ordinary Differential Equations (CRN–ODEs). Our model includes a biophysical observation process for photoreception and a chemically computable polynomial bound on information gain. Using Inverse Reinforcement Learning (IRL) on 30 experimentally recorded Chlamydomonas trajectories, we infer the behavioral objective consistent with observed phototactic motion and benchmark the resulting dynamics with standard Stochastic Simulation Algorithm (SSA) baselines. Our model reproduces the empirical alignment-to-light distribution, comparable to objective SSA baselines on this dataset. Within this framework, run–tumble alternation emerges as an information-acquisition strategy: tumbling reorients the cell to sample new sensory configurations and resolve sensor ambiguity, demonstrating how intracellular biochemical networks can support adaptive information-seeking behavior in cellular navigation.
[LG-64] chisao: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation
Link: https://arxiv.org/abs/2606.26164
Authors: Ira Wolfson
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)
Comments: 22 pages, 4 appendices
Abstract:Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing. Existing approaches – basin-hopping, CMA-ES, multistart gradient descent – operate sequentially and cannot exploit the massive parallelism of modern GPU hardware. We introduce chisao (Convergence-Halt-Invert-Stick-And-Oscillate), a GPU-native population optimizer that runs an entire sample batch simultaneously and exploits a deliberate convergence-anticonvergence oscillation cycle to escape local traps while freezing confirmed modes. The structural move is asymmetric: samples that reach true peaks are frozen ("stuck") and preserved, while the rest keep exploring via momentum-based anti-convergence and stochastically smoothed gradients. Adaptive reseeding via two complementary strategies (Repulse Monkey and Golden Rooster) maintains population diversity throughout. On all 42 functions of the Simon Fraser University optimization benchmark suite across dimensions $d \in \{2, 4, 8, 16, 32, 64\}$, chisao achieves 100% mode recovery where all CPU baselines collapse at $d \geq 8$ on the hardest multimodal functions, at up to $34\times$ speedup over basin-hopping on functions where all methods succeed (Michalewicz, $d=64$) and up to $39\times$ on unimodal functions (Rotated Hyper-Ellipsoid, $d=64$, pure GPU dividend). All benchmarks evaluate the objective by value alone – gradients come from finite differences – so the reported speedups are a derivative-free worst case. Under substantial likelihood noise ($\sigma_{\mathrm{noise}}$ up to 1.0), mode detection remains 100% reliable. The algorithm is available as a standalone open-source Python package on PyPI.
[LG-65] Reinforcement Learning Enables Autonomous Microrobot Navigation and Intervention in Simulated Blood Capillaries
Link: https://arxiv.org/abs/2606.26154
Authors: Jannik Drotleff, Samuel Tovey, Paul Hohenberger, Christoph Lohrmann, Julian Hoßbach, Konstantin Nikolaou, Christian Holm
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: 12 pages, 4 figures
Abstract:Autonomous microrobots navigating biological vasculature could enable targeted drug delivery and thrombolysis, yet training control policies for realistic environments remains an open challenge. Prior reinforcement learning (RL) studies of microrobotic navigation have been limited to idealized geometries that omit complex hydrodynamic flow fields, confined branching structures, and dense cellular obstacles found in vivo. Here, we develop a physically grounded simulation of a blood capillary network, incorporating realistic hydrodynamic flow fields, explicit red blood cell dynamics, and anatomically derived branching geometry, and train deep RL agents to navigate it via chemotaxis. We systematically map the physical limits of navigation across robot size and swimming speed, revealing a forbidden regime where Brownian motion and flow overcome propulsion. Successful agents independently discover multiple universal strategy types, including run-and-rotate and energy-efficient search-and-sit policies, regardless of robot parameters. Without retraining, these agents perform targeted blocking and unblocking of capillary flow, restoring throughput to healthy baseline levels. These results establish RL as a viable framework for developing autonomous microrobotic intervention strategies in complex biological environments.
[LG-66] Code evolution for link prediction in complex networks
Link: https://arxiv.org/abs/2606.26132
Authors: Alexey Vlaskin, Eduardo G. Altmann
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:
Abstract:The problem of predicting links in complex networks appears in different disciplines and has led to a variety of ingenious human-designed methods. We use this rich program space to explore the performance and behavior of automated code-evolution systems tasked to obtain machine-designed methods for link prediction. Despite being trained on limited data, algorithms evolved through code evolution outperform human-designed methods (with an average AUC score of 0.915 vs. 0.783, computed over 580 networks) and show improved computational efficiency, allowing them to be applied to networks with millions of links. The discovered methods follow approaches that have been employed in human-designed methods, but contain key innovations in the selection and combination of node- and link-features. This illustrates the role modern large language models and genetic algorithms can play in algorithmic innovation and scientific discovery more generally.
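For context on the AUC scores quoted above, here is a minimal sketch of how a simple human-designed baseline (common-neighbours scoring) could be evaluated on a held-out link. The toy graph, the held-out edge, and all function names are assumptions for illustration. This is neither one of the evolved methods nor the authors' evaluation protocol.

```python
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets built from an edge list."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbours(adj, u, v):
    """Score a candidate link by the number of neighbours u and v share."""
    return len(adj[u] & adj[v])


def auc(pos_scores, neg_scores):
    """Probability that a random true link outscores a random non-link (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy graph (hypothetical): observed edges, plus one true edge hidden from the scorer.
nodes = ["a", "b", "c", "d", "e"]
observed = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("d", "e")]
held_out = [("c", "d")]

adj = build_adjacency(nodes, observed)
known = {frozenset(e) for e in observed} | {frozenset(e) for e in held_out}
non_edges = [p for p in combinations(nodes, 2) if frozenset(p) not in known]

pos = [common_neighbours(adj, u, v) for u, v in held_out]
neg = [common_neighbours(adj, u, v) for u, v in non_edges]
score = auc(pos, neg)
print(f"AUC = {score:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of true links from non-links.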
[LG-67] Physics-guided Convolutional Neural Network for Domain Growth Prediction in Systems with Conserved Kinetics
Link: https://arxiv.org/abs/2606.26128
Authors: Vijay Yadav, Madhu Priya, Manish Dev Shrimali, Prabhat K. Jaiswal
Categories: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft)
Comments:
Abstract:The spatiotemporal evolution of many physical, chemical, and biological systems is described by nonlinear partial differential equations (PDEs). Recently, deep neural network-based surrogate models have gained increasing interest as efficient alternatives to computationally expensive traditional numerical solvers. In this work, we propose an attention-based, physics-guided convolutional neural network as a surrogate model to learn the microstructural evolution of such systems. We train the model to accurately predict the full time-evolution of phase separation in binary mixtures governed by the Cahn-Hilliard equation. We show that predictions from our trained surrogate model remain stable and accurate over long-time rollouts for both critical and off-critical mixtures and preserve the mixture composition throughout evolution. We also show that our model accurately captures the growth of domain size and is consistent with the Lifshitz-Slyozov domain-growth law. The prediction results demonstrate the effectiveness of the proposed framework for modeling systems with conserved kinetics and can be extended to other complex dynamical systems.
[LG-68] When are likely answers right? On Sequence Probability and Correctness in LLMs
Link: https://arxiv.org/abs/2606.27359
Authors: Johannes Zenn, Jonas Geiping
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 38 pages, including 10 pages of main text and 28 pages of appendix, preprint
Abstract:Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.
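The sequence probability discussed above, the conditional probability of a continuation given a prompt, factorises into per-token conditional probabilities, so it is usually computed as a sum of log-probabilities. A minimal sketch follows. The per-token probabilities are made-up numbers, not outputs of any model, and the helper name is an assumption.

```python
import math


def sequence_log_prob(token_probs):
    """log p(continuation | prompt) as the sum of log p(y_t | prompt, y_<t)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token conditional probabilities for two responses to one prompt.
response_a = [0.9, 0.8, 0.5]
response_b = [0.6, 0.6, 0.6, 0.6]

logp_a = sequence_log_prob(response_a)
logp_b = sequence_log_prob(response_b)
prob_a = math.exp(logp_a)  # equals 0.9 * 0.8 * 0.5
prob_b = math.exp(logp_b)  # equals 0.6 ** 4
more_likely = "a" if logp_a > logp_b else "b"
```

Ranking responses by this quantity says only which one the model considers more likely. Whether that ranking tracks correctness is the empirical question the paper examines.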
[LG-69] Ribbon: Scalable Approximation and Robust Uncertainty Quantification
Link: https://arxiv.org/abs/2606.27269
Authors: Graham Gibson, John Tipton, Kellin Rumsey, Natalie Klein
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Reliably quantifying predictive uncertainty is difficult for complex, high-dimensional, or misspecified models. Both fully Bayesian and bootstrap resampling methods provide principled uncertainty estimates but are often too expensive for modern machine-learning models because they require posterior sampling or repeated model refitting. We introduce Ribbon, a scalable approximation to Dirichlet-reweighted bootstrap uncertainty. Ribbon replaces repeated refitting with an influence-function linearization around a single fitted model, preserving the first-order data-reweighting structure of the Bayesian bootstrap while requiring only post-hoc linear algebra. Ribbon approximates the Bayesian-bootstrap or weighted-likelihood-bootstrap refitting target. With a general concentration parameter, Ribbon gives a calibrated Dirichlet-reweighting family whose uncertainty scale can be tuned on validation data. We show that Ribbon is asymptotically equivalent to a flat-prior Laplace approximation under correct likelihood specification and recovers the robust sandwich covariance under misspecification. Across synthetic regression, MNIST classification, and California Housing benchmarks, Ribbon provides competitive predictive performance and improved calibration in several settings while avoiding repeated model retraining.
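To illustrate the general idea of replacing repeated refitting with an influence-function linearization around a single fit, here is a minimal sketch for the simplest possible estimator, the sample mean. It is not the Ribbon algorithm, whose details the abstract does not give. The toy data, function names, and the choice of the mean as estimator are assumptions for illustration.

```python
import random


def dirichlet_weights(n, concentration, rng):
    """Draw symmetric Dirichlet weights by normalising independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def refit_mean(x, w):
    """'Refit' under weights w: for the mean this is the weighted average."""
    return sum(wi * xi for wi, xi in zip(w, x))


def linearised_mean(x, w):
    """First-order update around the single unweighted fit, using influence values."""
    n = len(x)
    fit = sum(x) / n
    influence = [xi - fit for xi in x]  # influence function of the mean
    return fit + sum((n * wi - 1.0) * inf for wi, inf in zip(w, influence)) / n


rng = random.Random(0)
x = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8]  # toy data (hypothetical)

pairs = []
for _ in range(200):
    w = dirichlet_weights(len(x), 1.0, rng)  # concentration 1.0: Bayesian bootstrap
    pairs.append((refit_mean(x, w), linearised_mean(x, w)))

# Largest disagreement between refitting and the linearised update.
max_gap = max(abs(a - b) for a, b in pairs)
```

For the mean, the linearization reproduces the reweighted estimate up to floating-point rounding because the estimator is linear in the weights. For nonlinear models it is only a first-order approximation.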
[LG-70] Enabling self-supervised learned primal dual with Noise2Inverse
Link: https://arxiv.org/abs/2606.26991
Authors: Antti Sällinen, Siiri Rautio, Santeri Kaupinmäki, Andreas Hauptmann
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:X-ray computed tomography reconstruction is an ill-posed inverse problem, particularly in low-dose and sparse-angle settings where measurements are noisy and incomplete. While learned reconstruction methods such as the Learned Primal-Dual algorithm achieve strong performance, they typically rely on supervised training with access to ground-truth data, which is often unavailable in practice. In this work, we propose a self-supervised reconstruction method by extending the Noise2Inverse framework to the Learned Primal-Dual algorithm. The resulting approach, called Noise2Inverse Learned Primal-Dual (N2I-LPD), enables training of a learned iterative reconstruction operator without ground-truth images by exploiting the statistical independence of noise in distinct measurements with respect to angular rotation of the CT-scan. We compare the proposed method with classical reconstruction methods, as well as neural network-based approaches such as a U-Net trained within the same N2I framework. The results demonstrate that N2I-LPD achieves improved reconstruction quality, highlighting the potential of combining learned reconstruction operators with self-supervised training strategies for practical CT imaging scenarios where ground-truth data is unavailable.
[LG-71] Scalable Message-Passing Quantum Graph Neural Networks in the Weisfeiler-Leman Hierarchy
Link: https://arxiv.org/abs/2606.26873
Authors: Snehal Raj, Brian Coyle, Léo Monbroussou, André J. Ferreira-Martins, Renato M. S. Farias, Elham Kashefi
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures (main text); 35 pages, 9 figures, 14 tables including Supplementary Information. Code available at this https URL
Abstract:Graphs provide a natural language for relational data in chemistry, biology and optimisation. Graph neural networks (GNNs) have driven much of the recent progress in learning from such data through message passing, a single primitive that generalises convolution and attention. Quantum counterparts have been proposed, but with limited connection to message passing and few guarantees on performance or scalability. More broadly, the trainability of variational quantum circuits is a recognised bottleneck for their wide applicability, and pre-training has emerged as one way to address it. Yet for a quantum model to be useful, it must offer expressivity guarantees along with demonstrable scalability. Here we show how a quantum graph neural network can be built to perform message passing, to be permutation equivariant, and to sit at a chosen level of the Weisfeiler-Leman hierarchy, the standard measure of how finely a model can tell graphs apart. We show that, as for classical GNNs, the training can be done first on small graph instances, allowing for a pre-training that can mitigate usual training issues, and its output can be read out at a cost that stays low as the graph grows. We validate the framework in large-scale simulations of up to 56 qubits across three datasets, on synthetic graphs that ordinary message passing cannot separate, on molecular property prediction, and on the travelling salesperson problem. Our framework opens a path for near-term quantum algorithms with theoretical guarantees and practical scalability, bringing the principles of graph learning into quantum circuit design.
[LG-72] State-Specific Respiratory Signatures for Affective and Stress Recognition: Interpretable Respiratory Markers Autocorrelation Lags and Compact CNN Models
Link: https://arxiv.org/abs/2606.26723
Authors: Andrei Velichko, Mehmet Tahir Huyut
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 44 pages, 6 figures, 16 tables
Abstract:Respiratory activity is a direct and interpretable physiological channel for wearable stress and affective-state recognition, yet many studies emphasize classification accuracy without identifying which respiratory properties separate different states. This work reframes RESP-based recognition as a joint predictive and explanatory problem. Using the chest respiratory channel of the WESAD dataset, we analyze 60 s windows under leave-one-subject-out validation and combine two complementary branches: compact raw-signal one-dimensional convolutional neural networks (1D-CNNs) and physically grouped handcrafted respiratory signatures. The primary application task is binary stress versus non-stress detection, while baseline, stress, amusement, and meditation are additionally analyzed in a one-vs-rest setting to reveal state-specific respiratory markers. The feature space is organized into respiratory timing, breath-to-breath variability, waveform statistics, spectral/time-frequency descriptors, and autocorrelation/nonlinear predictability descriptors, with the raw 60 s signal treated as a sixth representation for the CNN branch. We introduce autocorrelation transition lags (Zpm/Zmp) as interpretable markers of respiratory correlation scale and separately evaluate exploratory FEG-Pro/Lyapunov-like descriptors. In the final CNN refit setting, the raw-signal model achieved the strongest stress-vs-rest performance, with accuracy 96.72 percent, macro-F1 95.30 percent, and MCC 90.61 percent. In contrast, compact feature models were stronger for baseline, with MCC 65.34 percent, amusement, with MCC 35.69 percent, and especially meditation, with MCC 88.65 percent. These results show that CNNs are most useful for the practical stress detector, whereas interpretable respiratory signatures provide stronger and more physiologically transparent state-specific markers for several non-stress conditions.
[LG-73] Generating Special Triangulations with Transformers
Link: https://arxiv.org/abs/2606.26660
Authors: Charles Arnal, Jacky H. T. Yip, François Charton, Gary Shiu
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Algebraic Geometry (math.AG)
Comments: 21 pages, 11 figures. Contribution to the edited volume “Recent Progress in Computational String Geometry” (World Scientific), based on the BIRS-CMI workshop (26w5653)
Abstract:Triangulations, i.e., well-structured decompositions of geometric objects into triangle-like pieces, are central objects in many domains of mathematics and physics. In particular, fine, regular, and star triangulations (FRSTs) of 4D reflexive polytopes give rise to smooth Calabi-Yau threefolds, which are of significant interest in string theory. However, the high dimensionality and combinatorial complexity of triangulations make them particularly challenging to model with classical numerical methods or machine learning. In this work, we show that transformers, equipped with an appropriate encoding scheme, can be effectively trained to representatively generate new FRSTs across a range of polytope sizes. Moreover, these models can also self-improve through retraining on their own output. This opens the door to both concrete applications to the classification of Calabi-Yau manifolds and further research in physics, combinatorics and algebraic geometry.
[LG-74] Mean-Field PhiBE: Continuous-Time Mean-Field Reinforcement Learning from Discrete-Time Data
Link: https://arxiv.org/abs/2606.26498
Authors: Erhan Bayraktar, Martin Hernandez, Qinxin Yan, Yuhua Zhu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:This paper addresses model-free continuous-time mean-field control in a setting where the population dynamics evolve continuously according to an unknown McKean-Vlasov stochastic differential equation, while only discrete-time transition data are available. In the model-based formulation, policy evaluation is naturally described by a stationary Hamilton-Jacobi-Bellman equation on $\mathcal{P}_2(\mathbb{R}^d)$, but this equation involves the drift and diffusion coefficients of the controlled McKean-Vlasov dynamics, which are not identifiable when only discrete-time data are available. On the other hand, a direct reduction to a time-discrete Bellman equation avoids the non-identifiability issue but loses the differential equation structure. To bridge these two viewpoints, we introduce a Mean-Field-PhiBE (MF-PhiBE), which incorporates discrete-time transition information into a continuous-time PDE on the Wasserstein space. The MF-PhiBE replaces the unknown infinitesimal drift and covariance in the policy-evaluation equation by one-step estimators computed from data, while preserving the generator structure of the McKean-Vlasov HJB equation. We also derive a policy-gradient theorem for entropy-regularized randomized feedback policies, expressing the actor direction through an action-wise infinitesimal advantage and the score of the policy. Combining these two ingredients yields a model-free actor-critic method. We prove a first-order consistency estimate showing that the value induced by an optimal MF-PhiBE policy approximates the optimal continuous-time value with an error of order $\Delta t$. In the linear-quadratic case, we show our approximation achieves second-order accuracy with only one-step data. Numerical experiments on an LQR benchmark and a crowd-aversion problem illustrate the proposed framework.
[LG-75] A probabilistic framework for online test-time adaptation
Link: https://arxiv.org/abs/2606.26457
Authors: Daniel Corrales, David Ríos Insua
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a probabilistic framework for online test-time adaptation problems. In them, a model is trained on labeled data but must adapt to unlabeled data at test time under the assumption that training and test distributions potentially differ, that is, there might have been a distributional shift. The framework is based on a state-space modelling architecture from which parameter learning, parameter time evolution, prior tuning, and prediction can be characterized.
[LG-76] Interpreting “Interpretability” and Explaining “Explainability” in Machine Learning in Physics
Link: https://arxiv.org/abs/2606.26228
Authors: Rikab Gambhir, Luisa Lucie-Smith, Jesse Thaler
Categories: Data Analysis, Statistics and Probability (physics.data-an); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Comments: 31 pages, 3 figures, Part of the VERaiPHY Initiative
Abstract:We review the concepts of interpretability and explainability as they apply to machine learning in physics. We define interpretability as concerning the structural transparency of a model (the ability to understand or approximate its inner workings) and explainability as concerning the scientific content of a model (the ability to map it onto domain knowledge). We discuss the trade-offs each entails (interpretability vs. expressivity; explainability vs. adaptability), the contexts in which each is needed, and the intrinsic and post-hoc tools available for achieving them. Throughout, we emphasize that machine-learned models are subject to the same scientific questions as classical models, differing only in scale, and that interpretability and explainability are best understood as deliberate modeling choices rather than inherent properties. We also emphasize the importance of task specification and intervention plans as a core aspect of model design.
[LG-77] The Role of Input Dimensionality in the Emergence and Targeted Control of Adversarial Examples
Link: https://arxiv.org/abs/2606.26207
Authors: Nasrin Malekzadeh Goradel, Niccolo Pancino, Yaser Gholizade Atani, Benedetta Tondi, Giovanni Bellettini, Mauro Barni
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Several theoretical works have tried to explain the adversarial vulnerability of deep neural networks through properties of high-dimensional geometry. However, the assumptions underlying these works are rarely examined empirically, and systematic evidence remains limited. In this work, we present a systematic study of the role of input dimensionality in both the emergence and the targeted control of adversarial examples. We first analyse the scope and limitations of existing theoretical frameworks based on concentration of measure, showing that real image classes exhibit strong empirical localization, beyond what such theories typically assume. We then conduct an extensive empirical evaluation across hierarchical image datasets spanning a wide range of input dimensionalities and diverse neural architectures. Our results consistently show that adversarial examples become easier to construct as dimensionality increases. We also investigate how input dimensionality affects the additional difficulty of crafting targeted adversarial examples. In particular, we provide theoretical arguments showing that high-dimensional geometry implies that enforcing a specific target label entails only a limited additional distortion compared to untargeted attacks. We corroborate this insight through extensive experiments, demonstrating that the gap between targeted and untargeted perturbations remains small and further narrows as input dimensionality increases. While, taken together, our findings establish high input dimensionality as a fundamental factor underlying the emergence and targeted control of adversarial examples, whether this phenomenon primarily arises from the interplay between high-dimensional geometry and data distributions or from the architectural properties of deep neural networks remains an open question.