Comprehensive benchmarking analysis of Vultr Block Storage performance, rate limits, tiers, and fio-based replication methodology.
We've run extensive benchmarking of Vultr Block Storage. For additional context on the performance metrics referenced throughout this document and to better understand how storage performance is measured, benchmarked, and compared, see What Are the Fundamentals of Storage Performance?. Vultr Block Storage is designed for use with Vultr Cloud Compute instances and is offered in two performance tiers: HDD Block and NVMe Block.
HDD Block is designed to be cost effective at the expense of relatively low performance. It's available at all Vultr sites. As the name suggests, it is largely composed of hard disk drives, but it also includes some flash storage used to accelerate certain metadata operations and to provide caching. Your data is spread redundantly across a large number of such drives to increase performance, but it is still fundamentally limited by the speed of hard disk drives.
NVMe Block is designed for much higher performance, but as a consequence, it costs more. It is available at a large number of Vultr sites, especially those with GPU or high-performance CPU systems. As the name suggests, it is composed of NVMe flash drives and so needs no further acceleration. It too is redundant, but its mode of redundancy is chosen for speed rather than lower cost. Again, your data is spread across a large number of such drives to increase performance, and the speed of NVMe drives makes it significantly faster.
Both tiers of Vultr Block Storage are rate limited at the hypervisor so that a given subscription cannot exceed certain IOPS and throughput levels for sustained periods. These limits are imposed to avoid situations where one VM instance consumes all the available network throughput or processing power for its storage workload, starving competing workloads.
Block Storage rate limits also allow short bursts of up to 60 seconds during which up to 150% of the sustained limit can be achieved. Burst capacity is replenished during periods when less than the sustained limit is requested, so after a burst is consumed, the workload must spend time below the sustained limit before it can burst again.
| Tier | Sustained IOPS Limit | Sustained Throughput Limit |
|---|---|---|
| HDD Block | 500 IOPS | 100 MB/s (95.3 MiB/s) |
| NVMe Block | 10,000 IOPS | 400 MB/s (381.4 MiB/s) |
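Combining the table above with the 150% burst allowance, the short-term ceilings can be computed directly (a quick sketch derived from the published limits, not separately measured values):

```shell
# Burst ceilings are 150% of the sustained limits, for up to 60 seconds.
echo "HDD Block burst:  $(( 500   * 150 / 100 )) IOPS, $(( 100 * 150 / 100 )) MB/s"
echo "NVMe Block burst: $(( 10000 * 150 / 100 )) IOPS, $(( 400 * 150 / 100 )) MB/s"
# HDD Block burst:  750 IOPS, 150 MB/s
# NVMe Block burst: 15000 IOPS, 600 MB/s
```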
It is important to understand how these limits interact with choice of block size. For instance, 500 IOPS at a 4 KB block size yields only 2 MB/s. Conversely, you need a block size of at least 200 KB to reach the 100 MB/s throughput limit of HDD Block before hitting the 500 IOPS limit: 200 KB × 500 IOPS = 100 MB/s.
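This interaction can be sketched numerically: at each block size, throughput is the block size times the achieved IOPS, capped by whichever limit is reached first. Using the HDD Block limits (500 IOPS, 100 MB/s; decimal units, 1 MB = 1000 KB):

```shell
# Throughput implied by the 500 IOPS limit at each block size,
# capped at the 100 MB/s throughput limit.
for bs_kb in 4 64 200 512; do
    mbps=$(( bs_kb * 500 / 1000 ))
    if [ "$mbps" -gt 100 ]; then mbps=100; fi
    echo "block size ${bs_kb} KB -> ${mbps} MB/s"
done
# block size 4 KB -> 2 MB/s
# block size 64 KB -> 32 MB/s
# block size 200 KB -> 100 MB/s
# block size 512 KB -> 100 MB/s
```

Below 200 KB the IOPS limit governs; at 200 KB and above, the 100 MB/s cap takes over.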
We used the fio utility to measure performance at several block sizes: 4 KB, 64 KB, 512 KB, 1024 KB, and 4096 KB. We performed tests with 100% reads (both random and sequential), 100% writes (again both random and sequential), and a mixed 50%/50% read/write workload.
We performed these tests across a wide variety of VM instance plans so as to see any break points where insufficient CPU or memory in the plan could impact storage performance. We did not find such a break point with block performance even on plans with only 1 core and 1 GB of RAM.
This table shows the performance results for each of the three IO Types at each of the tested block sizes. In each case we used a queue depth of 4, and a job count of 4.
| IO Type | Block Size | Mean IOPS | Mean Throughput (MiB/s) | Mean Latency (ms) |
|---|---|---|---|---|
| randwrite, randread, and randrw | 4 KB | ≈10,000 | ≈40 | 2.7-3.2 |
| randwrite, randread, and randrw | 64 KB | ≈6,000 | ≈381 | 4-5 |
| randwrite, randread, and randrw | 512 KB | ≈750 | ≈381 | 40-50 |
| randwrite, randread, and randrw | 1 MB | ≈380 | ≈381 | 80-100 |
| randwrite, randread, and randrw | 4 MB | ≈95 | ≈381 | 320-420 |
It is important to note that this table lists mean throughput in MiB/s whereas the rate limits are expressed in MB/s. For example, 381.4 MiB/s is 400 MB/s, which is the throughput rate limit for NVMe Block storage.
At the 4 KB block size, the individual IOs are so small that throughput is constrained by reaching the IOPS rate limit. At larger block sizes, the throughput rate limit is hit instead, which keeps IOPS below the IOPS rate limit.
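This reasoning predicts the NVMe Block figures in the table above: once the 400 MB/s cap is reached, expected IOPS is simply 400,000,000 bytes/s divided by the block size in bytes (a back-of-the-envelope check, not a separate measurement):

```shell
# Predicted IOPS at the NVMe 400 MB/s throughput cap, per block size.
for bs_kib in 64 512 1024 4096; do
    awk -v bs="$bs_kib" \
        'BEGIN { printf "block size %4d KiB -> ~%d IOPS\n", bs, 400000000 / (bs * 1024) }'
done
# block size   64 KiB -> ~6103 IOPS
# block size  512 KiB -> ~762 IOPS
# block size 1024 KiB -> ~381 IOPS
# block size 4096 KiB -> ~95 IOPS
```

These track the measured means of ≈6,000, ≈750, ≈380, and ≈95 IOPS closely.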
Latency rises as block size increases because the throughput rate limit has been hit. Rate limiting operates by delaying responses until answering would keep throughput below the limit, effectively injecting latency.
For HDD Block, results are similar in that either the IOPS rate limit or the throughput rate limit is hit. The primary difference is that the rate limits for HDD Block are lower.
The overall conclusion is that both HDD Block and NVMe Block can achieve their rate limited speeds of 500 IOPS and 100 MB/s for HDD and 10,000 IOPS and 400 MB/s for NVMe, even on the smallest instance plans.
You can replicate these results for yourself by using fio at any of the block sizes mentioned, with any operations mix you would like.
Install the fio utility and any dependencies. It can be found on git.kernel.org, but your distribution likely has it available as a package; in most distributions the package is simply called fio. You should also install libaio so that it is available to fio. The package is usually called either libaio-dev or libaio-devel, depending on your distribution.
When running fio, you will need to create a job configuration file that you can reference and then run a command line that points at the job file.
```ini
[FIOJOB]
filename=/mnt/vbs/fio.raw
size=500G
random_generator=lfsr
buffered=0
direct=1
invalidate=0
ioengine=libaio
rw=randwrite
bs=4k
iodepth=4
numjobs=4
runtime=900
loops=1
time_based=1
```
Key values to change to match the workload you are testing are:
- filename= This should be a file on the file system where you are testing. If you are testing directly against the block device itself, understand that the test is destructive: it will overwrite any data on the device and destroy any file system on the raw device.
- direct= 1 enables O_DIRECT, 0 disables it. Use direct=1 with ioengine=libaio.
- ioengine= We recommend libaio for best results, but you may wish to compare it with sync or psync. We used libaio in our testing.
- rw= randread, randwrite, and randrw are the most useful options.
- bs= Block size. We used 4k, 64k, 512k, 1M, and 4M in our testing.
- iodepth= The queue depth per job. We used 4 in our testing.
- numjobs= The number of simultaneous jobs to run. We used 4 in our testing.

Then you can reference the job config from the command line:
```shell
$ fio \
    --eta=never \
    --status-interval=5000ms \
    --output-format=json+ \
    $FIOJOBFILE
```
Where $FIOJOBFILE is the path to the job file created above. See the fio documentation for more details.
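The json+ output nests per-job results under a jobs array. A minimal way to pull out headline numbers is sketched below against a tiny synthetic fixture standing in for real fio output; real results carry many more fields, and the bw_bytes field requires a reasonably recent fio:

```shell
# Synthetic stand-in for a fio json+ result file.
cat > /tmp/fio-result.json <<'EOF'
{ "jobs": [ { "jobname": "FIOJOB",
              "write": { "iops": 9987.2, "bw_bytes": 40907571 } } ] }
EOF

# Extract the write IOPS and bandwidth from the jobs[] array.
python3 - <<'EOF'
import json
with open("/tmp/fio-result.json") as f:
    job = json.load(f)["jobs"][0]
print(f'{job["jobname"]}: {job["write"]["iops"]:.0f} IOPS, '
      f'{job["write"]["bw_bytes"] / 1_000_000:.1f} MB/s')
EOF
# FIOJOB: 9987 IOPS, 40.9 MB/s
```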
Enable higher levels of parallelism by making more simultaneous requests. Increase the number of processes, threads, or workers issuing I/O operations. In the fio benchmarking utility, parallelism can be increased by increasing numjobs and iodepth (the number of requests each job allows to be in flight without a response).
Larger queue depths allow more requests to remain in flight while waiting for responses, preventing high latency from artificially reducing throughput.
In many cases, asynchronous I/O can increase performance. Some applications can leverage libaio via a configurable option. In fio, enable asynchronous I/O with ioengine=libaio.
In most cases, caches should be disabled by using Direct I/O. In fio, this is achieved with direct=1 and buffered=0.
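Putting this tuning advice together, a job section sketched for higher parallelism might look like the following (the iodepth and numjobs values here are illustrative, not the ones used in the tests above):

```ini
[PARALLEL]
filename=/mnt/vbs/fio.raw
size=500G
ioengine=libaio   ; asynchronous I/O
direct=1          ; O_DIRECT: bypass the page cache
buffered=0
rw=randread
bs=4k
iodepth=16        ; more requests in flight per job
numjobs=8         ; more workers issuing I/O
runtime=60
time_based=1
```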