AI Generated Images with Stable Diffusion XL and Vultr Cloud GPU

Introduction

Stable Diffusion XL (SDXL) is a deep learning text-to-image diffusion model developed by Stability AI. It can enhance existing images or generate new ones guided by a text description prompt. SDXL can generate high-resolution images that are more realistic and visually appealing. It can also generate images with more complex details, such as faces and objects.

This guide explains how to use the Stable Diffusion XL (SDXL) model on a Vultr Cloud GPU instance. You will install the required packages, generate an image with the SDXL base model, improve image quality with the Refiner model, combine the base and Refiner models in a single workflow, and replace masked regions of an image by inpainting.

Prerequisites

Before you begin:

- Deploy a fresh Ubuntu 22.04 A100 Vultr Cloud GPU server with at least 80 GB of GPU RAM.
- Securely access the server using SSH as a non-root user with sudo privileges.
- Update the server.
- Install JupyterLab and PyTorch.
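
Before continuing, it can help to confirm that PyTorch sees the GPU. The following is a minimal sketch, not part of the original setup steps; the function name describe_gpu is hypothetical, and it assumes the GPU of interest is CUDA device 0.

```python
def describe_gpu():
    """Return a short status string describing the first CUDA device, if any."""
    try:
        import torch  # imported here so the check also works when PyTorch is missing
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return f"{name}: {total_gb:.1f} GB of GPU memory"

print(describe_gpu())
```

If the output reports no CUDA device, revisit the PyTorch installation before loading any model.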

Install Required Packages

Install diffusers and other required packages
```
 $ pip install diffusers transformers accelerate safetensors matplotlib ipywidgets
```
The above command installs the following packages:
- transformers: Provides pre-trained models for tasks such as Natural Language Processing (NLP), Named Entity Recognition (NER), machine translation, and sentiment analysis.
- diffusers: Provides pre-trained diffusion models as prepackaged pipelines, along with tools for building and training diffusion models. It also includes many core neural network models that serve as building blocks for new pipelines.
- accelerate: Enables PyTorch to run across any distributed configuration. It uses accelerators like GPUs and TPUs to improve efficiency, scalability, and performance.
- safetensors: A file format and library for storing and loading tensors safely and quickly. Unlike pickle-based checkpoints, loading a safetensors file does not execute arbitrary code. The use_safetensors=True argument used later in this guide loads model weights in this format.
- matplotlib: Displays the generated images in Jupyter Notebook.
- ipywidgets: Provides interactive widgets for Jupyter, such as the progress bars shown while models download and images generate.
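
To confirm that the packages installed correctly, you can query their versions from a notebook cell. This is a small sketch using only the Python standard library; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Package names taken from the pip command above.
REQUIRED = ["diffusers", "transformers", "accelerate", "safetensors", "matplotlib", "ipywidgets"]

def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```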
Create a new directory to save generated images
```
 $ mkdir /home/jupyter/notebooks/generated_images
```
The above command creates a new directory to save the generated images.
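
If you prefer to create the directory from a notebook cell instead of the terminal, the following sketch does the same thing with the standard library. The helper name ensure_directory is hypothetical; it creates missing parent directories and does nothing if the directory already exists.

```python
from pathlib import Path

def ensure_directory(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

For example, ensure_directory("/home/jupyter/notebooks/generated_images") mirrors the mkdir command above.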

Image Generation using Stable Diffusion XL Model

Stable Diffusion XL (SDXL) is a pre-trained text-to-image generation model with 3.5 billion parameters, capable of generating realistic images with resolutions of up to 1024 x 1024 pixels.

To generate images with Stable Diffusion XL, import the required modules such as StableDiffusionXLPipeline from diffusers, torch, and matplotlib.pyplot. Then, initialize the model using the provided model_id and set it up for GPU acceleration by assigning it to the CUDA device.

Open a new Notebook and set its name to Stable Diffusion XL Base
To use the model, import the following modules
```
 import torch
 import matplotlib.pyplot as plt
 from diffusers import StableDiffusionXLPipeline
```
Below is what each module does:
- StableDiffusionXLPipeline class provides an interface to the model for generating images.
- torch enables support for tensor computations. In this context, it's used for GPU acceleration.
- matplotlib library allows you to display the generated images.
Declare the model
```
 model_id = "stabilityai/stable-diffusion-xl-base-1.0"
 pipe = StableDiffusionXLPipeline.from_pretrained(
     model_id, 
     torch_dtype=torch.float16, 
     variant="fp16", 
     use_safetensors=True
 )
 pipe.to("cuda")
```
By calling the from_pretrained method, the pipeline takes care of the necessary setup to generate images from text.

The parameters passed to the from_pretrained() method are:
- model_id identifies the pipeline to load. The function call above loads the "stabilityai/stable-diffusion-xl-base-1.0" model. The model ID can also be the path to a local directory containing model weights or a path (local or URL) to a checkpoint file.
- torch_dtype is the Torch datatype of the tensors used for pipeline computations. float16 is specified explicitly so that the model computations are done in 16-bit floating point numbers, which reduces GPU RAM usage. It is possible to let the system choose the optimal data type using torch_dtype = "auto".
- variant="fp16" selects the 16-bit version of the weight files from the model repository.
- use_safetensors=True loads the weights from safetensors files.
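
To see why 16-bit weights help on systems with less GPU RAM, here is a rough back-of-the-envelope estimate. It uses the 3.5 billion parameter figure quoted earlier in this guide and counts weights only, so treat it as an approximation: activations and other runtime overhead are not included, and the helper name weight_memory_gb is hypothetical.

```python
# Rough estimate of the memory needed for model weights alone.
PARAMETER_COUNT = 3_500_000_000  # figure quoted in this guide
BYTES_PER_VALUE = {"float32": 4, "float16": 2}

def weight_memory_gb(parameter_count, dtype):
    """Return the approximate weight size in gigabytes (10^9 bytes)."""
    return parameter_count * BYTES_PER_VALUE[dtype] / 1e9

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: about {weight_memory_gb(PARAMETER_COUNT, dtype):.1f} GB")
```

Under these assumptions, float16 halves the weight footprint compared with float32.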
Generate an image by providing a prompt as below
```
 prompt = "Astronaut in a jungle"
 image = pipe(prompt=prompt).images
```
Replace Astronaut in a jungle with your desired text prompt.

The above code passes the prompt to the pipeline declared earlier and stores the list of generated images from the images attribute. A different image is generated each time you run the code. You can improve the output image by providing a more detailed prompt.
Render the generated image
```
 plt.imshow(image[0])
```
The Astronaut in a jungle prompt generates an image like the one below:

Increase Image Quality using Stable Diffusion XL Refiner Model

Image-to-Image is a pre-trained diffusion pipeline from the Diffusers library. It works with a refiner checkpoint that specializes in denoising images in the low-noise stage to improve their high-frequency detail.

This section explains how to refine an image using this pre-trained pipeline together with the refiner checkpoint.

Open a new Notebook and set its name to Stable Diffusion XL Image to Image.
To clear GPU memory before running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shut Down All Kernels.
To use the model, import the following packages
```
 import torch
 import matplotlib.pyplot as plt
 from diffusers import StableDiffusionXLImg2ImgPipeline
 from diffusers.utils import load_image
```
Below is what each module does:
- StableDiffusionXLImg2ImgPipeline class provides an interface to the model for refining images.
- load_image function from the utils module within the diffusers library loads image data from a given source.
Declare the model
```
 model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
 pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
     model_id, 
     torch_dtype=torch.float16, 
     variant="fp16", 
     use_safetensors=True
 )
 pipe = pipe.to("cuda")
```
By calling the from_pretrained method, the pipeline takes care of the necessary setup to refine the generated images.

The model_id parameter identifies the pipeline to load. The function call above loads the "stabilityai/stable-diffusion-xl-refiner-1.0" model. The model ID can also be the path to a local directory containing model weights or a path (local or URL) to a checkpoint file.
Load the image
```
 init_img_url = "https://i.imgur.com/UikG1MN.png"
 init_image = load_image(init_img_url).convert("RGB")
```
init_img_url contains the URL of the image to refine. load_image(init_img_url).convert("RGB") loads the image and converts it to the RGB color space.
Generate an image by providing a prompt as below
```
 prompt = "A photo of an astronaut riding a horse on mars"
 image = pipe(prompt, image=init_image).images
```
Use the same prompt that generated your original image, or a prompt that describes your input image. For example, the image at the URL above was generated with the same A photo of an astronaut riding a horse on mars prompt.
Render the generated image
```
 plt.imshow(image[0])
```
The A photo of an astronaut riding a horse on mars prompt generates an image like the one below:

This refiner checkpoint can also be used as a "second-step" pipeline after running the base checkpoint to improve image quality.
You can also access the above code in the following Notebook

Stable Diffusion XL Image to Image notebook

Combine Refiner and Stable Diffusion XL Model

Diffusers also allows you to use two pipelines together. With this, you can generate an image using the Stable Diffusion XL base checkpoint and improve its quality using the refiner checkpoint.

When you use the base and refiner models together to generate an image, this is known as an ensemble of expert denoisers. The ensemble of expert denoisers approach requires fewer denoising steps in total compared to passing the base model's output to the refiner model, making it significantly faster to run. However, it's important to note that you won't be able to inspect the base model's output because it still contains a significant amount of noise.

Open a new Notebook and set its name to Stable Diffusion XL Base + Refiner
To clear GPU memory before running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shut Down All Kernels

To use the model, import the following packages

```
 import torch
 import matplotlib.pyplot as plt
 from diffusers import DiffusionPipeline
```

Declare the base model

```
 base = DiffusionPipeline.from_pretrained(
     "stabilityai/stable-diffusion-xl-base-1.0",
     torch_dtype=torch.float16,
     variant="fp16",
     use_safetensors=True
 )
 base.to("cuda")
```

Declare the refiner model

```
 refiner = DiffusionPipeline.from_pretrained(
     "stabilityai/stable-diffusion-xl-refiner-1.0",
     text_encoder_2=base.text_encoder_2,
     vae=base.vae,
     torch_dtype=torch.float16,
     use_safetensors=True,
     variant="fp16",
 )
 refiner.to("cuda")
```

Because the text encoders and variational autoencoder are the same, you don’t have to load those again for the refiner.

Define the number of inference steps and the fraction of those steps to run in the high-noise denoising stage (that is, on the base model)
```
 n_steps = 40
 high_noise_frac = 0.8
```
The base model is trained on timesteps 0-999, and the refiner is finetuned from the base model on low noise timesteps 0-199 inclusive, so use the base model for the first 800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Hence, high_noise_frac is set to 0.8 so that all steps 200-999 (the first 80% of denoising timesteps) are performed by the base model and steps 0-199 (the last 20% of denoising timesteps) are performed by the refiner model.

Remember, the denoising process starts at high value (high noise) timesteps and ends at low value (low noise) timesteps.
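
The arithmetic behind this split can be sketched in a few lines. This is an illustration of the explanation above, not the library's internal code: the constant NUM_TRAIN_TIMESTEPS is an assumed name for the 1000 training timesteps (0-999), and the exact number of steps each model runs is ultimately determined by the scheduler.

```python
# Values from this guide.
n_steps = 40
high_noise_frac = 0.8
NUM_TRAIN_TIMESTEPS = 1000  # timesteps 0-999, as described above

# Training timestep at which the base model hands over to the refiner.
handover_timestep = NUM_TRAIN_TIMESTEPS - round(NUM_TRAIN_TIMESTEPS * high_noise_frac)

# Approximate share of the inference steps that each model runs.
base_steps = round(n_steps * high_noise_frac)
refiner_steps = n_steps - base_steps

print(f"Base model: timesteps {handover_timestep}-{NUM_TRAIN_TIMESTEPS - 1}, about {base_steps} steps")
print(f"Refiner: timesteps 0-{handover_timestep - 1}, about {refiner_steps} steps")
```

With 40 inference steps and a fraction of 0.8, the base model covers roughly 32 steps and the refiner the remaining 8.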

Execute both the pipelines

```
 prompt = "A majestic lion jumping from a big stone at night"

 base_image = base(
     prompt=prompt,
     num_inference_steps=n_steps,
     denoising_end=high_noise_frac,
     output_type="latent",
 ).images

 refined_image = refiner(
     prompt=prompt,
     num_inference_steps=n_steps,
     denoising_start=high_noise_frac,
     image=base_image,
 ).images
```

Make sure to set denoising_end and denoising_start to the same values and keep num_inference_steps constant. Also remember that the output of the base model should be in latent space.
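
One way to keep these values in sync is to build the keyword arguments for both calls from a single pair of inputs. The helper below is a hypothetical convenience, not part of Diffusers; it only assembles dictionaries using the parameter names shown in the code above.

```python
def ensemble_kwargs(n_steps, high_noise_frac):
    """Build matching keyword arguments for the base and refiner calls."""
    if not 0.0 < high_noise_frac < 1.0:
        raise ValueError("high_noise_frac must be between 0 and 1")
    base_kwargs = {
        "num_inference_steps": n_steps,
        "denoising_end": high_noise_frac,
        "output_type": "latent",  # pass latents, not a decoded image, to the refiner
    }
    refiner_kwargs = {
        "num_inference_steps": n_steps,
        "denoising_start": high_noise_frac,
    }
    return base_kwargs, refiner_kwargs

base_kwargs, refiner_kwargs = ensemble_kwargs(40, 0.8)
print(base_kwargs)
print(refiner_kwargs)
```

You could then call base(prompt=prompt, **base_kwargs) and refiner(prompt=prompt, image=base_image, **refiner_kwargs), so that a change to either value applies to both pipelines.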

Render the generated image
```
 plt.imshow(refined_image[0])
```
The image generated by the prompt A majestic lion jumping from a big stone at night illustrates the difference between using the base model alone and using the combined base and refiner models.

In the combined workflow, you can't inspect the base model's output directly because it still contains a large amount of noise.

For this reason, the base model image in the comparison was generated separately using the Stable Diffusion XL base model with a manual seed, and the same manual seed was then used for the combined workflow. To reproduce images using a manual seed, follow the steps below.
Create a new generator

Run this step in the Stable Diffusion XL Base notebook, where pipe is the base model pipeline declared earlier.
```
 prompt = "A majestic lion jumping from a big stone at night"
 generator = torch.Generator("cuda").manual_seed(3078)
 image = pipe(prompt=prompt, generator=generator).images
```
The above code block passes a new generator to the pipeline. Using manual_seed() with a fixed number makes the model produce consistent output. You can pick any integer as the manual seed. When you don't pass a generator, the pipeline uses a new random seed and produces a unique image every time.
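
The effect of a fixed seed can be illustrated without a GPU. The sketch below uses Python's standard random module purely as a stand-in for torch.Generator; it is an analogy, not what the pipeline runs internally, and the function name sample_noise is hypothetical.

```python
import random

def sample_noise(seed, count=4):
    """Draw `count` Gaussian values from a generator initialized with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]

first = sample_noise(3078)
second = sample_noise(3078)
other = sample_noise(3079)

print(first == second)  # same seed, same starting noise
print(first == other)   # different seed, different starting noise
```

The same idea applies to the pipeline: identical seeds give identical starting noise, which is why the generated image is reproducible.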
Execute both pipelines with the same manual seed

Run this step in the Stable Diffusion XL Base + Refiner notebook, where base and refiner are the pipelines declared earlier.
```
 prompt = "A majestic lion jumping from a big stone at night"
 generator = torch.Generator("cuda").manual_seed(3078)

 base_image = base(
     prompt=prompt,
     generator=generator,
     num_inference_steps=n_steps,
     denoising_end=high_noise_frac,
     output_type="latent",
 ).images

 refined_image = refiner(
     prompt=prompt,
     num_inference_steps=n_steps,
     denoising_start=high_noise_frac,
     image=base_image,
 ).images
```
The above code block starts from the same noise as the base model image by reusing its manual seed. It then refines the result to improve its quality, allowing you to compare the two images.
You can also access the above code in the following Notebook

Stable Diffusion XL Base + Refiner Notebook

Mask Images by Inpainting

Inpainting is a pre-trained diffusion pipeline from the Diffusers library, designed to fill in missing or damaged regions of an image by predicting these areas based on the surrounding pixels.

Inpainting uses the same Stable Diffusion XL base model. The inpainting pipeline lets the model restore missing or damaged portions of an image while preserving its original style and content.

To generate inpainted images, you need the original image and a mask that marks the area you want to replace in the original image. Create a prompt to describe what should appear in the masked area.
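
The role of the mask can be shown with a tiny, pure-Python sketch. In Diffusers inpainting pipelines, white mask pixels are repainted and black pixels are preserved. The code below is a conceptual illustration on small grids of placeholder values, not the pipeline's actual implementation; the function names rectangle_mask and composite are hypothetical.

```python
WHITE, BLACK = 255, 0

def rectangle_mask(width, height, left, top, right, bottom):
    """Return a height x width grid that is white inside the rectangle and black elsewhere."""
    return [
        [WHITE if left <= x < right and top <= y < bottom else BLACK for x in range(width)]
        for y in range(height)
    ]

def composite(original, generated, mask):
    """Take generated pixels where the mask is white and original pixels elsewhere."""
    return [
        [g if m == WHITE else o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]

mask = rectangle_mask(width=4, height=3, left=1, top=1, right=3, bottom=3)
original = [["o"] * 4 for _ in range(3)]   # "o" marks an original pixel
generated = [["g"] * 4 for _ in range(3)]  # "g" marks a newly generated pixel

for row in composite(original, generated, mask):
    print("".join(row))
```

Only the positions covered by the white rectangle take the generated value; everything else keeps the original.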

Open a new Notebook and set its name to Stable Diffusion XL Inpaint
To clear GPU memory before running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shut Down All Kernels

To use the model, import the following packages

```
 import torch
 import matplotlib.pyplot as plt
 from diffusers import StableDiffusionXLInpaintPipeline
 from diffusers.utils import load_image
```

Declare the model

```
 model_id = "stabilityai/stable-diffusion-xl-base-1.0"
 pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
     model_id,
     torch_dtype=torch.float16,
     variant="fp16",
     use_safetensors=True
 )
 pipe.to("cuda")
```

Load the original image and the mask image

```
 init_img_url = "https://i.imgur.com/AsJ1lPf.png"
 mask_img_url = "https://i.imgur.com/liBhpAv.png"
 init_image = load_image(init_img_url).convert("RGB")
 mask_image = load_image(mask_img_url).convert("RGB")
```

Generate an image by providing a prompt as below

```
 prompt = "A majestic tiger sitting on a bench"
 image = pipe(
     prompt=prompt,
     image=init_image,
     mask_image=mask_image,
     num_inference_steps=50,
     strength=0.80,
 ).images
```

Replace A majestic tiger sitting on a bench with your desired prompt

Render the generated image
```
 plt.imshow(image[0])
```
Below is an example of the image showing how the model works:
You can also access the above code in the following Notebook

Stable Diffusion XL Inpaint Notebook

Save Generated Images

Specify the directory you created earlier to save the generated images
```
 save_directory = "/home/jupyter/notebooks/generated_images" 
```
The defined directory must exist on the disk. Verify that you created it in your user home directory.
Using a for loop, save the images
```
 # Use a separate loop variable so the image list is not overwritten
 for i, img in enumerate(image):
     img.save(f"{save_directory}/image_{i}.png")
```
The above code saves all generated images to the save_directory path using the save() method. Files are named in the format image_{i}.png, where {i} is the index of the image in the list. In the Stable Diffusion XL Base + Refiner notebook, loop over refined_image instead of image.
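
If you save images from several notebooks, a small helper can avoid repeating the loop and let each notebook use its own filename prefix. This is a sketch under stated assumptions: the function name save_images and the prefix argument are hypothetical, and the helper only relies on each item having a save() method, as the image objects returned by the pipelines do.

```python
from pathlib import Path

def save_images(images, directory, prefix="image"):
    """Save each item with its .save() method and return the written paths."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory does not exist: {directory}")
    paths = []
    for index, img in enumerate(images):
        path = directory / f"{prefix}_{index}.png"
        img.save(path)
        paths.append(path)
    return paths
```

For example, save_images(refined_image, save_directory, prefix="lion") would write lion_0.png to the directory instead of overwriting image_0.png from another notebook.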
In your terminal session, verify that the images are successfully saved to the directory.
```
 $ ls /home/jupyter/notebooks/generated_images
```
To download a copy of the generated images, you can use a secure transfer tool such as SFTP or rsync over SSH to fetch the files to your computer. Plain FTP is not encrypted, so avoid it over public networks.

Additional Parameters

Below is what each parameter used in the model pipelines does:

prompt: The input text prompt that guides the image generation process
negative_prompt: Describes what to leave out of the generated image. If not defined, you can pass negative_prompt_embeds instead. It's ignored when guidance is not in use (guidance_scale < 1)
generator: An instance of the torch.Generator class that allows you to control the random number generation
seed: The integer passed to the generator's manual_seed() method rather than to the pipeline directly. A fixed seed helps to ensure that the results are reproducible
guidance_scale: It improves adherence to text prompts and affects sample quality. Values between 7 and 8.5 work well. The default depends on the pipeline and the Diffusers version, so check the pipeline documentation rather than assuming 7.5
add_watermarker: Whether to use the invisible_watermark library to watermark output images. If not defined, it defaults to True when the package is installed; otherwise no watermarker is used
images: The attribute of the pipeline output that holds the list of all generated image objects
height: Sets the height in pixels of the generated image in the pipeline
width: Sets the width in pixels of the generated image in the pipeline
num_inference_steps: It defines the number of denoising steps in the inference process. It's recommended to set it to 50 to balance generation speed and result quality. A smaller value produces results faster at lower quality, and a larger value does the opposite
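
As an illustration of how these parameters fit together, the sketch below collects them into one dictionary that can be unpacked into a pipeline call. The helper name build_generation_kwargs and the example negative prompt are assumptions made for this example. The defaults mirror the values mentioned in this guide, and the divisibility check reflects the SDXL pipelines' requirement that height and width are multiples of 8.

```python
def build_generation_kwargs(prompt, negative_prompt=None, guidance_scale=7.5,
                            height=1024, width=1024, num_inference_steps=50):
    """Validate common generation settings and return them as keyword arguments."""
    if height % 8 != 0 or width % 8 != 0:
        raise ValueError("height and width must be divisible by 8")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    kwargs = {
        "prompt": prompt,
        "guidance_scale": guidance_scale,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
    }
    if negative_prompt is not None:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs

generation_kwargs = build_generation_kwargs(
    "Astronaut in a jungle",
    negative_prompt="blurry, low quality",  # example value, not from the guide
)
print(generation_kwargs)
```

In a notebook where pipe is already loaded, image = pipe(**generation_kwargs).images would apply all of these settings in a single call.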

Conclusion

In this guide, you generated images using the Stable Diffusion XL (SDXL) model on a Vultr A100 Cloud GPU server. Additionally, you refined generated images using the Refiner model, combined the base and Refiner models, and replaced masked image regions using an inpainting pipeline.