How to Use Hugging Face Diffusion Models on Vultr Cloud GPU
Introduction
Diffusers is a Hugging Face library that provides access to pre-trained diffusion models in the form of prepackaged pipelines. It offers tools for building and training diffusion models, and includes many different core neural network models used as building blocks to create new pipelines.
This article explains how you can use Hugging Face Diffusion models on a Vultr Cloud GPU server. You will use a variety of models to generate image and audio results on the server.
Prerequisites
Before you begin:
- Deploy a fresh A100 Ubuntu 22.04 Cloud GPU Server on Vultr with at least 20 GB of GPU RAM
- Access the server using SSH
- Create a non-root user with sudo privileges and switch to the account
- Update the server
Install Jupyter Notebook
Jupyter Notebook is an open-source application that offers a web-based development environment for creating documents with live code, visualizations, and equations. To run models interactively on your Vultr Cloud GPU server, install Jupyter Notebook as described in the steps below.
Install the pip package manager
$ sudo apt install python3-pip
Using pip, install the Notebook package
$ sudo pip install notebook
Open the Jupyter Notebook port 8888 through the firewall to allow access to the web interface
$ sudo ufw allow 8888
Start Jupyter Notebook
$ jupyter notebook --ip=0.0.0.0
The above command starts Jupyter Notebook and allows connections from all server interfaces as declared by 0.0.0.0. When successful, copy the generated access token displayed in your output:
[I 2023-08-10 12:57:52.455 ServerApp] Jupyter Server 2.7.0 is running at:
[I 2023-08-10 12:57:52.455 ServerApp] http://HOSTNAME:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] http://127.0.0.1:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
If the command fails to run, close your SSH session and start it again to activate Jupyter Notebook
$ exit
In a web browser such as Chrome, access Jupyter Notebook using your access token. Replace the example IP address 192.0.2.100 with your actual server IP
http://192.0.2.100:8888/tree?token=YOUR_TOKEN
Using the Models
A pipeline is a high-level interface that packages the components required to perform different predefined tasks such as image-generation, image-to-image-generation, and audio-generation. You can run a pipeline by specifying a task and letting it use the default settings for any additional parameters. It's also possible to custom-build a pipeline by specifying the model, tokenizer, and other parameters.
Examples in this article are based on image and audio generation models and cover both pipeline approaches. Before loading new models in a Notebook session, it's recommended to close and restart the IPython notebook kernel. This clears the old models from memory and frees up space for new models.
To run code in a Notebook session, add code in the code cell fields and press Ctrl + Enter, or press the Run button on the main toolbar.
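If you prefer to free GPU memory without restarting the kernel, the following is a minimal sketch that assumes a pipeline object named pipe is still loaded from an earlier cell:
import gc
import torch

del pipe                   # drop the reference to the previously loaded pipeline
gc.collect()               # run Python garbage collection
torch.cuda.empty_cache()   # release cached GPU memory back to the driver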
Stable Diffusion V2.1 Model
The Stable Diffusion v2.1 model is fine-tuned from the stable-diffusion-2 checkpoint with 55 thousand additional steps on the same dataset, followed by 155 thousand extra fine-tuning steps on 768x768 images. In this section, use the model as described in the steps below.
Open a new Jupyter Notebook file and rename it to stablediffusion
Install the required global packages
!pip install diffusers transformers accelerate scipy safetensors matplotlib
To use the model, import the following packages
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
The StableDiffusionPipeline class provides an interface to the Stable Diffusion v2.1 model for generating images. DPMSolverMultistepScheduler provides a fast scheduler that generates good outputs in around 20 steps, and torch enables support for GPU tensor computations.
Declare the model
model_id = "stabilityai/stable-diffusion-2-1" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) pipe = pipe.to("cuda")
The parameters passed to the from_pretrained() method are:
- model_id: Loads the "stabilityai/stable-diffusion-2-1" model. The model ID can also be the path to a local directory containing model weights or a path to a checkpoint file
- torch_dtype: The data type of the tensors used for pipeline computations. torch.float16 specifies that model computations run in 16-bit precision instead of the default 32-bit full precision (torch.float32). To let the system choose the optimal data type, set torch_dtype="auto"
In diffusion models, a scheduler de-noises samples by iteratively adding noise during training and updating samples based on the model outputs during inference. It defines the update rule used to solve the underlying differential equation.
Generate an image by providing a prompt as below. Replace An astronaut landing on planet with your desired prompt
prompt = "An astronaut landing on planet"
image = pipe(prompt).images
image[0]
The above code feeds the prompt to the previously declared pipeline, then stores the output in the images attribute. A different image generates each time you run the cell. You can enhance the prompt with details such as the camera lens and environment, and include any other relevant information to refine your desired outcome.
Below are the accepted image generation parameters, combined in a usage sketch after this list:
- prompt: Represents the input text prompt that guides the image generation process
- generator: An instance of the torch.Generator class that allows you to control the random number generation. Specifying the seed value ensures that the generator produces consistent and deterministic outputs when used repeatedly with the same seed
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. It improves adherence to text prompts and affects sample quality. Values between 7 and 8.5 work well, and the default value is 7.5
- images: A list of all generated image objects
- num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50, which balances generation speed and result quality. A smaller value leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time
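The following is a minimal sketch, assuming the pipe object declared above, that combines these parameters for a reproducible generation. The seed value 42 and the output file name astronaut.png are illustrative:
generator = torch.Generator(device="cuda").manual_seed(42)   # fixed seed for deterministic output
result = pipe(
    prompt="An astronaut landing on planet",
    generator=generator,
    guidance_scale=8.0,       # stronger adherence to the prompt
    num_inference_steps=30,   # fewer steps than the default 50, still good quality with DPM-Solver
)
result.images[0].save("astronaut.png")   # save the first generated image to disk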
The An astronaut landing on planet prompt generates an image like the one below:
AudioLDM Model
AudioLDM is a text-to-audio latent diffusion model (LDM) with 1.5 million training steps. The model incorporates over 700 CLAP audio dimensions and 400 million parameters. By taking a text prompt as input, it predicts the corresponding audio output, and generates realistic text-conditional sound effects, human speech, and music samples. Run the model to generate audio results as described in the steps below.
Open a new Jupyter Notebook file and rename it to audioldm
In a new code cell, install the required packages
!pip install scipy
To use the model, import the necessary packages
from diffusers import AudioLDMPipeline
import torch
Declare the pipeline
model_id = "cvssp/audioldm-m-full" pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe = pipe.to("cuda")
In the above command, the AudioLDMPipeline instance uses the pre-trained model specified by model_id. torch_dtype=torch.float16 sets the data type to 16-bit floating-point, which helps with memory efficiency and faster computations. The pipeline is then moved to the GPU using cuda for faster processing.
Generate audio by providing a prompt. Replace Piano and violin plays with your desired text prompt
prompt = "Piano and violin plays"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
In the above command, the num_inference_steps parameter specifies the number of diffusion steps (iterations) used in the generation process, and audio_length_in_s sets the desired duration of the generated audio in seconds. The resulting audio outputs to the audio variable.
Display the generated audio
from IPython.display import Audio
Audio(audio, rate=16000)
The above code block allows you to play and listen to the generated audio using the Audio function from the IPython library. The rate=16000 argument specifies the sampling rate of the audio, set to 16000 samples per second.
Save the audio to a file
import scipy
scipy.io.wavfile.write("file_name.wav", rate=16000, data=audio)
The above code saves the generated audio as a WAV file named file_name.wav using scipy.io.wavfile.write(). The specified sampling rate rate=16000 ensures that the audio saves with the correct sampling rate.
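As a quick check, the following is a minimal sketch that reads the saved file back and prints its sampling rate and duration; it assumes the file_name.wav file created above:
from scipy.io import wavfile

rate, data = wavfile.read("file_name.wav")   # load the saved waveform
print(rate, data.shape[0] / rate)            # sampling rate and duration in seconds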
When using the model, the following are the accepted parameters, combined in a usage sketch after this list.
- prompt: Represents the input text prompt that guides the audio generation process. If not defined, you need to pass prompt_embeds instead
- audio_length_in_s: Sets the value of the audio_length_in_s parameter in the pipeline. It defines the length of the generated audio sample in seconds, with a default value of 5.12 seconds
- num_inference_steps: Sets the value of num_inference_steps in the pipeline that defines the number of steps involved in the inference process. By default, it's set to 10 to balance generation speed and result quality. A smaller number of de-noising steps leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. A higher value encourages the model to generate audio that is closely linked to the text prompt at the expense of lower sound quality. It's enabled when guidance_scale is greater than 1, and the default value is 2.5
- negative_prompt: Sets the value of the negative_prompt parameter in the pipeline. It guides what to ignore in audio generation. If not defined, you need to pass negative_prompt_embeds instead. It's ignored when you're not using guidance (guidance_scale < 1)
- num_waveforms_per_prompt: Sets the value of the num_waveforms_per_prompt parameter in the pipeline. It defines the number of waveforms to generate per prompt, and the default value is 1
- eta: Sets the value of the eta parameter in the pipeline. It corresponds to the parameter eta (η) from the DDIM paper. It only applies to the DDIMScheduler and is ignored in other schedulers, with the default value set to 0.0
- return_dict: Sets the value of the return_dict parameter in the pipeline to return a StableDiffusionPipelineOutput instead of a plain tuple. The default value is True
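The following is a minimal sketch, assuming the AudioLDM pipe declared above, that passes several of these parameters in a single call. The prompt text and parameter values are illustrative:
result = pipe(
    prompt="Piano and violin plays",
    negative_prompt="low quality, distorted audio",   # steer the output away from artifacts
    audio_length_in_s=10.0,                           # ten-second clips
    num_inference_steps=25,
    guidance_scale=3.5,
    num_waveforms_per_prompt=2,                       # generate two candidate waveforms
)
audio = result.audios[0]                              # first waveform as a NumPy array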
Below are other AudioLDM variants with their respective training steps. To use one, change the model_id as shown in the sketch after this list:
- audioldm-s-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim, and 421M parameters
- audioldm-s-full-v2: More than 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim, and 421M parameters
- audioldm-m-full: 1.5M training steps, with audio conditioning, 1024 CLAP audio dim, 192 UNet dim, and 652M parameters
- audioldm-l-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 256 UNet dim, and 975M parameters
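For example, a minimal sketch that switches to the audioldm-l-full checkpoint listed above by changing the model_id and reloading the pipeline:
model_id = "cvssp/audioldm-l-full"   # larger UNet variant from the list above
pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")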
Stable Diffusion ControlNet
ControlNet is a neural network structure that controls a pre-trained image diffusion model by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on human pose estimation. It allows you to provide a conditioning image that guides and manipulates the image generation process.
It accepts scribbles, edge maps, pose key points, depth maps, segmentation maps, and normal maps as the condition input to guide the content of the generated image. In this section, apply the ControlNet model as described in the steps below.
Open a new Jupyter Notebook file and rename it to sd-controlnet
Install the necessary packages
!pip install controlnet_aux matplotlib mediapipe
To use the model, import the required packages
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image
Load an image. Replace https://example.com/image.png with your actual image source
openpose = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')
image = load_image("https://example.com/image.png")
image = openpose(image)
The above code block loads the pre-trained OpenposeDetector and processes the input image from the specified URL. The openpose object estimates the human pose in the image and returns the processed image with pose information. To read your image, make sure it has a supported file extension such as .png or .jpg.
Specify the model parameters with 16-bit weights
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
The above code loads the ControlNetModel and the StableDiffusionControlNetPipeline. torch_dtype=torch.float16 sets the data type to 16-bit floating-point for improved memory efficiency and faster computations.
Input a text prompt to generate a new image using the model. Replace Chef in kitchen with your desired prompt
pipe.enable_model_cpu_offload()
image = pipe("Chef in kitchen", image, num_inference_steps=20).images
image[0]
The above code block uses pipe to generate a new image based on the prompt Chef in kitchen and the pose-processed conditioning image. The num_inference_steps parameter sets the number of diffusion steps used in the generation process. The generated output is then stored in the image variable.
The following are the accepted model parameters, combined in a usage sketch after this list:
- prompt: Represents the input text prompt that guides the image generation process. When not defined, you have to pass prompt_embeds instead
- height: Defines the height in pixels of the generated image in the pipeline
- width: Sets the width in pixels of the generated image in the pipeline
- num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50, and balances generation speed and result quality. A smaller number of de-noising steps leads to faster results, and a larger value enhances quality at the cost of a longer generation time
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline, called Classifier-Free Diffusion Guidance. It's enabled by setting guidance_scale > 1. A higher guidance scale generates images that are closely linked to the text prompt, usually at the expense of lower image quality. Values between 7 and 8.5 work well, and the default value is 7.5
- negative_prompt: Sets the value of the negative_prompt parameter in the pipeline to specify the prompt that should not guide the image generation. When not defined, you have to pass negative_prompt_embeds instead. It's ignored when not using guidance, that is, when guidance_scale is less than 1
- num_images_per_prompt: Sets the num_images_per_prompt parameter value in the pipeline to determine the number of images to generate per prompt
- prompt_embeds: Sets the prompt_embeds value in the pipeline
- negative_prompt_embeds: Sets the value of the negative_prompt_embeds parameter in the pipeline. You can apply pre-generated negative text embeddings to tweak text inputs, for example, prompt weighting. When not provided, negative_prompt_embeds generates from the negative_prompt input argument
- output_type: Sets the output_type parameter value in the pipeline to define the output format of the generated image. You can choose between PIL.Image or np.array, with the default value set to PIL
- callback: Sets the callback parameter value in the pipeline
- callback_steps: Sets the callback_steps parameter value in the pipeline, the frequency at which the callback function returns. When not specified, the default value is 1
- controlnet_conditioning_scale: Sets the controlnet_conditioning_scale parameter value in the pipeline. ControlNet outputs multiply by controlnet_conditioning_scale before addition to the residual in the original UNet. When multiple ControlNets are specified in init, you can set the corresponding scale as a list, with the default set to 1.0
- guess_mode: Sets the guess_mode parameter value in the pipeline. In this mode, the ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. The default value is False, but it's recommended to use a guidance_scale value between 3.0 and 5.0
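The following is a minimal sketch, assuming the ControlNet pipe and the pose-processed image from the steps above, that combines several of these parameters. The seed and output file name are illustrative:
generator = torch.Generator(device="cuda").manual_seed(7)   # fixed seed for repeatable output
result = pipe(
    "Chef in kitchen",
    image,                               # the Openpose-processed conditioning image
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8,   # soften the pose conditioning slightly
    generator=generator,
)
result.images[0].save("chef.png")        # save the first generated image to disk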
The "Chef in kitchen" prompt generates an image like the one below:
Below are other variants available for the same model. To switch to a different conditioning, swap the checkpoint and conditioning image as shown in the sketch after this list:
- lllyasviel/sd-controlnet-canny: Conditioned on Canny edges
- lllyasviel/sd-controlnet-depth: Conditioned on Depth estimation
- lllyasviel/sd-controlnet-hed: Conditioned on HED Boundary
- lllyasviel/sd-controlnet-mlsd: Conditioned on M-LSD straight line detection
- lllyasviel/sd-controlnet-normal: Conditioned on Normal Map Estimation
- lllyasviel/sd-controlnet-openpose: Conditioned on Human Pose Estimation
- lllyasviel/sd-controlnet-scribble: Conditioned on Scribble images
- lllyasviel/sd-controlnet-seg: Conditioned on Image Segmentation
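For example, the following is a minimal sketch that prepares a Canny edge conditioning image for the lllyasviel/sd-controlnet-canny checkpoint. It assumes the imports from the steps above plus OpenCV (opencv-python) and NumPy are installed, and uses illustrative edge-detection thresholds:
import cv2
import numpy as np

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
source = load_image("https://example.com/image.png")             # replace with your actual image source
edges = cv2.Canny(np.array(source), 100, 200)                    # detect edges with example thresholds
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))    # convert to a 3-channel PIL image
Rebuild the StableDiffusionControlNetPipeline with this controlnet as shown earlier, then pass canny_image as the conditioning image in place of the pose-processed image.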
Conclusion
In this article, you used Hugging Face diffusion models on a Vultr Cloud GPU server to generate image and audio results. To use other diffusion models, visit the respective model card pages to learn how to use them. Additionally, studying a model's documentation provides valuable insights into its specific details and configuration options.
More Information
For more information, visit the following documentation resources: