How to Use Hugging Face Diffusion Models on Vultr Cloud GPU
Introduction
Diffusers is a Hugging Face library that provides access to pre-trained diffusion models in the form of prepackaged pipelines. It offers tools for building and training diffusion models, and includes many different core neural network models used as building blocks to create new pipelines.
This article explains how you can use Hugging Face Diffusion models on a Vultr Cloud GPU server. You will use a variety of models to generate image and audio results on the server.
Prerequisites
Before you begin:
- Deploy a fresh A100 Ubuntu 22.04 Cloud GPU Server on Vultr with at least 20 GB of GPU RAM
- Access the server using SSH
- Create a non-root user with sudo privileges and switch to the account
- Update the server
Install Jupyter Notebook
Jupyter Notebook is an open-source application that offers a web-based development environment for creating documents with live code, visualizations, and equations. To run models interactively on your Vultr Cloud GPU server, install Jupyter Notebook as described in the steps below.
Install the pip package manager
$ sudo apt install python3-pip
Using pip, install the Notebook package
$ sudo pip install notebook
Open the Jupyter Notebook port 8888 through the firewall to allow access to the web interface
$ sudo ufw allow 8888
Start Jupyter Notebook
$ jupyter notebook --ip=0.0.0.0
The above command starts Jupyter Notebook and allows connections from all server interfaces as declared by 0.0.0.0. When successful, copy the generated access token displayed in your output:
[I 2023-08-10 12:57:52.455 ServerApp] Jupyter Server 2.7.0 is running at:
[I 2023-08-10 12:57:52.455 ServerApp] http://HOSTNAME:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] http://127.0.0.1:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
If the command fails to run, close your SSH session and start it again to activate Jupyter Notebook
$ exit
In a web browser such as Chrome, access Jupyter Notebook using your access token. Replace the example IP address 192.0.2.100 with your actual server IP
http://192.0.2.100:8888/tree?token=YOUR_TOKEN
Using the Models
A pipeline is a high-level interface that packages the components required to perform different predefined tasks such as image-generation, image-to-image-generation, and audio-generation. You can run a pipeline by specifying a task and letting it use the default settings for any additional parameters. It's also possible to custom-build a pipeline by specifying the model, tokenizer, and other parameters.
Examples in this article are based on image and audio generation models and cover both pipeline approaches. Before loading new models in a Notebook session, it's recommended to close and restart the IPython notebook kernel. This clears the old models from memory and frees up space for new models.
To run code in a Notebook session, add code in the code cell fields and press Ctrl + Enter, or press the Run button on the main toolbar.
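If you prefer to free GPU memory without restarting the kernel, the following is a minimal sketch that assumes a pipeline object named pipe is still loaded from an earlier cell:
import gc
import torch

del pipe                   # drop the reference to the previously loaded pipeline
gc.collect()               # run Python garbage collection
torch.cuda.empty_cache()   # release cached GPU memory back to the driver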
Stable Diffusion V2.1 Model
The Stable Diffusion v2.1 model is fine-tuned from the stable-diffusion-2 checkpoint with 55 thousand additional steps on the same dataset, followed by 155 thousand extra fine-tuning steps on 768x768 images. In this section, use the model as described in the steps below.
Open a new Jupyter Notebook file and rename it to stablediffusion
Install the required global packages
!pip install diffusers transformers accelerate scipy safetensors matplotlib
To use the model, import the following packages
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
The StableDiffusionPipeline class provides an interface to the Stable Diffusion v2.1 model for generating images. DPMSolverMultistepScheduler provides a fast scheduler that generates good outputs in around 20 steps, and torch enables support for GPU tensor computations.
Declare the model
model_id = "stabilityai/stable-diffusion-2-1" pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) pipe = pipe.to("cuda")
The parameters passed to the from_pretrained() method are:
- model_id: Loads the "stabilityai/stable-diffusion-2-1" model. The model ID can also be the path to a local directory containing model weights or a path to a checkpoint file
- torch_dtype: The data type of the tensors used for pipeline computations. torch.float16 specifies that model computations run in 16-bit precision instead of the default 32-bit full precision (torch.float32). To let the system choose the optimal data type, set torch_dtype="auto"
In diffusion models, a scheduler de-noises samples by iteratively adding noise during training and updating samples based on the model outputs during inference. It defines the update rule used to solve the underlying differential equation.
Generate an image by providing a prompt as below. Replace An astronaut landing on planet with your desired prompt
prompt = "An astronaut landing on planet"
image = pipe(prompt).images
image[0]
The above code feeds the prompt to the previously declared pipeline, then stores the output in the images attribute. A different image generates each time you run the cell. You can enhance the prompt with details such as the camera lens and environment, and include any other relevant information to refine your desired outcome.
Below are the accepted image generation parameters, combined in a usage sketch after this list:
- prompt: Represents the input text prompt that guides the image generation process
- generator: An instance of the torch.Generator class that allows you to control the random number generation. Specifying the seed value ensures that the generator produces consistent and deterministic outputs when used repeatedly with the same seed
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. It improves adherence to text prompts and affects sample quality. Values between 7 and 8.5 work well, and the default value is 7.5
- images: A list of all generated image objects
- num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50, which balances generation speed and result quality. A smaller value leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time
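The following is a minimal sketch, assuming the pipe object declared above, that combines these parameters for a reproducible generation. The seed value 42 and the output file name astronaut.png are illustrative:
generator = torch.Generator(device="cuda").manual_seed(42)   # fixed seed for deterministic output
result = pipe(
    prompt="An astronaut landing on planet",
    generator=generator,
    guidance_scale=8.0,       # stronger adherence to the prompt
    num_inference_steps=30,   # fewer steps than the default 50, still good quality with DPM-Solver
)
result.images[0].save("astronaut.png")   # save the first generated image to disk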
The An astronaut landing on planet prompt generates an image like the one below:
AudioLDM Model
AudioLDM is a text-to-audio latent diffusion model (LDM) with 1.5 million training steps. The model incorporates over 700 CLAP audio dimensions and 400 million parameters. By taking a text prompt as input, it predicts the corresponding audio output, and generates realistic text-conditional sound effects, human speech, and music samples. Run the model to generate audio results as described in the steps below.
Open a new Jupyter Notebook file and rename it to audioldm
In a new code cell, install the required packages
!pip install scipy
To use the model, import the necessary packages
from diffusers import AudioLDMPipeline
import torch
Declare the pipeline
model_id = "cvssp/audioldm-m-full" pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16) pipe = pipe.to("cuda")
In the above command, the AudioLDMPipeline instance uses the pre-trained model specified by model_id. torch_dtype=torch.float16 sets the data type to 16-bit floating-point, which helps with memory efficiency and faster computations. The pipeline is then moved to the GPU using cuda for faster processing.
Generate audio by providing a prompt. Replace Piano and violin plays with your desired text prompt
prompt = "Piano and violin plays"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
In the above command, the num_inference_steps parameter specifies the number of diffusion steps (iterations) used in the generation process, and audio_length_in_s sets the desired duration of the generated audio in seconds. The resulting audio outputs to the audio variable.
Display the generated audio
from IPython.display import Audio
Audio(audio, rate=16000)
The above code block allows you to play and listen to the generated audio using the Audio function from the IPython library. The rate=16000 argument specifies the sampling rate of the audio, set to 16000 samples per second.
Save the audio to a file
import scipy
scipy.io.wavfile.write("file_name.wav", rate=16000, data=audio)
The above code saves the generated audio as a WAV file named file_name.wav using scipy.io.wavfile.write(). The specified sampling rate rate=16000 ensures that the audio saves with the correct sampling rate.
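As a quick check, the following is a minimal sketch that reads the saved file back and prints its sampling rate and duration; it assumes the file_name.wav file created above:
from scipy.io import wavfile

rate, data = wavfile.read("file_name.wav")   # load the saved waveform
print(rate, data.shape[0] / rate)            # sampling rate and duration in seconds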
When using the model, the following are the accepted parameters, combined in a usage sketch after this list.
- prompt: Represents the input text prompt that guides the audio generation process. If not defined, you need to pass prompt_embeds instead
- audio_length_in_s: Sets the value of the audio_length_in_s parameter in the pipeline. It defines the length of the generated audio sample in seconds, with a default value of 5.12 seconds
- num_inference_steps: Sets the value of num_inference_steps in the pipeline that defines the number of steps involved in the inference process. By default, it's set to 10 to balance generation speed and result quality. A smaller number of de-noising steps leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. A higher value encourages the model to generate audio that is closely linked to the text prompt at the expense of lower sound quality. It's enabled when guidance_scale is greater than 1, and the default value is 2.5
- negative_prompt: Sets the value of the negative_prompt parameter in the pipeline. It guides what to ignore in audio generation. If not defined, you need to pass negative_prompt_embeds instead. It's ignored when you're not using guidance (guidance_scale < 1)
- num_waveforms_per_prompt: Sets the value of the num_waveforms_per_prompt parameter in the pipeline. It defines the number of waveforms to generate per prompt, and the default value is 1
- eta: Sets the value of the eta parameter in the pipeline. It corresponds to the parameter eta (η) from the DDIM paper. It only applies to the DDIMScheduler and is ignored in other schedulers, with the default value set to 0.0
- return_dict: Sets the value of the return_dict parameter in the pipeline to return a StableDiffusionPipelineOutput instead of a plain tuple. The default value is True
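The following is a minimal sketch, assuming the AudioLDM pipe declared above, that passes several of these parameters in a single call. The prompt text and parameter values are illustrative:
result = pipe(
    prompt="Piano and violin plays",
    negative_prompt="low quality, distorted audio",   # steer the output away from artifacts
    audio_length_in_s=10.0,                           # ten-second clips
    num_inference_steps=25,
    guidance_scale=3.5,
    num_waveforms_per_prompt=2,                       # generate two candidate waveforms
)
audio = result.audios[0]                              # first waveform as a NumPy array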
Below are other AudioLDM variants with their respective training steps. To use one, change the model_id as shown in the sketch after this list:
- audioldm-s-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim, and 421M parameters
- audioldm-s-full-v2: More than 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim, and 421M parameters
- audioldm-m-full: 1.5M training steps, with audio conditioning, 1024 CLAP audio dim, 192 UNet dim, and 652M parameters
- audioldm-l-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 256 UNet dim, and 975M parameters
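For example, a minimal sketch that switches to the audioldm-l-full checkpoint listed above by changing the model_id and reloading the pipeline:
model_id = "cvssp/audioldm-l-full"   # larger UNet variant from the list above
pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")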
Stable Diffusion ControlNet
ControlNet is a neural network structure that controls a pre-trained image diffusion model by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on human pose estimation. It allows you to provide a conditioning image that guides and manipulates the image generation process.
It accepts scribbles, edge maps, pose key points, depth maps, segmentation maps, and normal maps as the condition input to guide the content of the generated image. In this section, apply the ControlNet model as described in the steps below.
Open a new Jupyter Notebook file and rename it to sd-controlnet
Install the necessary packages
!pip install controlnet_aux matplotlib mediapipe
To use the model, import the required packages
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image
Load an image. Replace https://example.com/image.png with your actual image source
openpose = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')
image = load_image("https://example.com/image.png")
image = openpose(image)
The above code block loads the pre-trained OpenposeDetector and processes the input image from the specified URL. The openpose object estimates the human pose in the image and returns the processed image with pose information. To read your image, make sure it has a supported file extension such as .png or .jpg.
Specify the model parameters with 16-bit weights
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
The above code loads the ControlNetModel and the StableDiffusionControlNetPipeline. torch_dtype=torch.float16 sets the data type to 16-bit floating-point for improved memory efficiency and faster computations.
Input a text prompt to generate a new image using the model. Replace Chef in kitchen with your desired prompt
pipe.enable_model_cpu_offload()
image = pipe("Chef in kitchen", image, num_inference_steps=20).images
image[0]
The above code block uses pipe to generate a new image based on the prompt Chef in kitchen and the pose-processed conditioning image. The num_inference_steps parameter sets the number of diffusion steps used in the generation process. The generated output is then stored in the image variable.
The following are the accepted model parameters, combined in a usage sketch after this list:
- prompt: Represents the input text prompt that guides the image generation process. When not defined, you have to pass prompt_embeds instead
- height: Defines the height in pixels of the generated image in the pipeline
- width: Sets the width in pixels of the generated image in the pipeline
- num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50, and balances generation speed and result quality. A smaller number of de-noising steps leads to faster results, and a larger value enhances quality at the cost of a longer generation time
- guidance_scale: Sets the value of the guidance_scale parameter in the pipeline, called Classifier-Free Diffusion Guidance. It's enabled by setting guidance_scale > 1. A higher guidance scale generates images that are closely linked to the text prompt, usually at the expense of lower image quality. Values between 7 and 8.5 work well, and the default value is 7.5
- negative_prompt: Sets the value of the negative_prompt parameter in the pipeline to specify the prompt that should not guide the image generation. When not defined, you have to pass negative_prompt_embeds instead. It's ignored when not using guidance, that is, when guidance_scale is less than 1
- num_images_per_prompt: Sets the num_images_per_prompt parameter value in the pipeline to determine the number of images to generate per prompt
- prompt_embeds: Sets the prompt_embeds value in the pipeline
- negative_prompt_embeds: Sets the value of the negative_prompt_embeds parameter in the pipeline. You can apply pre-generated negative text embeddings to tweak text inputs, for example, prompt weighting. When not provided, negative_prompt_embeds generates from the negative_prompt input argument
- output_type: Sets the output_type parameter value in the pipeline to define the output format of the generated image. You can choose between PIL.Image or np.array, with the default value set to PIL
- callback: Sets the callback parameter value in the pipeline
- callback_steps: Sets the callback_steps parameter value in the pipeline, the frequency at which the callback function returns. When not specified, the default value is 1
- controlnet_conditioning_scale: Sets the controlnet_conditioning_scale parameter value in the pipeline. ControlNet outputs multiply by controlnet_conditioning_scale before addition to the residual in the original UNet. When multiple ControlNets are specified in init, you can set the corresponding scale as a list, with the default set to 1.0
- guess_mode: Sets the guess_mode parameter value in the pipeline. In this mode, the ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. The default value is False, but it's recommended to use a guidance_scale value between 3.0 and 5.0
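The following is a minimal sketch, assuming the ControlNet pipe and the pose-processed image from the steps above, that combines several of these parameters. The seed and output file name are illustrative:
generator = torch.Generator(device="cuda").manual_seed(7)   # fixed seed for repeatable output
result = pipe(
    "Chef in kitchen",
    image,                               # the Openpose-processed conditioning image
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8,   # soften the pose conditioning slightly
    generator=generator,
)
result.images[0].save("chef.png")        # save the first generated image to disk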
The "Chef in kitchen" prompt generates an image like the one below:
Below are other variants available for the same model. To switch to a different conditioning, swap the checkpoint and conditioning image as shown in the sketch after this list:
- lllyasviel/sd-controlnet-canny: Conditioned on Canny edges
- lllyasviel/sd-controlnet-depth: Conditioned on Depth estimation
- lllyasviel/sd-controlnet-hed: Conditioned on HED Boundary
- lllyasviel/sd-controlnet-mlsd: Conditioned on M-LSD straight line detection
- lllyasviel/sd-controlnet-normal: Conditioned on Normal Map Estimation
- lllyasviel/sd-controlnet-openpose: Conditioned on Human Pose Estimation
- lllyasviel/sd-controlnet-scribble: Conditioned on Scribble images
- lllyasviel/sd-controlnet-seg: Conditioned on Image Segmentation
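For example, the following is a minimal sketch that prepares a Canny edge conditioning image for the lllyasviel/sd-controlnet-canny checkpoint. It assumes the imports from the steps above plus OpenCV (opencv-python) and NumPy are installed, and uses illustrative edge-detection thresholds:
import cv2
import numpy as np

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
source = load_image("https://example.com/image.png")             # replace with your actual image source
edges = cv2.Canny(np.array(source), 100, 200)                    # detect edges with example thresholds
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))    # convert to a 3-channel PIL image
Rebuild the StableDiffusionControlNetPipeline with this controlnet as shown earlier, then pass canny_image as the conditioning image in place of the pose-processed image.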
Conclusion
In this article, you used Hugging Face diffusion models on a Vultr Cloud GPU server to generate image and audio results. To use other diffusion models, visit the respective model card pages to learn how to use them. Additionally, studying a model's documentation provides valuable insights into its specific details and configuration options.
More Information
For more information, visit the following documentation resources: