AI Generated Images with Stable Cascade and Vultr Cloud GPU

Updated on March 17, 2024
AI Generated Images with Stable Cascade and Vultr Cloud GPU header image

Introduction

Stable Cascade is a text-to-image generation model developed by Stability AI and based on the Würstchen architecture. The model uses a three-stage architecture (Stages A, B, and C) for greater efficiency and image generation quality. Thanks to this modular design, known model extensions such as finetuning, LoRA, ControlNet, IP-Adapter, and LCM work with Stable Cascade, making it an efficient model you can adopt in multiple projects.

This guide explains how to generate AI images with the Stable Cascade model on a Vultr Cloud GPU server. You will install the required packages to create a development environment for the Stable Cascade model. Then, you will generate images with the Text-to-Image, Image-to-Image, and Image Variation methods and explore the model's performance benefits and limitations.

Prerequisites

Before you begin:

Stable Cascade Architecture

Stable Cascade consists of three models, Stage A, Stage B, and Stage C. Stages A and B handle image compression, similar to the role of the VAE (Variational Autoencoder) in Stable Diffusion, while Stage C generates the compressed latents that the other stages decode into the final resolution, as described in the following sequence of operations:

  • Stage A: The Encoder Network takes an image (often low-resolution) and extracts its key features, compressing them into a latent code with roughly a 42x data size reduction. The condensed data lives in a low-dimensional space called the latent space and serves as the input to Stage B.
  • Stage B: The Decoder Network expands the data, adds detail, and progressively fills in missing information before passing the result to the Latent Generator.
  • Stage C: The Latent Generator receives both the compressed latent code (from Stage B) and an input text prompt that describes the generated image. The stage iteratively refines the latent code, gradually transforming it into a high-resolution image that aligns with your prompt.
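
As an illustrative sketch of the compression involved, the approximate latent dimensions can be estimated from the compression factors. The factors below (roughly 42.67 for the Stage C latents and 4 for the Stage B latents) are assumptions based on default inference settings and may differ per configuration; the calculate_latent_sizes helper used later in this guide performs this calculation for you.

    Python
    import math

    # Assumed approximate compression factors (may differ per configuration)
    height, width = 1024, 1024
    compression_c, compression_b = 42.67, 4.0

    # Spatial size of the small latent that Stage C generates (~24 x 24)
    print(math.ceil(height / compression_c), math.ceil(width / compression_c))

    # Spatial size of the larger latent that Stage B reconstructs (~256 x 256)
    print(math.ceil(height / compression_b), math.ceil(width / compression_b))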

Install the Stable Cascade Model

To implement the Stable Cascade model, access the JupyterLab interface, open a new terminal session, and clone the Stable Cascade model repository to your server. Then, download the model checkpoints in the Safetensors format and install the required Python packages as described in the following steps.

  1. Open a new JupyterLab terminal session.

    Open a new JupyterLab terminal

  2. Switch to the Jupyter user home directory.

    console
    $ cd
    
  3. Create a new directory to store the model-generated images.

    console
    $ mkdir /home/jupyter/notebooks/generated_images
    
  4. Clone the Stable Cascade repository using Git.

    console
    $ git clone https://github.com/Stability-AI/StableCascade.git
    
  5. Switch to the new Stable Cascade directory.

    console
    $ cd StableCascade
    
  6. Install all necessary dependencies using the requirements.txt file.

    console
    $ pip install -r requirements.txt
    
  7. When successful, switch to the models directory.

    console
    $ cd models
    
  8. Execute the download_models.sh script with the target model variants to download the checkpoints to your server.

    console
    $ bash download_models.sh essential big-big bfloat16
    
  9. Switch back to the main Stable Cascade directory.

    console
    $ cd ..
    
  10. Print the working directory path to use in your model configurations.

    console
    $ pwd
    

    Output:

    /home/jupyter/StableCascade/

Set Up the Stable Cascade Model Environment to Perform Image Generation Tasks

To generate images with the Stable Cascade model, create a new Jupyter Notebook file to use as the implementation environment. Then, import the required modules and load the model configuration files to activate the model as described in the steps below.

  1. Create a new Jupyter Notebook Python3 Kernel file and set its name to Stable Cascade Text-to-Image.

    Create a new Jupyter notebook file

  2. Import the required packages in a new code cell. Replace /home/jupyter/StableCascade/ with your actual model directory.

    Python
    import os
    import yaml
    import torch
    import torchvision
    from tqdm import tqdm
    from PIL import Image
    
    os.chdir('/home/jupyter/StableCascade/')
    from inference.utils import *
    from core.utils import load_or_fail
    from train import WurstCoreC, WurstCoreB
    

    Below are the tasks performed by each of the imported packages:

    • os: Enables operating system-dependent functionalities such as reading and writing files to the file system.
    • yaml: Enables YAML file parsing.
    • torch: Runs general PyTorch functionalities.
    • torchvision: Includes datasets, model architectures, and image transformations for computer vision tasks.
    • tqdm: Provides a fast extensible progress bar for loops and other iterable computations.
    • PIL: The Python Imaging Library (PIL) opens, manipulates, and enables the export of multiple image file formats.
    • os.chdir(): Changes the working directory to the specified path to import functions and classes from custom modules.
    • inference.utils: Imports all functions and classes from the utils module in the inference package.
    • load_or_fail: Imports the load_or_fail function from the utils module in the core package.
    • WurstCoreC, WurstCoreB: Imports the WurstCoreC and WurstCoreB classes from the train module.
  3. Press Shift + Enter to run the code cell and import all packages.

  4. Verify the system GPU availability status.

    Python
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(device)
    

    Output:

    cuda:0

    The above code selects an NVIDIA GPU with the torch.cuda.is_available() function and displays the selected device with a print function. The output cuda:0 means the first GPU device is available and ready to use, while the code falls back to the CPU if no GPU is found.
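
    Optionally, you can confirm which GPU is attached and how much memory it offers. This is a minimal illustrative sketch using standard PyTorch calls and is not part of the model setup.

    Python
    if torch.cuda.is_available():
        # Print the GPU model name and its total memory in GiB
        props = torch.cuda.get_device_properties(0)
        print(props.name, round(props.total_memory / 1024**3, 1), "GiB")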

  5. Load the Stage C model configuration file.

    Python
    config_file = 'configs/inference/stage_c_3b.yaml'
    with open(config_file, "r", encoding="utf-8") as file:
        loaded_config = yaml.safe_load(file)
    
    core = WurstCoreC(config_dict=loaded_config, device=device, training=False)
    

    Within the above code, config_file stores the path to stage_c_3b.yaml, which contains the settings and parameters for the Stage C model. The yaml.safe_load call parses the file into a dictionary that WurstCoreC uses to initialize the stage.
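
    To see which settings the configuration defines, you can optionally print the parsed dictionary keys. This is a small illustrative sketch; the exact keys depend on the YAML file shipped with the repository.

    Python
    # List the top-level configuration options parsed from stage_c_3b.yaml
    print(sorted(loaded_config.keys()))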

  6. Load the Stage B model configuration file.

    Python
    config_file_b = 'configs/inference/stage_b_3b.yaml'
    with open(config_file_b, "r", encoding="utf-8") as file:
        config_file_b = yaml.safe_load(file)
    
    core_b = WurstCoreB(config_dict=config_file_b, device=device, training=False)
    

    Within the above code, the config_file_b variable stores the stage_b_3b.yaml configuration path that contains settings and parameters for the Stage B model.

  7. Load the model extras to set up Stage C.

    Python
    extras = core.setup_extras_pre()
    models = core.setup_models(extras)
    models.generator.eval().requires_grad_(False)
    print("STAGE C READY")
    

    The above code prepares the Stage C model with the following values:

    • setup_extras_pre(): Calls the setup method on the core WurstCoreC instance to prepare preprocessing utilities and the sampling configuration.
    • extras: Stores the returned setup objects, including the sampling configurations used later.
    • models: Contains the initialized model components, such as the generator, tokenizer, and text model.
    • models.generator.eval(): Sets the generator to evaluation mode, while .requires_grad_(False) disables gradient computation for its parameters, saving computational resources.
  8. Load the model extras to set up Stage B.

    Python
    extras_b = core_b.setup_extras_pre()
    models_b = core_b.setup_models(extras_b, skip_clip=True)
    models_b = WurstCoreB.Models(
       **{**models_b.to_dict(), 'tokenizer': models.tokenizer, 'text_model': models.text_model}
    )
    models_b.generator.bfloat16().eval().requires_grad_(False)
    print("STAGE B READY")
    

    The above code prepares the Stage B model, similarly to Stage C, with the following functions:

    • WurstCoreB.Models: Creates a new Models instance that combines the Stage B components (models_b) with the tokenizer and text model already loaded for Stage C (models).
    • models_b.generator.bfloat16(): Converts the generator parameters to 16-bit brain floating-point (bfloat16) precision.
    • .eval().requires_grad_(False): Sets the model to evaluation mode and prevents gradient computation during inference.
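
    With both stages loaded, you can optionally check how much GPU memory PyTorch has allocated. This is a minimal illustrative sketch using standard PyTorch calls and is not required for the remaining steps.

    Python
    # Report the GPU memory currently allocated by PyTorch tensors, in GiB
    allocated_gib = torch.cuda.memory_allocated(device) / 1024**3
    print(f"Allocated GPU memory: {allocated_gib:.1f} GiB")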

Generate Images from Text Prompts with Stable Cascade

  1. Define the batch size and a text prompt that describes the image to generate. Replace Anthropomorphic cat dressed as a pilot with your desired text prompt.

    Python
    batch_size = 4
    caption = "Anthropomorphic cat dressed as a pilot"
    

    In the above code, the batch_size variable is set to 4, which means the model generates 4 images in a single batch. The caption variable holds the text prompt Anthropomorphic cat dressed as a pilot that guides the image generation.

  2. Define the target image height and width.

    Python
    height, width = 1024, 1024
    stage_c_latent_shape, stage_b_latent_shape = calculate_latent_sizes(height, width, batch_size=batch_size)
    

    The above code sets the height and width variables to 1024 pixels to define the generated image dimensions. The calculate_latent_sizes function computes the latent tensor shapes for Stage C and Stage B based on the specified height, width, and batch size.
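
    To confirm the computed sizes, you can optionally print the returned shapes. This is an illustrative sketch; the exact channel counts depend on the model configuration, but the Stage C latent is much smaller than the Stage B latent.

    Python
    # Inspect the latent tensor shapes computed for each stage
    print("Stage C latent shape:", stage_c_latent_shape)
    print("Stage B latent shape:", stage_b_latent_shape)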

  3. Define the Stage C and Stage B parameters.

    Python
    # Stage C Parameters
    extras.sampling_configs['cfg'] = 4
    extras.sampling_configs['shift'] = 2
    extras.sampling_configs['timesteps'] = 20
    extras.sampling_configs['t_start'] = 1.0
    
    # Stage B Parameters
    extras_b.sampling_configs['cfg'] = 1.1
    extras_b.sampling_configs['shift'] = 1
    extras_b.sampling_configs['timesteps'] = 10
    extras_b.sampling_configs['t_start'] = 1.0
    

    The above parameters control the sampling process when generating an image: cfg sets the classifier-free guidance scale (how strongly the output follows the prompt), timesteps sets the number of denoising steps, t_start sets the starting point of the sampling schedule (1.0 means starting from pure noise), and shift adjusts the sampling schedule.

  4. Add the model conditions.

    Python
    batch = {'captions': [caption] * batch_size}
    conditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=False, eval_image_embeds=False)
    unconditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=True, eval_image_embeds=False)    
    conditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=False)
    unconditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=True)
    

    The above code prepares conditional and unconditional inputs for both Stage C and Stage B. Both sets are required for classifier-free guidance, which steers the generated images toward the given caption.

  5. Set the TOKENIZERS_PARALLELISM environment variable to false to disable tokenizer parallelism while generating the image.

    Python
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    
  6. Start the image generation process.

    Python
    with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        # Stage C: generate the compressed latents from the text conditions
        sampling_c = extras.gdf.sample(
            models.generator, conditions, stage_c_latent_shape,
            unconditions, device=device, **extras.sampling_configs,
        )
        # Step through the denoising timesteps and keep the final sample
        for (sampled_c, _, _) in tqdm(sampling_c, total=extras.sampling_configs['timesteps']):
            sampled_c = sampled_c
        # Condition Stage B on the Stage C output
        conditions_b['effnet'] = sampled_c
        unconditions_b['effnet'] = torch.zeros_like(sampled_c)
        # Stage B: reconstruct the larger latents from the Stage C output
        sampling_b = extras_b.gdf.sample(
            models_b.generator, conditions_b, stage_b_latent_shape,
            unconditions_b, device=device, **extras_b.sampling_configs
        )
        for (sampled_b, _, _) in tqdm(sampling_b, total=extras_b.sampling_configs['timesteps']):
            sampled_b = sampled_b
        # Stage A: decode the Stage B latents into pixel-space images
        sampled = models_b.stage_a.decode(sampled_b).float()
    

    The above code generates images in two stages, Stage C followed by Stage B, using the specified models, conditions, and sampling configurations. Each extras.gdf.sample call returns a generator, and the for loops step through the denoising timesteps, keeping the final sample; Stage A then decodes the Stage B output into pixel space. The torch.cuda.amp.autocast context enables automatic mixed precision (bfloat16) to speed up the model computations.

  7. View the generated images.

    Python
    show_images(sampled)
    

    Based on the input text prompt Anthropomorphic cat dressed as a pilot, the model generates a photo-realistic image with the set resolution. For each text prompt, the model generates a different image.

    Generated Image

Perform Image-to-Image Generation with Stable Cascade

Stable Cascade image-to-image generation works similarly to the text-to-image process and uses the same three-stage architecture. Instead of starting from pure noise, image-to-image encodes an input image and adds noise up to a chosen level, then denoises it under the guidance of a text prompt. Follow the steps below to add noise to an input image and generate new images with the image-to-image procedure.

  1. Click Kernel on the top navigation bar and select Shut Down All Kernels from the list of options to clear the GPU memory.

    Image of new notebook

  2. In a new code cell, define the batch size and the input image URL to use with the model. Replace https://imagizer.imageshack.com/img922/1920/iquupP.png with your desired input image URL.

    Python
    batch_size = 4
    url = "https://imagizer.imageshack.com/img922/1920/iquupP.png"
    images = resize_image(download_image(url)).unsqueeze(0).expand(batch_size, -1, -1, -1).to(device)
    
    batch = {'images': images}
    
  3. Enter a text prompt to guide the generated image, and set the target width and height to define the image dimensions. Replace a person riding a rodent and 1024 with your desired prompt and target image dimensions respectively. The noise_level value (between 0 and 1) controls how strongly the input image is noised before denoising; higher values allow the prompt to change more of the original image.

    Python
    caption = "a person riding a rodent"
    noise_level = 0.8
    height, width = 1024, 1024
    stage_c_latent_shape, stage_b_latent_shape = calculate_latent_sizes(height, width, batch_size=batch_size)
    effnet_latents = core.encode_latents(batch, models, extras)
    t = torch.ones(effnet_latents.size(0), device=device) * noise_level
    noised = extras.gdf.diffuse(effnet_latents, t=t)[0]
    
  4. Define the Stage C and Stage B parameters. Stage C resumes sampling from the noised latents (x_init) at t_start = noise_level, so its number of timesteps scales with the noise level.

    Python
    # Stage C Parameters
    extras.sampling_configs['cfg'] = 4
    extras.sampling_configs['shift'] = 2
    extras.sampling_configs['timesteps'] = int(20 * noise_level)
    extras.sampling_configs['t_start'] = noise_level
    extras.sampling_configs['x_init'] = noised
    
    # Stage B Parameters
    extras_b.sampling_configs['cfg'] = 1.1
    extras_b.sampling_configs['shift'] = 1
    extras_b.sampling_configs['timesteps'] = 10
    extras_b.sampling_configs['t_start'] = 1.0
    
  5. Prepare the image generation conditions.

    Python
    batch['captions'] = [caption] * batch_size
    
  6. Set the TOKENIZERS_PARALLELISM environment variable to false to disable parallelism.

    Python
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    
  7. Generate the images based on your input image and prompt.

    Python
    with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        conditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=False, eval_image_embeds=False)
        unconditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=True, eval_image_embeds=False)    
        conditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=False)
        unconditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=True)
        sampling_c = extras.gdf.sample(
            models.generator, conditions, stage_c_latent_shape,
            unconditions, device=device, **extras.sampling_configs,
        )
        for (sampled_c, _, _) in tqdm(sampling_c, total=extras.sampling_configs['timesteps']):
            sampled_c = sampled_c
        conditions_b['effnet'] = sampled_c
        unconditions_b['effnet'] = torch.zeros_like(sampled_c)
        sampling_b = extras_b.gdf.sample(
            models_b.generator, conditions_b, stage_b_latent_shape,
            unconditions_b, device=device, **extras_b.sampling_configs
        )
        for (sampled_b, _, _) in tqdm(sampling_b, total=extras_b.sampling_configs['timesteps']):
            sampled_b = sampled_b
        sampled = models_b.stage_a.decode(sampled_b).float()
    
  8. View the generated images.

    Python
    show_images(sampled)
    

    Based on the input image, dimensions, and the text prompt a person riding a rodent, the model outputs a final image that closely matches your input values.

    View the Stable Cascade Generated Image

Image Variation

Image variation lets the Stable Cascade model generate new versions of an input image from its image embeddings, without a base prompt. Similar to the image-to-image generation process, image variation uses an input image, but the caption stays empty and Stage C is conditioned on the image embeddings instead (eval_image_embeds=True), as described in the steps below.

  1. Click the Kernel menu option and select Shut Down All Kernels to clear the system GPU memory before running a new model process.

    Open a new notebook

  2. In a new code cell, define your target batch size and input image URL. Replace https://imagizer.imageshack.com/img923/8748/1Lo6Ii.png with your desired image URL to use with the model.

    Python
    batch_size = 4
    url = "https://imagizer.imageshack.com/img923/8748/1Lo6Ii.png"
    images = resize_image(download_image(url)).unsqueeze(0).expand(batch_size, -1, -1, -1).to(device)
    batch = {'images': images}
    
  3. Define an empty caption along with the target width and height to set the generated image dimensions.

    Python
    caption = ""
    height, width = 1024, 1024
    stage_c_latent_shape, stage_b_latent_shape = calculate_latent_sizes(height, width, batch_size=batch_size)
    
  4. Add the model Stage C and Stage B parameters.

    Python
    # Stage C Parameters
    extras.sampling_configs['cfg'] = 4
    extras.sampling_configs['shift'] = 2
    extras.sampling_configs['timesteps'] = 20
    extras.sampling_configs['t_start'] = 1.0
    
    # Stage B Parameters
    extras_b.sampling_configs['cfg'] = 1.1
    extras_b.sampling_configs['shift'] = 1
    extras_b.sampling_configs['timesteps'] = 10
    extras_b.sampling_configs['t_start'] = 1.0
    
  5. Prepare the model image generation conditions.

    Python
    batch['captions'] = [caption] * batch_size
    
  6. Disable parallelism.

    Python
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    
  7. Generate the images.

    Python
    with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        conditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=False, eval_image_embeds=True)
        unconditions = core.get_conditions(batch, models, extras, is_eval=True, is_unconditional=True, eval_image_embeds=False)    
        conditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=False)
        unconditions_b = core_b.get_conditions(batch, models_b, extras_b, is_eval=True, is_unconditional=True)
        sampling_c = extras.gdf.sample(
            models.generator, conditions, stage_c_latent_shape,
            unconditions, device=device, **extras.sampling_configs,
        )
        for (sampled_c, _, _) in tqdm(sampling_c, total=extras.sampling_configs['timesteps']):
            sampled_c = sampled_c
        conditions_b['effnet'] = sampled_c
        unconditions_b['effnet'] = torch.zeros_like(sampled_c)
        sampling_b = extras_b.gdf.sample(
            models_b.generator, conditions_b, stage_b_latent_shape,
            unconditions_b, device=device, **extras_b.sampling_configs
        )
        for (sampled_b, _, _) in tqdm(sampling_b, total=extras_b.sampling_configs['timesteps']):
            sampled_b = sampled_b
        sampled = models_b.stage_a.decode(sampled_b).float()
    
  8. View the generated images.

    Python
    show_images(sampled)
    

    Based on your input image, the Stable Cascade image variation process outputs generated images that preserve the content and style of the input image.

    A Stable Cascade Image Variation Generated Image

Save Generated Images

The Stable Cascade image generation processes above display the final images but do not save them to disk. Follow the steps below to export and save the generated images to your data directory.

  1. Import the required packages to save the generated images.

    Python
    import matplotlib.pyplot as plt
    import torchvision.utils as vutils
    
  2. Set your target server directory to save the generated images. Replace /home/jupyter/notebooks/generated_images with your desired directory path.

    Python
    save_dir = "/home/jupyter/notebooks/generated_images"
    
  3. Create the output directory if it does not already exist on the server.

    Python
    os.makedirs(save_dir, exist_ok=True)
    
  4. Create a loop that iterates over each generated image in the batch and saves it as a standalone file.

    Python
    for i in range(sampled.size(0)):
        # Extract the i-th image from the batch
        current_image = sampled[i]
    
        # Convert the tensor to a grid of images
        grid_image = vutils.make_grid(current_image, normalize=True, scale_each=True)
    
        # Convert PyTorch tensor to NumPy array and transpose the dimensions
        grid_image_np = grid_image.cpu().numpy().transpose((1, 2, 0))
    
        # Save the generated image to a file in the specified directory
        save_path = os.path.join(save_dir, f'generated_image_{i + 1}.png')
        plt.imsave(save_path, grid_image_np)
    

    Within the above code, for i in range starts a new loop that iterates through each of the generated images. For each image:

    • Extract Image: Retrieves each image from the batch using current_image = sampled[i].
    • Convert to Grid: Uses the vutils.make_grid function to convert the image tensor to a grid format suitable for visualization.
    • Convert to NumPy Array: Converts the grid tensor to a NumPy array and transposes the dimensions to match the Matplotlib format.
    • Save Image: Saves the generated image as a PNG file in the specified directory, using i + 1 in the filename so the numbering starts at 1 and each filename stays unique.
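
    As a simpler alternative sketch, torchvision's save_image helper can write each image tensor directly to a file without converting it to a NumPy array first. The filenames below follow the same numbering scheme and are only an example.

    Python
    from torchvision.utils import save_image

    # Write each generated image straight to a PNG file
    for i in range(sampled.size(0)):
        save_image(sampled[i], os.path.join(save_dir, f'generated_image_{i + 1}.png'), normalize=True)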

    To download the exported Stable Cascade images to your local machine, use a file transfer protocol such as SFTP, SCP or Rsync.

Performance Benefits

Compared with other image generation models, Stable Cascade consistently leads in prompt alignment and aesthetic quality. In particular, Stable Cascade (30 inference steps) exhibits superior performance compared to Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps) based on the initial model tests.

Image of Stable Cascade model comparison

It's important to note that the model results are specifically related to text-to-image generation.

Conclusion

You have generated AI images using the Stable Cascade model on a Vultr Cloud GPU server with the text-to-image, image-to-image, and image variation methods. Based on your deployment needs, you can use Stable Cascade with multiple input prompts and images, or finetune it for your environment. For more information and usage samples, visit the Stable Cascade model page on Hugging Face.