How to Deploy Deepseek V3 Large Language Model (LLM) Using SGLang

Updated on February 1, 2025

Deepseek V3 is a high-performance Mixture-of-Experts (MoE) language model designed for efficient inference and cost-effective training. With 671 billion parameters and advanced architectures like Multi-head Latent Attention (MLA) and DeepseekMoE, it optimizes performance, stability, and scalability. Pre-trained on 14.8 trillion tokens and fine-tuned with reinforcement learning, Deepseek V3 delivers advanced reasoning and language capabilities with remarkable efficiency.

In this article, you will use SGLang to deploy Deepseek V3 on a Vultr Cloud GPU instance powered by AMD Instinct MI300X, which provides the large amount of VRAM the model requires, and configure the model for inference. By leveraging Vultr’s high-performance cloud infrastructure, you can efficiently set up Deepseek V3 for advanced reasoning and language tasks.

Prerequisites

Deployment Steps

In this section, you will install the necessary dependencies, build a ROCm-supported container image, and deploy the SGLang inference server with Deepseek V3 on a Vultr Cloud GPU instance. You will then verify the deployment by sending an HTTP request to test the model's inference response.

  1. Install the Hugging Face Command Line Interface (CLI) package.

    console
    $ pip install huggingface_hub[cli]
    
  2. Download the Deepseek V3 model.

    console
    $ huggingface-cli download deepseek-ai/DeepSeek-V3
    

    The above command downloads the model to the $HOME/.cache/huggingface directory. Because the model is very large and is not needed until you run the container image, it is recommended to download it in the background and proceed with the next steps.
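
    For example, one way to keep the download running in the background and log its output (the filename download.log is only an illustration) is:

    console
    $ nohup huggingface-cli download deepseek-ai/DeepSeek-V3 > download.log 2>&1 &
    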

  3. Clone the SGLang inference server repository.

    console
    $ git clone https://github.com/sgl-project/sglang.git
    
  4. Build a ROCm-supported container image.

    console
    $ cd sglang/docker
    $ docker build --build-arg SGL_BRANCH=v0.4.2 -t sglang:v0.4.2-rocm620 -f Dockerfile.rocm .
    

    The above command builds a container image named sglang:v0.4.2-rocm620 using the Dockerfile.rocm file. This step may take up to 30 minutes.

    If you encounter an RPC failed; curl 56 GnuTLS recv error error while building the container image, add the following lines to the Dockerfile.rocm file before the statements that clone repositories.

    Dockerfile
    RUN git config --global http.postBuffer 1048576000
    RUN git config --global https.postBuffer 1048576000
    

    Additionally, if you face connection timeouts during the build, run the build command again. Docker caches completed build layers, so the rebuild resumes from where it failed instead of starting over.
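
    Once the build finishes, you can optionally confirm that the new image is available locally:

    console
    $ docker images sglang
    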

  5. Run the SGLang inference server container.

    console
    $ docker run -d --device=/dev/kfd --device=/dev/dri --ipc=host \
        --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
        -v $HOME/dockerx:/dockerx -v $HOME/.cache/huggingface:/root/.cache/huggingface \
        --shm-size 16G -p 30000:30000 sglang:v0.4.2-rocm620 \
        python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000
    

    The above command runs the SGLang inference server container in detached mode with ROCm support, enabling GPU access and the necessary permissions. It mounts the required directories, allocates shared memory, and starts the server on port 30000 using the DeepSeek V3 model with tensor parallelism (TP) set to 8.
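
    Because the model weights are very large, the server can take a while to load them before it starts accepting requests. You can follow the container logs to watch startup progress; the command below assumes the SGLang container is the most recently created container on the host.

    console
    $ docker logs -f $(docker ps -lq)
    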

  6. Send an HTTP request to verify the inference response.

    console
    $ curl http://localhost:30000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d "{\"model\": \"deepseek-ai/DeepSeek-V3\", \"messages\": [{\"role\": \"user\", \"content\": \"I am running Deepseek on Vultr powered by AMD Instinct MI300X. What's next?\"}], \"temperature\": 0.7}"
    
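
    The server exposes an OpenAI-compatible Chat Completions endpoint, so the response is returned as JSON. If jq is installed, you can extract only the generated text; the prompt below is just an illustration.

    console
    $ curl -s http://localhost:30000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d "{\"model\": \"deepseek-ai/DeepSeek-V3\", \"messages\": [{\"role\": \"user\", \"content\": \"What is SGLang?\"}], \"temperature\": 0.7}" \
         | jq -r '.choices[0].message.content'
    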
  7. Optional: Allow incoming connections on port 30000.

    console
    $ sudo ufw allow 30000
    

Conclusion

In this article, you deployed Deepseek V3 with SGLang on a Vultr Cloud GPU instance powered by AMD Instinct MI300X and prepared the model for inference. By leveraging Vultr’s high-performance infrastructure, you have set up an optimized environment for running Deepseek V3 efficiently. With the model now ready, you can utilize its advanced reasoning and language capabilities for various applications.