How to Deploy DeepSeek R1 Reasoning Large Language Model (LLM) Using SGLang

Updated on February 1, 2025

DeepSeek R1 is a first-generation reasoning model designed to excel in mathematical, coding, and logical reasoning tasks. It leverages reinforcement learning (RL) with a carefully integrated cold-start phase to enhance readability, coherence, and reasoning capabilities. This approach helps the model generate clear, well-structured responses while minimizing issues like repetition and language mixing. DeepSeek R1 is optimized for high-quality reasoning, making it a powerful tool for tackling complex problem-solving tasks.

In this article, you will use SGLang to deploy DeepSeek R1 on a Vultr Cloud GPU instance powered by AMD Instinct MI300X, which provides the large amount of VRAM the model requires, and configure the model for inference. By leveraging Vultr’s high-performance cloud infrastructure, you can efficiently set up DeepSeek R1 for advanced reasoning tasks.

Prerequisites

Deployment Steps

In this section, you will install the necessary dependencies, build a ROCm-supported container image, and deploy the SGLang inference server with DeepSeek R1 on a Vultr Cloud GPU instance. You will then verify the deployment by sending an HTTP request to test the model's inference response.

  1. Install the Hugging Face Command Line Interface (CLI) package.

    console
    $ pip install huggingface_hub[cli]
    
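
    Optionally, confirm that the CLI was installed correctly by printing its help output.

    console
    $ huggingface-cli --help
    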
  2. Download the DeepSeek R1 model.

    console
    $ huggingface-cli download deepseek-ai/DeepSeek-R1
    

    The above command downloads the model to the $HOME/.cache/huggingface directory. It is recommended to run the download in the background and proceed with the next steps, as the model is very large and is not needed until you run the container image, as shown below.
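
    One way to do this is to start the download with nohup and monitor its progress from a log file. The download.log file name below is only an example.

    console
    $ nohup huggingface-cli download deepseek-ai/DeepSeek-R1 > download.log 2>&1 &
    $ tail -f download.log
    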

  3. Clone the SGLang inference server repository.

    console
    $ git clone https://github.com/sgl-project/sglang.git
    
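
    Optionally, confirm that the v0.4.2 release tag used in the next step is available in the cloned repository.

    console
    $ git -C sglang tag --list "v0.4.2*"
    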
  4. Build a ROCm-supported container image.

    console
    $ cd sglang/docker
    $ docker build --build-arg SGL_BRANCH=v0.4.2 -t sglang:v0.4.2-rocm620 -f Dockerfile.rocm .
    

    The above command builds a container image named sglang:v0.4.2-rocm620 using the Dockerfile.rocm manifest. This step may take up to 30 minutes.

    If you encounter the error RPC failed; curl 56 GnuTLS recv error while building the container image, add the following lines to the Dockerfile.rocm file before the statements that clone repositories.

    Dockerfile
    RUN git config --global http.postBuffer 1048576000
    RUN git config --global https.postBuffer 1048576000
    

    Additionally, if you face connection timeouts during the build, run the build command again. Docker caches completed build layers, so a retry reuses the work that has already finished instead of starting over.
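
    When the build completes, you can verify that the new image is available locally.

    console
    $ docker images sglang
    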

  5. Run the SGLang inference server container.

    console
    $ docker run -d --device=/dev/kfd --device=/dev/dri --ipc=host \
        --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
        -v $HOME/dockerx:/dockerx -v $HOME/.cache/huggingface:/root/.cache/huggingface \
        --shm-size 16G -p 30000:30000 sglang:v0.4.2-rocm620 \
        python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000
    

    The above command runs the SGLang inference server container in detached mode with ROCm support, enabling GPU access and the necessary permissions. It mounts the required directories, allocates shared memory, and starts the server on port 30000 using the DeepSeek R1 model with tensor parallelism (TP) set to 8.
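
    Because the model weights are very large, the server may take a while to load them before it begins accepting requests. You can follow the container logs to watch the startup progress; the command below assumes the container was started from the sglang:v0.4.2-rocm620 image built earlier.

    console
    $ docker logs -f $(docker ps -q --filter ancestor=sglang:v0.4.2-rocm620)
    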

  6. Send an HTTP request to verify the inference response.

    console
    $ curl http://localhost:30000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d "{\"model\": \"deepseek-ai/DeepSeek-R1\", \"messages\": [{\"role\": \"user\", \"content\": \"I am running Deepseek on Vultr powered by AMD Instinct MI300X. What's next?\"}], \"temperature\": 0.7}"
    
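
    The server exposes an OpenAI-compatible chat completions API, so the response arrives as a JSON payload. If jq is installed on the instance, you can extract only the generated message text, assuming the standard choices[0].message.content response shape.

    console
    $ curl -s http://localhost:30000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d "{\"model\": \"deepseek-ai/DeepSeek-R1\", \"messages\": [{\"role\": \"user\", \"content\": \"What is 17 multiplied by 24?\"}], \"temperature\": 0.7}" \
         | jq -r '.choices[0].message.content'
    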
  7. Optional: Allow incoming connections on port 30000 to access the inference server from outside the instance.

    console
    $ sudo ufw allow 30000
    
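
    You can verify that the rule was added by checking the firewall status.

    console
    $ sudo ufw status
    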

Conclusion

In this article, you successfully deployed DeepSeek R1 on a Vultr Cloud GPU instance powered by AMD Instinct MI300X using SGLang and prepared the model for inference. By leveraging Vultr’s high-performance infrastructure, you have set up an optimized environment for running DeepSeek R1 efficiently. With the model now ready, you can utilize its advanced reasoning capabilities for various applications.