Distributed Training on AMD Instinct™ MI325X Clusters with dstack

Updated on 11 June 2025
Run high-performance distributed LLM training on AMD Instinct™ MI325X GPUs using dstack and verl, without the complexity of Kubernetes or Slurm.

dstack is an open-source orchestrator purpose-built for AI as a streamlined alternative to Kubernetes and Slurm. It simplifies the orchestration of AI workloads across both VM and bare metal AMD clusters, so teams can focus on model development, not infrastructure.

This guide demonstrates how to use dstack to orchestrate distributed training on a cluster of Vultr’s Bare Metal instances with AMD GPUs. While dstack works with any distributed framework, this guide uses verl, an open-source framework for reinforcement learning. The goal is to train Qwen2.5-7B-Instruct to solve grade school math word problems using the GSM8K dataset, which contains 8.5K problems requiring 2–8 reasoning steps.

Prerequisites

  • To begin, install dstack by following the installation instructions (a minimal install is sketched below). Once the dstack server is up, you can initialize your workspace.
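
    For reference, a minimal sketch of installing dstack and starting the server locally (one of the options covered in the installation instructions):

    console
    $ pip install "dstack[all]" -U
    $ dstack server

    With the server running, initialize the workspace: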

    console
    $ mkdir dstack-verl-example && cd dstack-verl-example
    $ dstack init
    

Create a Fleet for Your AMD GPU Cluster

  1. Create a mi325x-fleet.dstack.yml file.

    console
    $ nano mi325x-fleet.dstack.yml
    
  2. Copy and paste the below configuration.

    yaml
    type: fleet
    name: mi325x-fleet
    
    ssh_config:
      user: root
      identity_file: ~/.ssh/id_rsa
      hosts:
        - 144.202.58.28
        - 137.220.58.52
    

    Under hosts, list the IP addresses or hostnames of the cluster nodes; dstack connects to them over SSH using the user and identity_file specified above.

    Save and close the file.

  3. Apply the fleet configuration.

    console
    $ dstack apply -f mi325x-fleet.dstack.yml
    

    While creating a fleet, dstack automatically detects AMD GPUs and checks drivers on each node. Once the fleet is created, it's marked as idle and can be used for running dev environments, tasks, and services.
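
    Optionally, you can list the fleet and its instances from the CLI to confirm the nodes are provisioned and idle (command shown for illustration):

    console
    $ dstack fleet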

Validate Interconnect with RCCL tests

Before training, it’s important to validate multi-GPU and inter-node communication. dstack makes it convenient to launch RCCL tests across your cluster using a task.

  1. Create rccl-tests.dstack.yml file for distributed task configuration.

    console
    $ nano rccl-tests.dstack.yml
    
  2. Copy and paste the below configuration.

    yaml
    type: task
    name: rccl-tests
    
    nodes: 2
    startup_order: workers-first
    stop_criteria: master-done
    
    volumes:
      - /usr/local/lib:/mnt/lib
    
    image: rocm/dev-ubuntu-22.04:6.4-complete
    env:
      - NCCL_DEBUG=INFO
      - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
    
    commands:
      - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
      - apt-get update && apt-get install -y git libopenmpi-dev openmpi-bin
      - git clone https://github.com/ROCm/rccl-tests.git
      - cd rccl-tests
      - make MPI=1 MPI_HOME=$OPEN_MPI_HOME
      - |
        if [ $DSTACK_NODE_RANK -eq 0 ]; then
          mpirun --allow-run-as-root \
            --hostfile $DSTACK_MPI_HOSTFILE \
            -n $DSTACK_GPUS_NUM \
            -N $DSTACK_GPUS_PER_NODE \
            --mca btl_tcp_if_include ens41np0 \
            -x LD_PRELOAD \
            -x NCCL_IB_HCA=bnxt_re0,... \
            -x NCCL_IB_GID_INDEX=3 \
            -x NCCL_IB_DISABLE=0 \
            ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
        else
          sleep infinity
        fi
    
    resources:
      gpu: MI325X:8
    
    • This YAML file defines a dstack task to run RCCL performance tests on a 2-node AMD MI325X GPU cluster.
    • It installs MPI and builds the RCCL test suite on each node.
    • The master node then launches an mpirun command to run all_reduce_perf across all GPUs.
    • RDMA libraries are preloaded to enable high-speed interconnects.
    • Worker nodes simply sleep to stay active while the test runs.

    Save and close the file.

  3. Run the task using the dstack apply command.

    console
    $ dstack apply -f rccl-tests.dstack.yml
    

    As the task runs, you’ll see the RCCL test output reporting the achieved bandwidth across GPUs and nodes.

Launch a Ray Cluster for Training

dstack tasks let you run any distributed workload directly using torchrun, accelerate, or other distributed launchers. However, because verl requires Ray, we first launch a Ray cluster as a task and then submit Ray jobs to it.
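
For reference, a minimal sketch of what a direct torchrun-based task could look like (the image and the train.py script are placeholders; the DSTACK_* variables are injected by dstack into multi-node tasks):

yaml
type: task
name: torchrun-example
nodes: 2

# Placeholder image: any image with a ROCm build of PyTorch works here.
image: rocm/pytorch:latest

commands:
  - |
    torchrun --nnodes=2 \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --node_rank=$DSTACK_NODE_RANK \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py  # placeholder training script

resources:
  gpu: MI325X:8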

  1. Create a ray-cluster.dstack.yml to launch Ray cluster on the fleet.

    console
    $ nano ray-cluster.dstack.yml
    
  2. Copy and paste the below configuration.

    yaml
    type: task
    name: ray-cluster-ppo
    nodes: 2
    
    env:
      - NCCL_DEBUG=TRACE
      - GPU_MAX_HW_QUEUES=2
      - TORCH_NCCL_HIGH_PRIORITY=1
      - NCCL_CHECKS_DISABLE=1
      - NCCL_IB_HCA=bnxt_re0,...
      - NCCL_IB_GID_INDEX=3
      - NCCL_CROSS_NIC=0
      - CUDA_DEVICE_MAX_CONNECTIONS=1
      - NCCL_PROTO=Simple
      - RCCL_MSCCL_ENABLE=0
      - TOKENIZERS_PARALLELISM=false
      - HSA_NO_SCRATCH_RECLAIM=1
      - HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
      - ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
      - CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
      - MODEL_PATH=Qwen/Qwen2.5-7B-Instruct
      - train_files=../data/gsm8k/train.parquet
      - test_files=../data/gsm8k/test.parquet
    
    image: verl-rocm
    
    commands:
      - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
      - pip install hf_transfer hf_xet
      - |
        if [ $DSTACK_NODE_RANK = 0 ]; then
          python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
          python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
          ray start --head --port=6379;
        else
          ray start --address=$DSTACK_MASTER_NODE_IP:6379
        fi
    
    ports:
      - 8265
    
    resources:
      gpu: MI325X:8
      shm_size: 24GB
    
    volumes:
      - /checkpoints:/checkpoints
      - /usr/local/lib:/mnt/lib
    
    Note
    You will need to build the verl-rocm Docker image as described in AMD’s ROCm blog.

    Save and close the file.
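
    Before applying the configuration, make sure the verl-rocm image is available on each node (or in a registry reachable from them). A minimal sketch of building it from the verl repository is shown below; the Dockerfile path is an assumption, so follow AMD’s ROCm blog for the exact steps:

    console
    $ git clone https://github.com/volcengine/verl.git && cd verl
    $ docker build -f docker/Dockerfile.rocm -t verl-rocm .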

  3. Run the task using the dstack apply command.

    console
    $ dstack apply -f ray-cluster.dstack.yml
    

    When the task exposes ports, the dstack apply command automatically forwards these ports to the current machine. In our case, this makes Ray's dashboard available locally at localhost:8265.
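
    You can verify the forwarded port by opening http://localhost:8265 in a browser, or with a quick request from the terminal (shown for illustration):

    console
    $ curl -I http://localhost:8265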

Submit the Training Job via Ray

  1. Install Ray locally.

    console
    $ pip install ray
    
  2. Submit the training job. The exported variables below mirror the values defined in the Ray cluster task configuration.

    console
    $ export RAY_ADDRESS=http://localhost:8265
    $ export MODEL_PATH=Qwen/Qwen2.5-7B-Instruct
    $ export train_files=../data/gsm8k/train.parquet
    $ export test_files=../data/gsm8k/test.parquet
    $ ray job submit -- python3 -m verl.trainer.main_ppo \
      data.train_files=$train_files \
      data.val_files=$test_files \
      data.train_batch_size=1024 \
      data.max_prompt_length=1024 \
      data.max_response_length=1024 \
      actor_rollout_ref.model.path=$MODEL_PATH \
      actor_rollout_ref.model.enable_gradient_checkpointing=True \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=256 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
      actor_rollout_ref.actor.fsdp_config.param_offload=False \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      critic.optim.lr=1e-5 \
      critic.model.use_remove_padding=True \
      critic.model.path=Qwen/Qwen2.5-7B-Instruct \
      critic.model.enable_gradient_checkpointing=False \
      critic.ppo_micro_batch_size_per_gpu=8 \
      critic.model.fsdp_config.param_offload=False \
      critic.model.fsdp_config.optimizer_offload=False \
      algorithm.kl_ctrl.kl_coef=0.0001 \
      trainer.critic_warmup=0 \
      trainer.logger=[console] \
      trainer.project_name='verl_example' \
      trainer.experiment_name='Qwen2.5-7B-PPO' \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=2 \
      trainer.default_local_dir=/checkpoints \
      trainer.val_before_train=False \
      trainer.save_freq=10 \
      trainer.test_freq=10 \
      trainer.total_epochs=15
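
    ray job submit prints a submission ID; in the same shell you can check the job or stream its driver logs with the Ray Jobs CLI (the ID below is a placeholder):

    console
    $ ray job list
    $ ray job logs --follow <submission-id>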
  3. Monitor GPU metrics.

    console
    $ dstack metrics -w ray-cluster-ppo
    

    Monitoring GPU utilization and cluster health is crucial during training. dstack provides real-time metrics through both the CLI and the dstack server’s UI dashboard.

RoCE Compatibility and Checkpoint Recovery

Broadcom RoCE drivers require the libbnxt_re userspace library inside the container to be compatible with the host’s Broadcom kernel driver bnxt_re. To ensure this compatibility, we mount libbnxt_re-rdmav34.so from the host and preload it using LD_PRELOAD when running MPI.
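
As a quick sanity check on each host (commands shown for illustration; they assume the rdma-core utilities are installed), you can confirm that the Broadcom RDMA devices are visible and that the userspace library exists at the path mounted into the container:

console
$ ibv_devinfo | grep bnxt_re
$ ls /usr/local/lib/libbnxt_re-rdmav34.so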

All training checkpoints are saved to an instance volume, enabling seamless recovery in case of interruptions or node failures.
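
Since /checkpoints on each host is mounted into the container and trainer.default_local_dir points at it, you can inspect saved checkpoints directly on a node, for example on the master node from the fleet configuration:

console
$ ssh root@144.202.58.28 ls /checkpoints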

Conclusion

Leveraging AMD Instinct™ MI325X GPUs, ROCm, and dstack enables seamless, scalable, and high-performance distributed LLM training across your fleet. With dstack, you can avoid the operational complexity of managing infrastructure with Kubernetes or Slurm.
