# Running DeepSeek R1 671B with Distributed vLLM

This tutorial guides you through configuring and running the original **DeepSeek R1 671B** using **Distributed vLLM** on a GPUStack cluster. Because the model is extremely large, distributed inference across multiple workers is usually required.

MASS-Base enables easy setup and orchestration of distributed inference using vLLM, making it possible to run massive models like DeepSeek R1 with minimal manual configuration.

## Prerequisites

Before you begin, make sure the following requirements are met:

- You have access to a sufficient number of Linux nodes, each equipped with the required GPUs. For example:

    | **GPU**          | **Number of Nodes** |
    | ---------------- | ------------------- |
    | H100/H800:8      | 2                   |
    | A100/A800-80GB:8 | 4                   |
    | A100/A800:8      | 8                   |

- High-speed interconnects such as NVLink or InfiniBand are recommended for optimal performance.
- Model files should be downloaded to the same path on each node. While MASS-Base supports on-the-fly model downloading, pre-downloading is recommended because the download can be time-consuming, depending on network speed.

!!! note

    - In this tutorial, we assume a setup of 4 nodes, each equipped with 8 A800-80GB GPUs and connected via 200G InfiniBand.
    - A100/A800 GPUs do not support the FP8 precision originally used by DeepSeek R1. Hence, we use the BF16 version from [Unsloth](https://huggingface.co/unsloth/DeepSeek-R1-BF16).

## Step 1: Install GPUStack Server

Following the [Installation](../installation/installation.md) guide, you can use the following command to start the MASS-Base server:

```bash
sudo docker run -d --name gpustack \
    --restart unless-stopped \
    -p 80:80 \
    --volume gpustack-data:/var/lib/gpustack \
    --volume /path/to/your/model:/path/to/your/model \
    gpustack/gpustack
```

!!! note

    - Replace `/path/to/your/model` with the actual path.

After the MASS-Base server is up and running, run the following command to get the initial admin password:

```bash
sudo docker exec gpustack \
    cat /var/lib/gpustack/initial_admin_password
```

## Step 2: Access MASS-Base UI

Open the MASS-Base UI at the address below and log in as the `admin` user with the password obtained in Step 1.

```
http://your_gpustack_server_ip_or_hostname
```

## Step 3: Install MASS-Base Workers

Navigate to the `Workers` page in the MASS-Base UI and click the `Add Worker` button to get the command for adding workers.
Then, on **each worker node**, run the worker-adding command to start a MASS-Base worker:

```bash
sudo docker run -d --name gpustack \
    --restart unless-stopped \
    --privileged \
    --network host \
    --volume /var/run/docker.sock:/var/run/docker.sock \
    --volume gpustack-data:/var/lib/gpustack \
    --volume /path/to/your/model:/path/to/your/model \
    --runtime nvidia \
    gpustack/gpustack \
    --server-url http://your_gpustack_server_ip_or_hostname \
    --token your_gpustack_cluster_token
```

!!! note

    - Replace the placeholder paths, IP address/hostname, and cluster token accordingly.
    - Replace `/path/to/your/model` with the actual path on your system where the DeepSeek R1 model files are stored.

After all workers are added, return to the MASS-Base UI. Navigate to the `Workers` page to verify that all workers are in the Ready state and that their GPUs are listed.

![initial-resources](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/initial-resources.png)

## Step 4: Deploy the DeepSeek R1 Model

1. Go to the `Deployments` page.
2. Click `Deploy Model`.
3. Select `Local Path` as your source.
4. Enter a name (e.g., `DeepSeek-R1`) in the `Name` field.
5. Specify the `Model Path` as the directory that contains the DeepSeek R1 model files on each worker node.
6. Ensure the `Backend` is set to `vLLM`.
7. After passing the compatibility check, click `Save` to deploy.

![deploy-model](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/deploy-model.png)

## Step 5: Monitor Deployment

You can monitor the deployment status on the `Deployments` page. Hover over `distributed across workers` to view GPU and worker usage. Click `View Logs` to see real-time logs showing model loading progress. It may take a few minutes to load the model.

![model-info](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/model-info.png)

After the model is running, navigate to the `Workers` page to check GPU utilization. By default, vLLM uses 90% of GPU memory.
You may adjust this in the model configuration settings.

![resources-loaded](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/resources-loaded.png)

## Step 6: Run Inference via Playground

Once the model is deployed and running, you can test it using the MASS-Base Playground.

1. Navigate to `Playground` -> `Chat`.
2. If only one model is deployed, it will be selected by default. Otherwise, use the dropdown menu to choose `DeepSeek-R1`.
3. Enter prompts and interact with the model.

![playground-chat](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/playground-chat.png)

You can also use the `Compare` tab to test concurrent inference scenarios.

![playground-compare](../assets/tutorials/running-deepseek-r1-671b-with-distributed-vllm/playground-compare.png)

You have now successfully deployed and run DeepSeek R1 671B using Distributed vLLM on a MASS-Base cluster. Explore the model’s performance and capabilities in your own applications.

For further assistance, feel free to reach out to the MASS-Base community or support team.
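As a rough cross-check of the node counts listed in the prerequisites, the following sketch estimates whether the BF16 weights fit in the example cluster's GPU memory. It assumes 2 bytes per parameter for BF16, the 4-node, 8 x 80 GB setup used in this tutorial, and the default 90% GPU memory share mentioned in Step 5. The variable names are illustrative, and the estimate deliberately ignores KV cache, activations, and framework overhead, so treat it as a lower bound on what is needed rather than a sizing guarantee.

```bash
#!/usr/bin/env bash
# Rough sizing check: do the BF16 weights fit in the usable GPU memory?
# All values below are assumptions taken from this tutorial's example setup.

params_billion=671     # model parameters, in billions
bytes_per_param=2      # BF16 stores each parameter in 2 bytes
nodes=4                # worker nodes
gpus_per_node=8        # GPUs per node
gpu_mem_gb=80          # memory per GPU, in GB
mem_util_percent=90    # share of GPU memory vLLM uses by default

# Billions of parameters times bytes per parameter gives gigabytes of weights.
weights_gb=$(( params_billion * bytes_per_param ))
total_gb=$(( nodes * gpus_per_node * gpu_mem_gb ))
usable_gb=$(( total_gb * mem_util_percent / 100 ))
# Whatever is left after the weights is available for KV cache and overhead.
headroom_gb=$(( usable_gb - weights_gb ))

echo "weights_gb=${weights_gb}"
echo "total_gb=${total_gb}"
echo "usable_gb=${usable_gb}"
echo "headroom_gb=${headroom_gb}"
```

With these inputs, the weights alone take about 1342 GB of the roughly 2304 GB that is usable at 90% utilization. Changing `nodes`, `gpu_mem_gb`, or `bytes_per_param` lets you repeat the estimate for a different cluster or precision.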