MLOps: Dockerizing ML Models for GPU Inference

In the realm of MLOps (Machine Learning Operations), there are several key components to manage: model deployment, model serving, model registry, model observability, and model monitoring to name a few.
Once you've completed the extensive process of testing and evaluating your machine learning model, the next step is crucial — integrating it into your workflow or scaling it to a production or real-time environment. But how do you go about doing this? Where should your model be deployed, and how do you host it effectively?
In this tutorial, I will walk you through the process of hosting, deploying, and serving machine learning models. There are several tools and platforms like BentoML, Seldon, ZenML, and Vertex AI, among others, that simplify the process of packaging and serving models. However, for this tutorial, I will focus on packaging your model in a Docker container for GPU-accelerated inferences — a crucial technique that most of these platforms handle in the background.
Docker provides several advantages, most notably the isolation of dependencies and libraries, ensuring that your model runs in a consistent environment regardless of where it is deployed. This isolation is particularly useful when working with GPU-powered inferences, as it ensures the required drivers and libraries are packaged along with the model, making deployment more efficient and scalable. Hence, we will be deploying our model in a Docker container locally.
Here is the hardware and tech stack used:
- OS: macOS 14.7
- RAM: 36GB
- CPU: Apple M3 Pro
- GPU: Integrated Apple M3 GPU
- Storage: 1 TB SSD
- Python
- Docker
- HuggingFace Libraries and Platform
For our tutorial, we will be using NLLB (No Language Left Behind) by Facebook, a machine translation model that leads benchmarks for low-resource languages, but feel free to choose another model based on your preference.
facebook/nllb-200-distilled-600M · Hugging Face
Step 1: Setup a local directory
Create a new directory to store all the files related to the model deployment and navigate to the newly created directory.
$ mkdir model_deploy
$ cd model_deploy
Step 2: Download the model from HuggingFace
Create a Python script (eg: deploy.py) that downloads the model from the Hugging Face repository, instantiates it, and saves it locally.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

#Enter the local directory you want to store the model in
save_path = "local/directory/you/want/to/store/the/model/in"

#Specify the model you want to download from HF
hf_model = 'facebook/nllb-200-distilled-600M'

#Instantiate the model and tokenizer (NLLB is a sequence-to-sequence model)
model = AutoModelForSeq2SeqLM.from_pretrained(hf_model)
tokenizer = AutoTokenizer.from_pretrained(hf_model)

#Save the model and the tokenizer in the local directory specified earlier
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
Later on, we will mount the directory where the model is stored into the Docker container. When using Docker, it is efficient to pass only the necessary model files and configurations into the container, to reduce build size and complexity.
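One way to keep the build context lean is a .dockerignore file in the project directory, so Docker skips files the image does not need. A minimal sketch (the entries below are illustrative; adjust them to your own layout, and note that the saved model is volume-mounted at runtime rather than baked into the image):

```
#.dockerignore: keep large or irrelevant files out of the build context
__pycache__/
.git/
#Exclude the locally saved model directory if it lives inside this project
local_model_dir/
```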
Step 3: Build a file to pass the library dependencies
Create the file (eg: requirements.txt) and add the necessary libraries:
$ vi requirements.txt
fastapi==0.111.1
uvicorn==0.30.3
transformers==4.43.2
torch==2.4.0
pydantic==2.6.1
pyjwt==2.8.0
Step 4: Create a file for model inference
Create the inference script and name it main.py (the Dockerfile in Step 5 launches the app with uvicorn main:app, so the filename matters). We will build the API with FastAPI, which is ideal for model inference thanks to its high performance, simplicity, and native support for asynchronous programming, which is crucial for handling real-time machine learning tasks.
We will create two endpoints:
- status — responds with the status of the model/service
- translate — provides the machine translated text
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

app = FastAPI()

model_path = "/app/model"

#Pick the best available device: CUDA on NVIDIA GPUs, MPS on Apple Silicon, else CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

#Load the model and tokenizer once at startup, rather than on every request
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

class InputData(BaseModel):
    text: str
    target_lang: str

def translate_nllb(input_text, target_lang_code, max_length=250):
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    #Force the decoder to start with the target language token
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang_code)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=forced_bos_token_id, max_length=max_length
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

#Status endpoint
@app.get("/status")
def read_root():
    return {"status": "NLLB is running. Use the /translate endpoint for translations"}

#Translation endpoint
@app.post("/translate")
def get_prediction(input_data: InputData):
    translation = translate_nllb(input_data.text, input_data.target_lang)
    return {
        "translation": translation,
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)
Step 5: Create the Dockerfile
Create the Dockerfile with the necessary configurations.
$ vi Dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --default-timeout=600 -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
- FROM python:3.9: Uses Python 3.9 as the base image.
- WORKDIR /app: Sets /app as the working directory inside the container where all subsequent commands will be executed.
- COPY requirements.txt .: Copies requirements.txt to the container.
- RUN pip install -r requirements.txt: Installs dependencies listed in requirements.txt.
- COPY command: Copies all files from the current directory to the container.
- CMD command: Starts Uvicorn to run the FastAPI app on port 8001.
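Note that the python:3.9 base image ships without CUDA libraries, so it is fine for CPU or for Apple Silicon hosts. If you deploy to an NVIDIA GPU host instead, one option is a CUDA-enabled base image; a sketch of such a Dockerfile (the image tag and package versions are illustrative, not tested here):

```dockerfile
#CUDA-enabled variant for NVIDIA hosts (tag is illustrative)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir --default-timeout=600 -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```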
Step 6: Build the Docker image
Run the following command to build the Docker image:
$ docker build -t nllb_dock .
- docker build: This command creates a new Docker image from a Dockerfile in the current directory.
- Tagging (-t): Assigns the name nllb_dock to the image for easy identification and reference.
- Dot (.): Specifies the build context, meaning the Dockerfile and other necessary files are located in the current directory.
Step 7: Run the container
Now run the container with the following specifications (the host path passed to -v must be an absolute path, otherwise Docker treats it as a named volume):
$ docker run -d -v /local/directory/model/is/stored/in:/app/model --gpus all --name nllb -p 8001:8001 nllb_dock
Here's a concise summary of each point:
- Detached Mode (-d): Runs the container in the background.
- Volume Mount (-v): Links a folder from your local machine to the container.
- GPU Flag (--gpus): Allows the container to access all GPUs available on the host. This requires an NVIDIA GPU and the NVIDIA Container Toolkit on a Linux host; Docker on macOS cannot expose the Apple GPU to containers, so on the Apple Silicon setup above, omit this flag and inference will run on the CPU inside the container.
- Container Name: Sets the container name to 'nllb' for easy reference.
- Port Mapping (-p): Connects port '8001' on your machine to port '8001' in the container.
- Image: Uses the 'nllb_dock' image you previously built.
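Startup can take a while because the model is loaded when the service boots. A small polling helper can confirm the service is ready before you send traffic (the function name wait_for_service and the timing defaults are my own, not part of the tutorial's stack):

```python
import time
import urllib.request

def wait_for_service(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll an HTTP endpoint until it returns 200 OK; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(interval)  # container may still be starting up; retry
    return False
```

After docker run, calling wait_for_service("http://localhost:8001/status") returns True once the model has finished loading, and False if the service never comes up within the timeout.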
Step 8: Call your model via the endpoint
Do a service check by calling the status endpoint:
API : http://localhost:8001/status
Response : "NLLB is running. Use the /translate endpoint for translations"
Step 9: Generate inferences from your model
Here's how you can invoke the model:
API : http://localhost:8001/translate
Request Body :
{
"text" : "This is a test",
"target_lang" : "ory_Orya"
}
Response :
{
"translation": "ଏହା ଗୋଟିଏ ପରୀକ୍ଷା"
}
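To call the endpoint from code rather than an API client, the request can be built with the Python standard library. A sketch (the helper name build_translate_request is my own):

```python
import json
import urllib.request

def build_translate_request(url: str, text: str, target_lang: str) -> urllib.request.Request:
    """Build a POST request carrying the JSON body the /translate endpoint expects."""
    payload = json.dumps({"text": text, "target_lang": target_lang}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

#Example, with the container running:
#req = build_translate_request("http://localhost:8001/translate", "This is a test", "ory_Orya")
#with urllib.request.urlopen(req) as resp:
#    print(json.loads(resp.read())["translation"])
```

Sending the request with urllib.request.urlopen(req) and decoding the JSON body yields the translation field shown above.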
With that, your model deployment and model serving are successfully completed. Now you can leverage this as a service and share it with your team or integrate it into an app, making it ready for scalable production use!