MLOps: Dockerizing ML Models for GPU Inference

In the realm of MLOps (Machine Learning Operations), there are several key components to manage: model deployment, model serving, model registry, model observability, and model monitoring to name a few.
Once you've completed the extensive process of testing and evaluating your machine learning model, the next step is crucial — integrating it into your workflow or scaling it to a production or real-time environment. But how do you go about doing this? Where should your model be deployed, and how do you host it effectively?
In this tutorial, I will walk you through the process of hosting, deploying, and serving machine learning models. There are several tools and platforms like BentoML, Seldon, ZenML, and Vertex AI, among others, that simplify the process of packaging and serving models. However, for this tutorial, I will focus on packaging your model in a Docker container for GPU-accelerated inferences — a crucial technique that most of these platforms handle in the background.
Docker provides several advantages, most notably the isolation of dependencies and libraries, ensuring that your model runs in a consistent environment regardless of where it is deployed. This isolation is particularly useful when working with GPU-powered inferences, as it ensures the required drivers and libraries are packaged along with the model, making deployment more efficient and scalable. Hence, we will be deploying our model in a Docker container locally.
Here is the hardware and tech stack used:
- OS: macOS 14.7
- RAM: 36GB
- CPU: Apple M3 Pro
- GPU: Integrated Apple M3 GPU
- Storage: 1 TB SSD
- Python
- Docker
- HuggingFace Libraries and Platform
For our tutorial, we will be using NLLB (No Language Left Behind) by Facebook, a machine translation model that leads benchmarks for low-resource languages, but feel free to choose another model based on your preference.
facebook/nllb-200-distilled-600M · Hugging Face
Step 1: Setup a local directory
Create a new directory to store all the files related to the model deployment and navigate to the newly created directory.
$ mkdir model_deploy
$ cd model_deploy
Step 2: Download the model from HuggingFace
Create a Python script (eg: deploy.py) that downloads the model from the Hugging Face repository, instantiates it, and saves it locally.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

#Enter the local directory you want to store the model in
save_path = "local/directory/you/want/to/store/the/model/in"

#Specify the model you want to download from HF
hf_model = 'facebook/nllb-200-distilled-600M'

#Instantiate the model and tokenizer (NLLB is a sequence-to-sequence model)
model = AutoModelForSeq2SeqLM.from_pretrained(hf_model)
tokenizer = AutoTokenizer.from_pretrained(hf_model)

#Save the model and the tokenizer in the local directory specified earlier
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
Later on, we will mount the directory where the model is stored into the Docker container. When using Docker, it is efficient to pass only the necessary model files and configurations into the container, to reduce build size and complexity.
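One way to keep the build context lean is a .dockerignore file in the project directory, so Docker skips files the image does not need. A minimal sketch (the entries below are illustrative; adjust them to your own layout, and note that the saved model is volume-mounted at runtime rather than baked into the image):

```
#.dockerignore: keep large or irrelevant files out of the build context
__pycache__/
.git/
#Exclude the locally saved model directory if it lives inside this project
local_model_dir/
```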
Step 3: Build a file to pass the library dependencies
Create the file (eg: requirements.txt) and add the necessary libraries:
$ vi requirements.txt
fastapi==0.111.1
uvicorn==0.30.3
transformers==4.43.2
torch==2.4.0
pydantic==2.6.1
pyjwt==2.8.0
Step 4: Create a file for model inference
Create the inference script and name it main.py (the Dockerfile in Step 5 launches the app with uvicorn main:app, so the filename matters). We will build the API with FastAPI, which is ideal for model inference thanks to its high performance, simplicity, and native support for asynchronous programming, which is crucial for handling real-time machine learning tasks.
We will create two endpoints:
- status — responds with the status of the model/service
- translate — provides the machine translated text
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

app = FastAPI()

model_path = "/app/model"

#Pick the best available device: CUDA on NVIDIA GPUs, MPS on Apple Silicon, else CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

#Load the model and tokenizer once at startup, rather than on every request
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

class InputData(BaseModel):
    text: str
    target_lang: str

def translate_nllb(input_text, target_lang_code, max_length=250):
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    #Force the decoder to start with the target language token
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang_code)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=forced_bos_token_id, max_length=max_length
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

#Status endpoint
@app.get("/status")
def read_root():
    return {"status": "NLLB is running. Use the /translate endpoint for translations"}

#Translation endpoint
@app.post("/translate")
def get_prediction(input_data: InputData):
    translation = translate_nllb(input_data.text, input_data.target_lang)
    return {
        "translation": translation,
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)
Step 5: Create the Dockerfile
Create the Dockerfile with the necessary configurations.
$ vi Dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --default-timeout=600 -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
- FROM python:3.9: Uses Python 3.9 as the base image.
- WORKDIR /app: Sets /app as the working directory inside the container where all subsequent commands will be executed.
- COPY requirements.txt .: Copies requirements.txt to the container.
- RUN pip install -r requirements.txt: Installs dependencies listed in requirements.txt.
- COPY command: Copies all files from the current directory to the container.
- CMD command: Starts Uvicorn to run the FastAPI app on port 8001.
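Note that the python:3.9 base image ships without CUDA libraries, so it is fine for CPU or for Apple Silicon hosts. If you deploy to an NVIDIA GPU host instead, one option is a CUDA-enabled base image; a sketch of such a Dockerfile (the image tag and package versions are illustrative, not tested here):

```dockerfile
#CUDA-enabled variant for NVIDIA hosts (tag is illustrative)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir --default-timeout=600 -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```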
Step 6: Build the Docker image
Run the following command to build the Docker image:
$ docker build -t nllb_dock .
- docker build: This command creates a new Docker image from a Dockerfile in the current directory.
- Tagging (-t): Assigns the name nllb_dock to the image for easy identification and reference.
- Dot (.): Specifies the build context, meaning the Dockerfile and other necessary files are located in the current directory.
Step 7: Run the container
Now run the container with the following specifications (the host path passed to -v must be an absolute path, otherwise Docker treats it as a named volume):
$ docker run -d -v /local/directory/model/is/stored/in:/app/model --gpus all --name nllb -p 8001:8001 nllb_dock
Here's a concise summary of each point:
- Detached Mode (-d): Runs the container in the background.
- Volume Mount (-v): Links a folder from your local machine to the container.
- GPU Flag (--gpus): Allows the container to access all GPUs available on the host. This requires an NVIDIA GPU and the NVIDIA Container Toolkit on a Linux host; Docker on macOS cannot expose the Apple GPU to containers, so on the Apple Silicon setup above, omit this flag and inference will run on the CPU inside the container.
- Container Name: Sets the container name to 'nllb' for easy reference.
- Port Mapping (-p): Connects port '8001' on your machine to port '8001' in the container.
- Image: Uses the 'nllb_dock' image you previously built.
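Startup can take a while because the model is loaded when the service boots. A small polling helper can confirm the service is ready before you send traffic (the function name wait_for_service and the timing defaults are my own, not part of the tutorial's stack):

```python
import time
import urllib.request

def wait_for_service(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll an HTTP endpoint until it returns 200 OK; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(interval)  # container may still be starting up; retry
    return False
```

After docker run, calling wait_for_service("http://localhost:8001/status") returns True once the model has finished loading, and False if the service never comes up within the timeout.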
Step 8: Call your model via the endpoint
Do a service check by calling the status endpoint:
API : http://localhost:8001/status
Response : "NLLB is running. Use the /translate endpoint for translations"
Step 9: Generate inferences from your model
Here's how you can invoke the model:
API : http://localhost:8001/translate
Request Body :
{
"text" : "This is a test",
"target_lang" : "ory_Orya"
}
Response :
{
"translation": "ଏହା ଗୋଟିଏ ପରୀକ୍ଷା"
}
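To call the endpoint from code rather than an API client, the request can be built with the Python standard library. A sketch (the helper name build_translate_request is my own):

```python
import json
import urllib.request

def build_translate_request(url: str, text: str, target_lang: str) -> urllib.request.Request:
    """Build a POST request carrying the JSON body the /translate endpoint expects."""
    payload = json.dumps({"text": text, "target_lang": target_lang}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

#Example, with the container running:
#req = build_translate_request("http://localhost:8001/translate", "This is a test", "ory_Orya")
#with urllib.request.urlopen(req) as resp:
#    print(json.loads(resp.read())["translation"])
```

Sending the request with urllib.request.urlopen(req) and decoding the JSON body yields the translation field shown above.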
With that, your model deployment and model serving are successfully completed. Now you can leverage this as a service and share it with your team or integrate it into an app, making it ready for scalable production use!