
MLOps: Dockerizing ML Models for GPU Inference


In the realm of MLOps (Machine Learning Operations), there are several key components to manage: model deployment, model serving, model registry, model observability, and model monitoring to name a few.

Once you've completed the extensive process of testing and evaluating your machine learning model, the next step is crucial — integrating it into your workflow or scaling it to a production or real-time environment. But how do you go about doing this? Where should your model be deployed, and how do you host it effectively?

In this tutorial, I will walk you through the process of hosting, deploying, and serving machine learning models. There are several tools and platforms like BentoML, Seldon, ZenML, and Vertex AI, among others, that simplify the process of packaging and serving models. However, for this tutorial, I will focus on packaging your model in a Docker container for GPU-accelerated inferences — a crucial technique that most of these platforms handle in the background.

Docker provides several advantages, most notably the isolation of dependencies and libraries, ensuring that your model runs in a consistent environment regardless of where it is deployed. This isolation is particularly useful for GPU-powered inference, since the required CUDA libraries can be packaged alongside the model (the host still supplies the GPU driver), making deployment more efficient and scalable. Hence, we will deploy our model in a Docker container locally.

Here is the tech stack used: Docker for containerization, FastAPI with Uvicorn for serving, and PyTorch with Hugging Face Transformers for inference, running on an NVIDIA GPU (CUDA) or Apple Silicon (MPS).

For our tutorial, we will be using NLLB by Facebook, a leaderboard machine-translation model for low-resource languages, but feel free to choose another model based on your preference.

facebook/nllb-200-distilled-600M · Hugging Face


Step 1: Setup a local directory

Create a new directory to store all the files related to the model deployment and navigate to the newly created directory.

$ mkdir model_deploy
$ cd model_deploy

Step 2: Download the model from Hugging Face

Create a Python script (e.g., deploy.py) to download the model from the Hugging Face Hub, instantiate it, and save it locally.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Enter the local directory you want to store the model in
save_path = "local/directory/you/want/to/store/the/model/in"

# Specify the model you want to download from HF
hf_model = 'facebook/nllb-200-distilled-600M'

# Instantiate the model and tokenizer (NLLB is a sequence-to-sequence model,
# so we use AutoModelForSeq2SeqLM rather than AutoModelForCausalLM)
model = AutoModelForSeq2SeqLM.from_pretrained(hf_model)
tokenizer = AutoTokenizer.from_pretrained(hf_model)

# Save the model and the tokenizer in the local directory specified earlier
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

Later on, we will mount the directory where the model is stored into the Docker container.

When using Docker, it's efficient to only pass necessary model files and configurations into the container to reduce build size and complexity.
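Before building the image, it helps to confirm that the directory you plan to mount actually contains what `save_pretrained()` wrote. Below is a small sketch; the helper `model_dir_ready` and the file list are illustrative assumptions, since exact filenames vary by model and transformers version:

```python
import os

# Core files written by save_pretrained(); exact names vary by model
# and transformers version, so treat this list as an assumption.
EXPECTED_FILES = {"config.json", "tokenizer_config.json"}

def model_dir_ready(save_path):
    """Return True if the directory holds the config files plus at least
    one weights file (.bin or .safetensors)."""
    present = set(os.listdir(save_path))
    has_weights = any(name.endswith((".bin", ".safetensors")) for name in present)
    return EXPECTED_FILES.issubset(present) and has_weights
```

Running this against your save_path before Step 7 makes a typo in the mount path fail fast, rather than at request time inside the container.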

Step 3: Build a file to pass the library dependencies

Create a requirements file (e.g., requirements.txt) and add the necessary libraries.

$ vi requirements.txt

fastapi==0.111.1
uvicorn==0.30.3
transformers==4.43.2
torch==2.4.0
pydantic==2.6.1
pyjwt==2.8.0

Step 4: Create a file for model inference

We will build the inference service with FastAPI, which is well suited to model-serving APIs thanks to its high performance, simplicity, and native support for asynchronous request handling — useful for real-time machine learning workloads. Save the file as main.py, since the Dockerfile's CMD launches main:app.


We will create 2 endpoints —

  1. status — responds with the status of the model/service
  2. translate — provides the machine-translated text

import os
import time
import uuid
from fastapi import FastAPI, Depends, HTTPException, status
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
from pydantic import BaseModel
import jwt

app = FastAPI()

model_path = "/app/model"

class InputData(BaseModel):
    text: str
    target_lang: str

def translate_nllb(model_name, input_text, target_lang_code, max_length=250):
    # Note: loading the model on every request keeps this example simple;
    # in production, load the model and tokenizer once at startup and reuse them.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Pick the best available device: CUDA on NVIDIA GPUs,
    # MPS on Apple Silicon, otherwise fall back to the CPU
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = model.to(device)
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang_code)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=forced_bos_token_id, max_length=max_length
    )
    output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
    return output

#Status endpoint
@app.get("/status")
def read_root():
    return {"status": "NLLB is running. Use the /translate endpoint for translations"}

#Translation endpoint
@app.post("/translate")
def get_prediction(input_data: InputData):
    translation = translate_nllb(model_path, input_data.text, input_data.target_lang)

    return {
        "translation": translation,
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)

Step 5: Create the Dockerfile

Create the Dockerfile with the necessary configurations.

$ vi Dockerfile

FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --default-timeout=600 -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
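Because COPY . . copies the entire build context into the image, a .dockerignore file helps keep the image small — the model weights are mounted at runtime in Step 7, not baked in. A hypothetical minimal version (adjust the names to your own layout):

```
# .dockerignore — keep the build context small
# (entries below are examples; match them to your directory layout)
nllb_model/
__pycache__/
*.pyc
.git/
```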

Step 6: Build the docker image

Run the following command to build the Docker image.

$ docker build -t nllb_dock .

Step 7: Run the container

Now run the container, mounting the directory the model was saved in (bind mounts require an absolute path) and exposing the GPU — the --gpus all flag requires the NVIDIA Container Toolkit on the host:

$ docker run -d -v /absolute/path/model/is/stored/in:/app/model --gpus all --name nllb -p 8001:8001 nllb_dock


Step 8: Call your model via the endpoint

Do a service check by calling the status endpoint.

API : http://localhost:8001/status

Response : "NLLB is running. Use the /translate endpoint for translations"

Step 9: Generate inferences from your model

Here's how you can invoke the model:

API : http://localhost:8001/translate

Request Body :
{
  "text" : "This is a test",
  "target_lang" : "ory_Orya"
}

Response :
{
  "translation": "ଏହା ଗୋଟିଏ ପରୀକ୍ଷା"
}
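The target_lang field expects FLORES-200 language codes, which NLLB uses to select the output language. A small lookup sketch with a handful of codes (the table and helper are illustrative, not an exhaustive list):

```python
# A few FLORES-200 codes accepted by NLLB as target_lang
FLORES_200_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
    "Hindi": "hin_Deva",
    "Odia": "ory_Orya",   # the code used in the example request above
}

def to_nllb_code(language):
    """Look up the NLLB target code for a language name in the sample table."""
    return FLORES_200_CODES[language]

print(to_nllb_code("Odia"))  # ory_Orya
```

The full list of supported codes is published alongside the model on its Hugging Face page.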

With that, your model deployment and model serving are complete. Now you can leverage this as a service and share it with your team or integrate it into an app — making it ready for scalable production use!

#tech-blogs