
A beginner’s guide to deploying Falcon-7b locally (via CPU or GPU)


Falcon-7B, a large language model (LLM) developed by the Technology Innovation Institute (TII) in the United Arab Emirates, has been a game-changer among openly available models. Trained largely on the RefinedWeb dataset and released under a permissive license, it delivers strong performance for its size, making it a practical foundation for building language applications.

But what makes Falcon-7B even more interesting is its capacity for local deployment. Deploying the model on local resources is a potential boon for organizations and institutions. In addition to keeping data on-site, which provides a crucial extra layer of security, local deployment can dramatically reduce latency, enhancing the user experience. A locally hosted model also allows for better customization and fine-tuning based on specific organizational needs.

Falcon-7B on Hugging Face


This tutorial helps you deploy Falcon-7B locally and run inference on CPU or GPU, depending on the local resources available to you.

Here are the specs of the computer that we are running Falcon-7B on:

OS: Ubuntu 22.04 LTS

RAM: 128 GB

Storage: 1 TB SSD

CPU: Intel Core i9-11900K

GPU: Nvidia GeForce RTX 3090 Ti

Using the aforementioned compute resources, a single CPU inference took roughly 20 minutes, whereas a GPU inference took roughly 10 seconds.

Step 1

Create a new directory to store all the files related to Falcon-7B and navigate into it.

$ mkdir falcon7b
$ cd falcon7b

Step 2

Create a virtual environment in the newly created directory to install and configure the required dependencies, then activate it.

$ python3 -m venv falconenv
$ source falconenv/bin/activate

Step 3

Once inside the virtual environment, install the required dependencies.

$ pip install transformers torch einops 

Step 4

Create a Python script (e.g. deploy.py) to download the model from the Hugging Face Hub, instantiate it, and store it locally.

Infer via CPU

from transformers import AutoTokenizer, AutoModelForCausalLM

# Enter the local directory you want to store the model in
save_path = "local/directory/you/want/to/store/the/model/in"

# Specify the model you want to download from HF
hf_model = 'tiiuae/falcon-7b-instruct'

# Instantiate the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(hf_model, return_dict=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(hf_model)

# Save the model and the tokenizer in the local directory specified earlier
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

Reference: Hugging Face documentation on the from_pretrained method.

Infer via GPU

If you would like to run inference on a GPU, you will also need to install the accelerate package, in addition to the other libraries, namely transformers, torch, and einops.

$ pip install accelerate

You can also check if your GPU is detected and functioning properly, using the following command:

$ nvidia-smi
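You can also confirm that PyTorch itself can see the GPU from Python. A minimal sketch (the try/except is only there so the snippet also runs on machines where torch isn't installed):

```python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        # torch isn't installed in this environment
        return False
    return torch.cuda.is_available()

if __name__ == "__main__":
    print(f"CUDA available: {cuda_available()}")
```

If this prints False while nvidia-smi works, your torch build likely lacks CUDA support.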
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Enter the local directory you want to store the model in
save_path = "local/directory/where/model/is/stored"

# Specify the model you want to download from HF
hf_model = 'tiiuae/falcon-7b'

# Load the model and store it
model = AutoModelForCausalLM.from_pretrained(hf_model, return_dict=True, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(hf_model)

# Save the model and the tokenizer in the local directory specified earlier
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

Step 5

Run the Python script and verify that the model has been downloaded locally in its entirety, by navigating to the save_path you specified earlier and checking that all of the model's files are present.

$ python deploy.py
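To sanity-check the download from Python, you can look for the files save_pretrained writes into save_path. A minimal sketch (the exact weight-shard filenames vary by model and transformers version, so this only checks for the core pieces):

```python
from pathlib import Path

def model_files_present(save_path: str) -> bool:
    """Check that a saved Hugging Face model directory has its core files."""
    p = Path(save_path)
    # Model config and tokenizer must be present
    expected = ["config.json", "tokenizer.json"]
    # Weights are saved as one or more .bin or .safetensors shards
    weight_files = list(p.glob("*.bin")) + list(p.glob("*.safetensors"))
    return all((p / name).is_file() for name in expected) and len(weight_files) > 0

if __name__ == "__main__":
    print(model_files_present("local/directory/where/model/is/stored"))
```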

Step 6

Downloading the dataset that Falcon was trained on (RefinedWeb) is not required for inference, so you can move directly to Step 7. If you would like to explore the dataset locally, make sure you have git-lfs installed and clone it; note that the full dataset is very large.

$ git lfs install  
$ git clone https://huggingface.co/datasets/tiiuae/falcon-refinedweb

Step 7

Create the following Python script; for reference, let's call it inference.py.

Infer via CPU

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

save_path = "local/directory/where/model/is/stored"

# Load the model and tokenizer from local storage
local_model = AutoModelForCausalLM.from_pretrained(save_path, return_dict=True, trust_remote_code=True)
local_tokenizer = AutoTokenizer.from_pretrained(save_path)
pipeline = transformers.pipeline(
    "text-generation",
    model=local_model,
    tokenizer=local_tokenizer,
)

# Prompt the model with the required parameters
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=local_tokenizer.eos_token_id,
)

# Output the inference
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Infer via GPU

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

save_path = "local/directory/where/model/is/stored"

# Empty the GPU cache
torch.cuda.empty_cache()

# Load the model from local storage; device_map="auto" places it on the GPU
local_model = AutoModelForCausalLM.from_pretrained(save_path, return_dict=True, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16)
local_tokenizer = AutoTokenizer.from_pretrained(save_path)
pipeline = transformers.pipeline(
    "text-generation",
    model=local_model,
    tokenizer=local_tokenizer,
    torch_dtype=torch.bfloat16,
)

# Prompt the model with the required parameters
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=local_tokenizer.eos_token_id,
)

# Output the inference
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Step 8

Run the Python script and you should get your first inference from Falcon-7B!

$ python inference.py  
Result: Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.  
Daniel: Hello, Girafatron!  
Girafatron: Hello, Daniel! I have been expecting you.  
Daniel: I’m glad to hear it. So, why are giraffes so great?  
Girafatron: They are great because they are so majestic. I love how they look so elegant, even though they are just a neck.  
Daniel: Well, they are great at eating stuff, I suppose.  
Girafatron: I have always said that the only thing that can be better than a giraffe is a girafatron.  
Daniel: A girafatron? What’s a girafatron?

That’s about it! If you want to take it a step further, you can expose the model via an API using FastAPI, or build a user interface with Chainlit to interact with it.

#tech-blogs