TL;DR
Training an ML model can sometimes be complicated to set up and replicate:
• The training might run in a notebook on a VM that you have to launch manually and shut down when the training is finished
• You may have to upload a training dataset each time you want to retrain the model
• You need to dig into your code when you want to change a single parameter
• etc.
In this article, we’ll see how we automated the training process of FastAI’s text classifiers, using Google Cloud AI Platform.
In a second article, we’ll see how we managed to deploy such models with AI Platform and TorchServe.
For who?
If you’re working on a project that requires training ML models multiple times, and you’re tired of having to run your training jobs manually, you’ve come to the right place.
If you’re tired of managing VMs for your training and just want your time to be allocated to something more interesting, like reading Medium articles, you’ve also come to the right place!
This article is dedicated to those who want to know how they can save time and resources by using AI Platform to train their ML models. We’ll see how we applied this to a project we worked on, using FastAI.
Prerequisites if you want to reproduce what we did
AI Platform is part of the Google Cloud Platform suite, as are the other services we used to automate our training pipeline. Here are the GCP services we used:
• AI Platform Training, to run the training jobs
• Google Cloud Storage (GCS), to store the training files and the trained models
• Google Container Registry (GCR), to host the training Docker image
Google Cloud SDK, Docker and Nvidia-docker need to be installed and set up on the machine where the Docker image is built. The point of installing Nvidia-docker is to be able to run the built Docker image directly on the machine’s GPU (if there is one), to ensure that there are no errors in the code and that the training will run as expected on AI Platform.
As we’ll see later in the article, the Docker image was created from the Nvidia CUDA Docker image, so the required Nvidia drivers are automatically installed when the image is built.
Context
We’ll see in this article how we automated the training of a text classifier built with FastAI, a library that allows users to create powerful models thanks to the ULMFiT method.
We already presented it in another Medium article, so I invite you to check it out if you want to know more about it.
Since what we’ll see in this article is applicable to any framework, you don’t need to be familiar with FastAI to continue reading. All you need to know is that we used a pre-trained model, in addition to a labelled training dataset, to train our text classifier.
How we set up the training with AI Platform
To automate model training with AI Platform, you need to specify which code should be run in which environment when the training command is called. The best way to do so is to create a Docker Image that contains all the training code and its environment, so AI Platform just has to create a container from this image each time you ask it to train a model. We’ll see in this part how we’ve done it.
Store all necessary files in a GCS bucket
Before creating the Docker image containing our training code, we had to think about the files that are used during the training of a FastAI text classifier model. We therefore decided to store all the files necessary for the training in a GCS bucket, separated into one folder per language, with a specific name given to each file.
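For instance, the bucket layout can look like this (the bucket, folder and file names below are purely illustrative, not the exact ones we used):
gs://my-training-bucket/fr/pretrained_model.pth
gs://my-training-bucket/fr/labelled_dataset.csv
gs://my-training-bucket/en/pretrained_model.pth
gs://my-training-bucket/en/labelled_dataset.csv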
We then implemented in our training code (as we’ll see below) a method to retrieve those required files from GCS by only specifying the target language as an argument.
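As an example, here is a minimal sketch of what such a helper can look like, assuming the google-cloud-storage client library and the illustrative bucket layout shown above (the function and file names are assumptions, not the exact ones from our repo):

# gcs_utils.py -- illustrative sketch of the GCS download helper
from pathlib import Path
from google.cloud import storage

def download_training_files(bucket_name: str, lang: str, dest_dir: str = "/root/models"):
    """Download the pre-trained model and the labelled dataset for a given language."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # One folder per language in the bucket, with fixed file names inside it
    for filename in ("pretrained_model.pth", "labelled_dataset.csv"):
        blob = bucket.blob(f"{lang}/{filename}")
        blob.download_to_filename(f"{dest_dir}/{filename}")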
Write the training code
After uploading the necessary files to GCS, we created a repo containing the code for our model training, meant to be packaged into a Docker image later on.
As you can see in the linked repo, we divided the training code into separate files (training_workflow.py, fastai_train.py, gcs_utils.py, args_getter.py and fastai_config.py) to properly handle the whole training pipeline.
We defined a file, training_workflow.py, that executes the whole training workflow as follows:
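Here is a minimal sketch of what this file can look like; the helper names (download_training_files, train_classifier, upload_model) are illustrative, not the exact functions from our repo:

# training_workflow.py -- illustrative sketch of the training workflow
from trainer.args_getter import get_args
from trainer.gcs_utils import download_training_files, upload_model
from trainer.fastai_train import train_classifier

def main():
    args = get_args()
    # 1. Retrieve the pre-trained model and the labelled dataset from GCS
    download_training_files(args.bucket_name, args.lang)
    # 2. Train the text classifier and export the trained model
    model_path = train_classifier(data_dir="/root/models", epochs=args.epochs)
    # 3. Upload the trained model (and its performance report) back to GCS
    upload_model(args.bucket_name, args.model_dir, model_path)

if __name__ == "__main__":
    main()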
fastai_train.py is the only file that directly uses FastAI methods, so if someone wanted to use another framework for their training, they’d just have to modify this file (and the content of the config file, of course).
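For reference, here is a minimal sketch of the kind of FastAI (v2) code this file can contain; the column names, hyperparameters and export file name are assumptions, and the handling of the per-language pre-trained model is omitted for brevity:

# fastai_train.py -- illustrative sketch of the FastAI-specific training code
from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM, accuracy

def train_classifier(data_dir: str, epochs: int) -> str:
    # Build DataLoaders from the labelled CSV downloaded from GCS
    dls = TextDataLoaders.from_csv(data_dir, csv_fname="labelled_dataset.csv",
                                   text_col="text", label_col="label")
    # ULMFiT-style classifier built on top of a pre-trained AWD_LSTM language model
    learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
    learn.fine_tune(epochs)
    # Export the learner so it can later be loaded with load_learner()
    export_path = f"{data_dir}/classifier.pth"
    learn.export(export_path)
    return export_path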
The next step was to create a Docker image containing everything necessary for the training to run correctly.
Create the Dockerfile
After preparing the training code, a Dockerfile needed to be created and the resulting image pushed to Google Container Registry (GCR), so that AI Platform could retrieve it and execute the training in the right environment.
Since the training of our model needed to run on a GPU, we built our image from the Nvidia CUDA Docker image, so all the necessary drivers were already installed.
# Dockerfile
FROM nvidia/cuda:10.2-devel

RUN apt-get update && apt-get install -y --no-install-recommends wget build-essential
RUN apt-get update && apt-get install -y --no-install-recommends python3-dev python3-setuptools python3-pip
RUN pip3 install pip==20.3.1

WORKDIR /root

# Create directories to contain code and downloaded model from GCS
RUN mkdir /root/trainer
RUN mkdir /root/models

# Copy requirements
COPY requirements.txt /root/requirements.txt

# Install pytorch
RUN pip3 install torch==1.8.0

# Install requirements
RUN pip3 install -r requirements.txt

# Installs google cloud sdk, this is mostly for using gsutil to export model.
RUN wget -nv https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz && \
    mkdir /root/tools && \
    tar xvzf google-cloud-sdk.tar.gz -C /root/tools && \
    rm google-cloud-sdk.tar.gz && \
    /root/tools/google-cloud-sdk/install.sh --usage-reporting=false \
        --path-update=false --bash-completion=false \
        --disable-installation-options && \
    rm -rf /root/.config/* && \
    ln -s /root/.config /config && \
    # Remove the backup directory that gcloud creates
    rm -rf /root/tools/google-cloud-sdk/.install/.backup

# Copy files
COPY trainer/fastai_train.py /root/trainer/fastai_train.py
COPY trainer/fastai_config.py /root/trainer/fastai_config.py
COPY trainer/args_getter.py /root/trainer/args_getter.py
COPY trainer/gcs_utils.py /root/trainer/gcs_utils.py
COPY trainer/training_workflow.py /root/trainer/training_workflow.py

# Path configuration
ENV PATH $PATH:/root/tools/google-cloud-sdk/bin

# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

# Authenticate to GCP
CMD gcloud auth login

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python3", "trainer/training_workflow.py"]
As you can see above, the Dockerfile executes the following steps to create the image:
• Start from the nvidia/cuda base image and install Python, pip, PyTorch and the requirements
• Install the Google Cloud SDK, mainly to be able to use gsutil to export the trained model
• Copy the training code into the /root/trainer directory
• Configure gsutil to use the default service account
• Set training_workflow.py as the entry point that AI Platform will invoke
Build the image and push it to GCR
After creating the Dockerfile, it was necessary to build the image in order to push it to GCR. As specified in the repo, various local variables needed to be defined, such as the IMAGE_URI in GCR, the REGION our project operates in, etc.
The image was built by running this command:
docker build -f Dockerfile -t $IMAGE_URI ./
Before pushing it to GCR, we wanted to ensure that everything would work fine when calling the training. Since our VM had a GPU available, we ran the image prior to pushing it to see what happened:
docker run --runtime=nvidia $IMAGE_URI --epochs 1 --bucket-name $BUCKET_NAME
This step is not necessary, but it can save you a lot of time because you’ll see directly whether there are errors in your code.
We finally pushed the image to GCR by running the following command, $IMAGE_URI being the variable referring to the URI where the image is stored in GCR:
docker push $IMAGE_URI
Run and follow the job
After following the previous steps, the training of the model was ready to be called using a simple command in the terminal. The Google Cloud SDK just needed to be enabled and local variables defined:
gcloud ai-platform jobs submit training $JOB_NAME --scale-tier BASIC_GPU --region $REGION --master-image-uri $IMAGE_URI -- --lang=fr --epochs=10 --bucket-name=$BUCKET_NAME --model-dir=$MODEL_DIR
This command asks AI Platform to retrieve the image from GCR using its IMAGE_URI, and then run the training on the GPU of a machine hosted in the region REGION.
We specified various arguments here, such as the language of the training, the number of epochs, the name of the bucket used to upload and download files in GCS, and the directory where the trained model should be stored. These are the arguments retrieved by the args_getter.py file.
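For reference, a minimal args_getter.py can be sketched with argparse as below; only the argument names come from the command above, the defaults and help texts are assumptions:

# args_getter.py -- illustrative sketch of the argument parsing
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="Train a FastAI text classifier on AI Platform")
    parser.add_argument("--lang", default="fr", help="Language of the training files in GCS")
    parser.add_argument("--epochs", type=int, default=10, help="Number of training epochs")
    parser.add_argument("--bucket-name", required=True, help="GCS bucket holding the training files")
    parser.add_argument("--model-dir", default="models", help="GCS directory where the trained model is uploaded")
    return parser.parse_args()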
After running the command, the training started and a job was created in the AI Platform console, allowing us to follow the evolution of the training and check the logs of the machine running it.
A lot of information is accessible when looking at the job in the AI Platform console
When the job is complete, the trained model is saved to a GCS bucket as a .pth file, along with a .json file containing its performance metrics.
We decided to evaluate the results for each label separately, as if we had a binary classifier for each possible label.
Since we used FastAI, the model file could directly be loaded in any environment by calling FastAI’s load_learner() method.
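For illustration, once the .pth file has been downloaded locally, loading it and classifying a piece of text can look like this (the file name is hypothetical):

from fastai.text.all import load_learner

# Load the exported classifier and classify a piece of text
learn = load_learner("classifier_fr.pth")
label, _, probabilities = learn.predict("Texte à classer")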
Key take-aways
Automating the training of our models using AI Platform allowed us to save a lot of time, and made us consider various aspects of our model training to efficiently put it into production.
Here are some take-aways we gathered from this:
What’s next?
Now that we’ve explained how we automated the training of our model, we can show you how we made it easy to call on demand to classify files.
Stay tuned for the second part of this article that will explain everything you need to know to deploy your trained classifiers using AI Platform and TorchServe!
Thanks for reading!
We hope you’ve learned something today and that it will be useful for your future ML projects. Feel free to reach out to us if you have any questions or comments on this topic.