Train DeepRacer model locally with GPU support

DeepRacer is fun. At the same time, it can get really expensive. I ran fewer than a dozen training and simulation jobs to prepare for the race at the AWS Hong Kong Summit, and I got a bill like this.

I spent more than $200 in 14 days

While I was lining up for a chance to put my model on a real car on the track during the Hong Kong Summit, I talked to several people and they had all spent more than I did. One guy spent around $700 in just a few weeks. DeepRacer is a really nice platform for learning reinforcement learning, but I don't think it is even remotely reasonable for anyone to have to spend $100 just to understand how it works.

Note that the cost of RoboMaker is especially ridiculous, given that it just runs a simulated environment for SageMaker to interact with.

Luckily, Chris Rhodes has shared a project on GitHub for training DeepRacer models on your local PC.

In this article, I will show you how to train a model locally, upload it back to the AWS DeepRacer console, and run an evaluation using the uploaded model. I will also share my experience of using a GPU to train the model faster.

Training the model

These are the steps you need to follow to train the model locally:

  1. Set up the Python and project environment
  2. Install Docker and configure nvidia as default runtime
  3. Rebuild the Docker images
  4. Install MinIO as a local S3 provider
  5. Start SageMaker
  6. Start RoboMaker

Set up the Python environment

So let's create a virtual environment and install the AWS CLI:

conda create --name sagemaker python=3.6
conda activate sagemaker
conda install -c conda-forge awscli

Clone the project from Github

git clone --recurse-submodules https://github.com/crr0004/deepracer.git

Install the SageMaker dependencies (under the project root):

cd deepracer
pip install -U sagemaker-python-sdk/ pandas
pip install urllib3==1.24.3 #Fix some dependency issue
pip install PyYAML==3.13 #Fix some dependency issue
pip install ipython

Install Docker and configure nvidia as default runtime

Since we want to use the GPU when training the model, we need to tell Docker to use the Nvidia runtime as the default. Refer to this guide to configure Nvidia and finish the setup, or simply run the following commands:

# Update the default configuration and restart
pushd $(mktemp -d)
(sudo cat /etc/docker/daemon.json 2>/dev/null || echo '{}') | \
jq '. + {"default-runtime": "nvidia"}' | \
tee tmp.json
sudo mv tmp.json /etc/docker/daemon.json
popd
sudo systemctl restart docker

# No need for nvidia-docker or --engine=nvidia
docker run --rm -it nvidia/cuda nvidia-smi

You should see something like this if you set it up correctly

Rebuild the docker images with GPU support

The quickest way is to pull a prebuilt image:

docker pull sctse999/sagemaker-rl-tensorflow

Alternatively, you can build the image by following the steps below:

Build sagemaker-tensorflow-scriptmode:

cd sagemaker-tensorflow-container
python3 setup.py sdist
cp dist/sagemaker_tensorflow_container-2.0.0.tar.gz docker/1.12.0/
cd docker/1.12.0
wget https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
docker build -t 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12.0-gpu-py3 --build-arg py_version=3 --build-arg framework_installable=tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl -f Dockerfile.gpu .

You should see something like this when you run docker images

Next, build the sagemaker-rl-tensorflow image:

cd sagemaker-containers
python3 setup.py sdist
cp dist/sagemaker_containers-2.4.4.post2.tar.gz ../sagemaker-rl-container/
cd ../sagemaker-rl-container
docker build -t 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3 --build-arg sagemaker_container=sagemaker_containers-2.4.4.post2.tar.gz --build-arg processor=gpu -f ./coach/docker/0.11.0/Dockerfile.tf .

You may ask why we are using the tag coach0.11-cpu-py3 instead of another name. It is because this is the default image name that the RLEstimator looks for. If you want to use your own image name, that is fine too, and I will show you how later below.
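If you did build the image under a different name, one simple option (a sketch; my-sagemaker-rl:gpu is a made-up example tag, not something from this project) is to retag your image to the default name that the RLEstimator expects:

# Replace my-sagemaker-rl:gpu with whatever tag you actually built
docker tag my-sagemaker-rl:gpu 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3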

Install MinIO as a local S3 service

SageMaker and RoboMaker exchange the model and training files through S3, so we need MinIO, a local S3-compatible server, to stand in for it. While you can run MinIO as a Docker container, I would recommend running it as a plain Ubuntu service; I had some bad experiences running it as a container.

Download the executable from the official site:

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin

Put the config file in place

sudo vi /etc/default/minio

Paste the following into the file. Remember to replace <<YOUR_PROJECT_PATH>> with the correct path.

# Volume to be used for MinIO server.
MINIO_VOLUMES="<<YOUR_PROJECT_PATH>>/data"
# Access Key of the server.
MINIO_ACCESS_KEY=minio
# Secret key of the server.
MINIO_SECRET_KEY=miniokey
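Before setting up the service, you can optionally sanity-check the binary by running MinIO in the foreground with the same values (this quick check is my own addition, not part of the original setup; stop it with Ctrl+C):

# Run once in the foreground to verify the binary and the data path
MINIO_ACCESS_KEY=minio MINIO_SECRET_KEY=miniokey minio server <<YOUR_PROJECT_PATH>>/data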

Set up MinIO as a service:

curl -O https://raw.githubusercontent.com/minio/minio-service/master/linux-systemd/minio.service
sudo mv minio.service /etc/systemd/system
sudo systemctl enable minio.service
sudo systemctl start minio.service

Note: for Ubuntu 16.04 LTS, you need to change the user and group in minio.service to ubuntu.
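For reference, the lines to look at in /etc/systemd/system/minio.service are roughly the following (an excerpt from memory, so treat it as a sketch; the upstream unit file may use a different default user such as minio-user):

[Service]
# Change these two lines on Ubuntu 16.04 LTS
User=ubuntu
Group=ubuntu
EnvironmentFile=-/etc/default/minio
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES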

Go to http://localhost:9000 to check the installation. Log in using the credentials you set above (i.e. minio/miniokey) and you should see this.

Use the button on the bottom right to create a bucket

Create a bucket named bucket.
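If you prefer the command line to the web UI, you can also create the bucket with the AWS CLI pointed at MinIO (a sketch that reuses the credentials configured above):

# Use the MinIO credentials as AWS credentials for this shell
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=miniokey
aws --endpoint-url http://localhost:9000 s3 mb s3://bucket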

Start SageMaker

Copy the provided config.yaml into the SageMaker configuration directory:

mkdir -p ~/.sagemaker && cp config.yaml ~/.sagemaker

Create a docker network named sagemaker-local

docker network create sagemaker-local

Check the subnet of the docker network

docker network inspect sagemaker-local
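The value you need next is the network's gateway address. If you do not want to read the JSON by eye, you can pull it out directly (a sketch using docker's built-in --format flag):

# Print only the gateway IP of the sagemaker-local network
docker network inspect sagemaker-local --format '{{(index .IPAM.Config 0).Gateway}}'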

Edit rl_coach/env.sh, comment out the existing S3_ENDPOINT_URL line, and point it at the gateway of the sagemaker-local network instead:

# export S3_ENDPOINT_URL=http://$(hostname -i):9000
export S3_ENDPOINT_URL=http://172.18.0.1:9000

Source the env.sh

cd rl_coach
source env.sh

Before you start SageMaker, copy your local reward function and model_metadata.json to S3 (MinIO). If you want, you can edit the reward function before copying it to S3.

aws --endpoint-url $S3_ENDPOINT_URL s3 cp ../custom_files s3://bucket/custom_files  --recursive
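To confirm the files actually landed in the local bucket, you can list them before starting training (same bucket name as above):

aws --endpoint-url $S3_ENDPOINT_URL s3 ls s3://bucket/custom_files/ --recursive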

Optionally, to use the GPU image you built above, comment out the image_name line in rl_deepracer_coach_robomaker.py:

base_job_name=job_name_prefix,
# image_name="crr0004/sagemaker-rl-tensorflow:console",
train_max_run=job_duration_in_seconds, # Maximum runtime in second

The RLEstimator will look for the default image, which is named 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3

Run SageMaker

python rl_deepracer_coach_robomaker.py

If everything runs smoothly, you should see something like this:

At this point, SageMaker is waiting for RoboMaker to start the training.

Start RoboMaker

Edit robomaker.env in the project root and pick the track you want to train on via WORLD_NAME:

# WORLD_NAME=Tokyo_Training_track
WORLD_NAME=reinvent_base

In addition, if your docker network's gateway is different from 172.18.0.1, remember to change the S3_ENDPOINT_URL parameter as well.

S3_ENDPOINT_URL=http://172.18.0.1:9000
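To double-check that containers on the sagemaker-local network can reach MinIO at that address, one quick check (my own addition, using MinIO's standard health endpoint) is:

# Should print 200 if MinIO is reachable from inside the docker network
docker run --rm --network sagemaker-local curlimages/curl -s -o /dev/null -w "%{http_code}\n" http://172.18.0.1:9000/minio/health/live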

If everything is good, run this command under the project directory

docker run --rm --name dr --env-file ./robomaker.env --network sagemaker-local -p 8080:5900 -it crr0004/deepracer_robomaker:console

If you see something like this, congratulations, you made it!

Meanwhile, you should see SageMaker outputting something like this:

You can also watch the simulation by connecting a VNC client to RoboMaker. The docker run command above maps container port 5900 to host port 8080, which corresponds to VNC display 2180:

gvncviewer localhost:2180

Submit the locally trained model to DeepRacer Console
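The exact S3 location depends on how your console models are stored, but as a rough sketch (<<JOB_PREFIX>> is whatever prefix your training job wrote under in the local bucket, and <<YOUR_S3_BUCKET>>/<<PREFIX>> are placeholders for wherever your console expects the model), copying the artifacts out of MinIO into real S3 looks something like this:

# Pull the trained model artifacts out of the local MinIO bucket
aws --endpoint-url $S3_ENDPOINT_URL s3 sync s3://bucket/<<JOB_PREFIX>>/model ./local-model
# Push them to a real S3 bucket using your normal AWS credentials (no --endpoint-url)
aws s3 sync ./local-model s3://<<YOUR_S3_BUCKET>>/<<PREFIX>>/model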

After uploading the model, go back to the DeepRacer console and start an evaluation. It works.

After the evaluation is finished, you could also download the model and supposedly put it on a DeepRacer to run on a physical track. However, because I don't have a DeepRacer vehicle yet (I ordered one back in November 2018 and AWS still has not delivered it after more than 7 months!), I cannot confirm whether this works. Let me know if you have tried it and I will add the details here so everyone knows.

Other Issues

When training locally, my GPU eventually ran out of memory (OOM). I decided to try a p3.2xlarge, which has a Tesla V100 with 16GB of memory (I thought it was 32GB), hoping it would not hit an OOM too soon and the model could finish training before the OOM happened. Surprisingly, it looks like it was not a memory leak, because the V100 ran for a few hours without any issue.

So if you are experiencing OOM errors, you have two choices:

  1. Train locally using CPU
  2. Train with a p3.2xlarge instance

Tip: you can save money by using a spot instance. To create spot instances easily, use them together with an AMI, which makes spinning up a new spot instance relatively painless.

Someone on Stack Overflow said the OOM issue could be avoided by setting a smaller batch size. I tried, but it did not work for me.

Way Forward

  1. Fix the OOM issue
  2. Train based on an existing model
  3. A local web console

Credits

  1. @legg-it, who showed me how to build the docker images in this GitHub issue

