Train DeepRacer model locally with GPU support

DeepRacer is fun. At the same time, it can get really expensive. I ran fewer than a dozen training and simulation jobs to prepare for the race at the AWS Hong Kong Summit, and I got a bill like this.

I spent more than $200 in 14 days

While I was lining up for a chance to put my model on a real car on the track during the Hong Kong Summit, I talked to several people and they had all spent more than I did. One guy spent around $700 in just a few weeks. DeepRacer is a really nice platform for learning reinforcement learning, but I don't think it is even remotely reasonable for anyone to have to spend $100 just to understand how it works.

Note that the cost of RoboMaker is especially ridiculous, given that it just runs a simulated environment for SageMaker to interact with.

Luckily, Chris Rhodes has shared a project on GitHub for training DeepRacer models on your local PC.

In this article, I will show you how to train a model locally, upload it back to the AWS DeepRacer console, and run an evaluation using the uploaded model. I will also share my experience of using a GPU to train the model faster.

Training the model

These are the steps you need to follow to train the model locally:

  1. Set up the Python and project environment
  2. Install Docker and configure nvidia as default runtime
  3. Rebuild the Docker images
  4. Install MinIO as a local S3 provider
  5. Start SageMaker
  6. Start RoboMaker

Set up the Python environment

So let's create a virtual environment and install the AWS CLI:

conda create --name sagemaker python=3.6
conda activate sagemaker
conda install -c conda-forge awscli

Clone the project from Github

git clone --recurse-submodules https://github.com/crr0004/deepracer.git

Install the SageMaker dependencies (under the project root):

cd deepracer
pip install -U sagemaker-python-sdk/ pandas
pip install urllib3==1.24.3 #Fix some dependency issue
pip install PyYAML==3.13 #Fix some dependency issue
pip install ipython

Install Docker and configure nvidia as default runtime

Since we want to use the GPU when training the model, we need to tell Docker to use the Nvidia runtime as the default. Refer to this guide to configure Nvidia and finish the setup, or simply run the following commands:

# Update the default configuration and restart
pushd $(mktemp -d)
(sudo cat /etc/docker/daemon.json 2>/dev/null || echo '{}') | \
jq '. + {"default-runtime": "nvidia"}' | \
tee tmp.json
sudo mv tmp.json /etc/docker/daemon.json
popd
sudo systemctl restart docker

# No need for nvidia-docker or --engine=nvidia
docker run --rm -it nvidia/cuda nvidia-smi

You should see something like this if you set it up correctly

Rebuild the docker images with GPU support

The quickest way is to pull a prebuilt image:

docker pull sctse999/sagemaker-rl-tensorflow

Alternatively, you can build the image by following the steps below:

Build sagemaker-tensorflow-scriptmode:

cd sagemaker-tensorflow-container
python3 setup.py sdist
cp dist/sagemaker_tensorflow_container-2.0.0.tar.gz docker/1.12.0/
cd docker/1.12.0
wget https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
docker build -t 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12.0-gpu-py3 --build-arg py_version=3 --build-arg framework_installable=tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl -f Dockerfile.gpu .

You should see something like this when you run docker images

Next, build the sagemaker-rl-tensorflow image:

cd sagemaker-containers
python3 setup.py sdist
cp dist/sagemaker_containers-2.4.4.post2.tar.gz ../sagemaker-rl-container/
cd ../sagemaker-rl-container
docker build -t 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3 --build-arg sagemaker_container=sagemaker_containers-2.4.4.post2.tar.gz --build-arg processor=gpu -f ./coach/docker/0.11.0/Dockerfile.tf .

You may ask why we are using the tag coach0.11-cpu-py3 instead of another name. It is because this is the default image name that the RLEstimator looks for. If you want to use your own image name, that is fine too, and I will show you how later below.
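If you did build the image under a different name, one simple option (a sketch; my-sagemaker-rl:gpu is a made-up example tag, not something from this project) is to retag your image to the default name that the RLEstimator expects:

# Replace my-sagemaker-rl:gpu with whatever tag you actually built
docker tag my-sagemaker-rl:gpu 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3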

Install MinIO as a local S3 service

SageMaker and RoboMaker exchange the model and training files through S3, so we need MinIO, a local S3-compatible server, to stand in for it. While you can run MinIO as a Docker container, I would recommend running it as a plain Ubuntu service; I had some bad experiences running it as a container.

Download the executable from the official site:

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin

Put the config file in place

sudo vi /etc/default/minio

Paste the following into the file. Remember to replace <<YOUR_PROJECT_PATH>> with the correct path.

# Volume to be used for MinIO server.
MINIO_VOLUMES="<<YOUR_PROJECT_PATH>>/data"
# Access Key of the server.
MINIO_ACCESS_KEY=minio
# Secret key of the server.
MINIO_SECRET_KEY=miniokey
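Before setting up the service, you can optionally sanity-check the binary by running MinIO in the foreground with the same values (this quick check is my own addition, not part of the original setup; stop it with Ctrl+C):

# Run once in the foreground to verify the binary and the data path
MINIO_ACCESS_KEY=minio MINIO_SECRET_KEY=miniokey minio server <<YOUR_PROJECT_PATH>>/data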

Set up MinIO as a service:

curl -O https://raw.githubusercontent.com/minio/minio-service/master/linux-systemd/minio.service
sudo mv minio.service /etc/systemd/system
sudo systemctl enable minio.service
sudo systemctl start minio.service

Note: for Ubuntu 16.04 LTS, you need to change the user and group in minio.service to ubuntu.
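For reference, the lines to look at in /etc/systemd/system/minio.service are roughly the following (an excerpt from memory, so treat it as a sketch; the upstream unit file may use a different default user such as minio-user):

[Service]
# Change these two lines on Ubuntu 16.04 LTS
User=ubuntu
Group=ubuntu
EnvironmentFile=-/etc/default/minio
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES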

Go to http://localhost:9000 to check the installation. Log in using the credentials you set above (i.e. minio/miniokey) and you should see this.

Use the button on the bottom right to create a bucket

Create a bucket named bucket.
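If you prefer the command line to the web UI, you can also create the bucket with the AWS CLI pointed at MinIO (a sketch that reuses the credentials configured above):

# Use the MinIO credentials as AWS credentials for this shell
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=miniokey
aws --endpoint-url http://localhost:9000 s3 mb s3://bucket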

Start SageMaker

Copy the provided config.yaml into the SageMaker configuration directory:

mkdir -p ~/.sagemaker && cp config.yaml ~/.sagemaker

Create a docker network named sagemaker-local

docker network create sagemaker-local

Check the subnet of the docker network

docker network inspect sagemaker-local
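The value you need next is the network's gateway address. If you do not want to read the JSON by eye, you can pull it out directly (a sketch using docker's built-in --format flag):

# Print only the gateway IP of the sagemaker-local network
docker network inspect sagemaker-local --format '{{(index .IPAM.Config 0).Gateway}}'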

Edit rl_coach/env.sh, comment out the existing S3_ENDPOINT_URL line, and point it at the gateway of the sagemaker-local network instead:

# export S3_ENDPOINT_URL=http://$(hostname -i):9000
export S3_ENDPOINT_URL=http://172.18.0.1:9000

Source the env.sh

cd rl_coach
source env.sh

Before you start SageMaker, copy your local reward function and model_metadata.json to S3 (MinIO). If you want, you can edit the reward function before copying it to S3.

aws --endpoint-url $S3_ENDPOINT_URL s3 cp ../custom_files s3://bucket/custom_files  --recursive
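To confirm the files actually landed in the local bucket, you can list them before starting training (same bucket name as above):

aws --endpoint-url $S3_ENDPOINT_URL s3 ls s3://bucket/custom_files/ --recursive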

Optionally, to use the GPU image you built above, comment out the image_name line in rl_deepracer_coach_robomaker.py:

base_job_name=job_name_prefix,
# image_name="crr0004/sagemaker-rl-tensorflow:console",
train_max_run=job_duration_in_seconds, # Maximum runtime in second

The RLEstimator will look for the default image, which is named 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-tensorflow:coach0.11-cpu-py3

Run SageMaker

python rl_deepracer_coach_robomaker.py

If everything runs smoothly, you should see something like this:

At this point, SageMaker is waiting for RoboMaker to start the training.

Start RoboMaker

Edit robomaker.env in the project root and pick the track you want to train on via WORLD_NAME:

# WORLD_NAME=Tokyo_Training_track
WORLD_NAME=reinvent_base

In addition, if your docker network's gateway is different from 172.18.0.1, remember to change the S3_ENDPOINT_URL parameter as well.

S3_ENDPOINT_URL=http://172.18.0.1:9000
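To double-check that containers on the sagemaker-local network can reach MinIO at that address, one quick check (my own addition, using MinIO's standard health endpoint) is:

# Should print 200 if MinIO is reachable from inside the docker network
docker run --rm --network sagemaker-local curlimages/curl -s -o /dev/null -w "%{http_code}\n" http://172.18.0.1:9000/minio/health/live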

If everything is good, run this command under the project directory

docker run --rm --name dr --env-file ./robomaker.env --network sagemaker-local -p 8080:5900 -it crr0004/deepracer_robomaker:console

If you see something like this, congratulations, you made it!

Meanwhile, you should see SageMaker outputting something like this:

You can also watch the simulation by connecting a VNC client to RoboMaker. The docker run command above maps container port 5900 to host port 8080, which corresponds to VNC display 2180:

gvncviewer localhost:2180

Submit the locally trained model to DeepRacer Console
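The exact S3 location depends on how your console models are stored, but as a rough sketch (<<JOB_PREFIX>> is whatever prefix your training job wrote under in the local bucket, and <<YOUR_S3_BUCKET>>/<<PREFIX>> are placeholders for wherever your console expects the model), copying the artifacts out of MinIO into real S3 looks something like this:

# Pull the trained model artifacts out of the local MinIO bucket
aws --endpoint-url $S3_ENDPOINT_URL s3 sync s3://bucket/<<JOB_PREFIX>>/model ./local-model
# Push them to a real S3 bucket using your normal AWS credentials (no --endpoint-url)
aws s3 sync ./local-model s3://<<YOUR_S3_BUCKET>>/<<PREFIX>>/model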

After uploading the model, go back to the DeepRacer console and start an evaluation. It works.

After the evaluation is finished, you could also download the model and supposedly put it on a DeepRacer to run on a physical track. However, because I don't have a DeepRacer vehicle yet (I ordered one back in November 2018 and AWS still has not delivered it after more than 7 months!), I cannot confirm whether this works. Let me know if you have tried it and I will add the details here so everyone knows.

Other Issues

When training locally, my GPU eventually ran out of memory (OOM). I decided to try a p3.2xlarge, which has a Tesla V100 with 16GB of memory (I thought it was 32GB), hoping it would not hit an OOM too soon and the model could finish training before the OOM happened. Surprisingly, it looks like it was not a memory leak, because the V100 ran for a few hours without any issue.

So if you are experiencing OOM errors, you have two choices:

  1. Train locally using CPU
  2. Train with a p3.2xlarge instance

Tip: you can save money by using a spot instance. To create spot instances easily, use them together with an AMI, which makes spinning up a new spot instance relatively painless.

Someone on Stack Overflow said the OOM issue could be avoided by setting a smaller batch size. I tried, but it did not work for me.

Way Forward

  1. Fix the OOM issue
  2. Train based on an existing model
  3. A local web console

Credits

  1. @legg-it, who showed me how to build the docker images in this GitHub issue

