14 Commits

10 changed files with 355 additions and 178 deletions

2
.dockerignore Normal file

@ -0,0 +1,2 @@
venv

2
.gitignore vendored Normal file

@ -0,0 +1,2 @@
venv
/src/robocars_sagemaker_container.egg-info/

15
Dockerfile Normal file

@ -0,0 +1,15 @@
FROM docker.io/tensorflow/tensorflow:2.6.0
COPY requirements.txt .
RUN pip3 install --upgrade pip==20.0.2 && pip3 list && pip3 install -r requirements.txt \
&& pip3 list
WORKDIR /root
# copy the training script inside the container
COPY src/tf_container/train.py /opt/ml/code/train.py
# define train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py


@ -1,36 +1,15 @@
FROM python:3.5 as builder
FROM docker.io/tensorflow/tensorflow:2.6.0-gpu
RUN mkdir -p /usr/src
ADD . /usr/src
WORKDIR /usr/src
RUN python3 setup.py sdist
FROM tensorflow/tensorflow:1.8.0-gpu-py3
#tensorflow-serving-api-python3==1.7.0
RUN pip3 list && pip3 install numpy boto3 six awscli flask==0.11 Jinja2==2.9 gevent gunicorn keras==2.1.3 pillow h5py \
COPY requirements.txt .
RUN pip3 install --upgrade pip==20.0.2 && pip3 list && pip3 install -r requirements.txt \
&& pip3 list
WORKDIR /root
RUN apt-get -y update && \
apt-get -y install curl && \
apt-get -y install vim && \
apt-get -y install iputils-ping && \
apt-get -y install nginx
# copy the training script inside the container
COPY src/tf_container/train.py /opt/ml/code/train.py
# install telegraf
RUN cd /tmp && \
curl -O https://dl.influxdata.com/telegraf/releases/telegraf_1.4.2-1_amd64.deb && \
dpkg -i telegraf_1.4.2-1_amd64.deb && \
cd -
# define train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py
COPY --from=builder /usr/src/dist/robocars_sagemaker_container-1.0.0.tar.gz .
RUN pip3 install robocars_sagemaker_container-1.0.0.tar.gz
RUN rm robocars_sagemaker_container-1.0.0.tar.gz
ENTRYPOINT ["entry.py"]


@ -2,31 +2,34 @@
Run DIY Robocars model training as a Sagemaker (https://aws.amazon.com/fr/sagemaker/) task. Estimated cost for one training run (as of August 2018): 0.50 EUR
# Build images
## AWS usage
### Build images
- Build model image:
```
```bash
docker build -t robocars:1.8.0-gpu-py3 -f Dockerfile.gpu .
```
# Prepare training (once)
### Prepare training (once)
- Create an S3 bucket for your tubes. You can use the same bucket for model output or create another bucket for output
- Create an AWS docker registry and push your model image to it. Docker hub registry is not supported
```
```bash
docker tag robocars:1.8.0-gpu-py3 <replace_me>.dkr.ecr.eu-west-1.amazonaws.com/robocars:1.8.0-gpu-py3
# you need the AWS CLI installed and docker logged in to the registry
docker push <replace_me>.dkr.ecr.eu-west-1.amazonaws.com/robocars:1.8.0-gpu-py3
```
# Run training
### Run training
- Copy your tubes to your S3 bucket. All tubes in the bucket will be used for training, so make sure you keep only relevant files. We recommend zipping your tubes before upload. The training package will unzip them.
- Create a training job on AWS Sagemaker. Use the create_job.sh script after replacing the relevant parameters
```
```bash
#!/bin/bash
#usage: create_job.sh some_job_unique_name
@ -51,7 +54,7 @@ aws sagemaker create-training-job \
- Keep an eye on the job's progress on AWS Sagemaker. Once it finishes, your model is copied into the destination bucket.
# About AWS Sagemaker
### About AWS Sagemaker
Sagemaker provides on-demand model computing and serving. Standard algorithms can be used and on-demand Jupyter notebooks are available. However, as with any hosted service, tensorflow versions are updated frequently, which is hard to manage because compatible versions might not be available on RaspberryPi. Sagemaker also allows "Bring Your Own Algorithm" by using a docker image for training. The resulting container must comply with Sagemaker constraints.
@ -59,9 +62,36 @@ Input and output data are mapped to S3 buckets: at container start, input data i
Hyperparameters can be sent at job creation time and accessed by training code (example: ```env.hyperparameters.get('with_slide', False)```)
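Review note: with the `sagemaker-training` toolkit that this change set adds to requirements.txt, hyperparameters are, as far as I know, handed to the entry-point script as command-line arguments rather than read through `env.hyperparameters`. A minimal sketch under that assumption follows. The `parse_hyperparameters` helper is hypothetical, and it uses `parse_known_args` so that a key the script does not declare does not abort the run. Note that `create_job.sh` still sends `with_slide`, which the argparse block in `train.py` does not declare.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the hyperparameters the script knows about.

    Hypothetical helper: unknown flags are returned instead of raising.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--img_height", type=int, default=120)
    parser.add_argument("--img_width", type=int, default=160)
    parser.add_argument("--slide_size", type=int, default=0)
    known, unknown = parser.parse_known_args(argv)
    return vars(known), unknown


# Simulated command line for some of the hyperparameters set in create_job.sh.
params, ignored = parse_hyperparameters(
    ["--img_height", "120", "--img_width", "160", "--with_slide", "true"]
)
print(params)
print(ignored)
```

Whether to ignore unknown keys or fail fast is a design choice; failing fast catches typos in the job definition earlier.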
# Which Tensorflow version should I pick ?
### Which Tensorflow version should I pick ?
Version 1.4.1 model is compatible with 1.8.0 tensorflow runtime
Version 1.8.0 model is not compatible with previous tensorflow runtimes
## Local run
Run training locally with podman
### Run training with podman
1. build image
```bash
podman build . -t tensorflow_without_gpu
```
2. Make archive (See [rc-tools](https://git.cyrilix.bzh/robocars/robocar-tools))
```bash
go run ./cmd/rc-tools training archive -record-path ~/robocar/record-sim2 -output /tmp/train.zip -image-height 120 -image-width 160 --horizon 20 -with-flip-image
```
3. Run training
```bash
podman run --rm -it -v /tmp/data:/opt/ml/input/data/train -v /tmp/output:/opt/ml/model/ localhost/tensorflow_without_gpu python /opt/ml/code/train.py --img_height=100 --img_width=160 --batch_size=32
```
```bash
podman run --rm -it -v /tmp/data:/opt/ml/input/data/train -v /tmp/output:/opt/ml/model/ localhost/tensorflow_without_gpu python /opt/ml/code/train.py --img_height=256 --img_width=320 --batch_size=32
```


@ -1,22 +1,24 @@
#!/bin/bash
job_name=$1
if [ -z $job_name ]
if [[ -z ${job_name} ]]
then
echo 'Provide model name'
exit 1
fi
echo 'Creating training job '$1
training_image="<replace_me>.dkr.ecr.eu-west-1.amazonaws.com/robocars:1.8.0-gpu-py3"
iam_role_arn="arn:aws:iam::<replace_me>:role/service-role/<replace_me>"
training_image="117617958416.dkr.ecr.eu-west-1.amazonaws.com/robocars:tensorflow"
iam_role_arn="arn:aws:iam::117617958416:role/robocar-training"
DATA_BUCKET="s3://robocars-cyrilix-learning/input"
DATA_OUTPUT="s3://robocars-cyrilix-learning/output"
aws sagemaker create-training-job \
--training-job-name $job_name \
--hyper-parameters '{ "sagemaker_region": "\"eu-west-1\"", "with_slide": "true" }' \
--algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \
--role-arn $iam_role_arn \
--input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<replace_me>", "S3DataDistributionType": "FullyReplicated" }} }]' \
--output-data-config S3OutputPath=s3://<replace_me> \
--training-job-name ${job_name} \
--hyper-parameters '{ "sagemaker_region": "\"eu-west-1\"", "with_slide": "true", "img_height": "120", "img_width": "160" }' \
--algorithm-specification TrainingImage="${training_image}",TrainingInputMode=File \
--role-arn ${iam_role_arn} \
--input-data-config "[{ \"ChannelName\": \"train\", \"DataSource\": { \"S3DataSource\": { \"S3DataType\": \"S3Prefix\", \"S3Uri\": \"${DATA_BUCKET}\", \"S3DataDistributionType\": \"FullyReplicated\" }} }]" \
--output-data-config S3OutputPath=${DATA_OUTPUT} \
--resource-config InstanceType=ml.p2.xlarge,InstanceCount=1,VolumeSizeInGB=1 \
--stopping-condition MaxRuntimeInSeconds=1800
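
Review note: the hand-escaped JSON passed to `--input-data-config` is easy to get wrong. One option, sketched here in Python rather than bash, is to build the structure as data and serialise it. The field names are copied from the script above; `build_input_data_config` is a hypothetical helper, not part of this change set.

```python
import json


def build_input_data_config(s3_uri, channel="train"):
    """Return the JSON string for --input-data-config (hypothetical helper)."""
    config = [{
        "ChannelName": channel,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }]
    return json.dumps(config)


print(build_input_data_config("s3://robocars-cyrilix-learning/input"))
```

The resulting string can be passed to the CLI in a single quoted argument, which avoids the backslash escaping used in the script.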

4
requirements.txt Normal file

@ -0,0 +1,4 @@
sagemaker-training==3.9.2
tensorflow==2.6.0
numpy==1.19.5
pillow==8.3.2


@ -1,8 +1,8 @@
import os
from glob import glob
from os.path import basename
from os.path import splitext
from glob import glob
from setuptools import setup, find_packages
@ -19,9 +19,13 @@ setup(
py_modules=[splitext(basename(path))[0] for path in glob('src/*.py')],
classifiers=[
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.7',
],
entry_points={
'console_scripts': [
'train=tf_container.train_entry_point:train',
]
},
install_requires=['sagemaker-container-support'],
extras_require={},
)

265
src/tf_container/train.py Normal file

@ -0,0 +1,265 @@
#!/usr/bin/env python3
import os
# import container_support as cs
import argparse
import json
import numpy as np
import re
import tensorflow as tf
import zipfile
# from tensorflow.keras import backend as K
from tensorflow.keras import callbacks
from tensorflow.keras.layers import Convolution2D
from tensorflow.keras.layers import Dropout, Flatten, Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.python.client import device_lib
MODEL_CATEGORICAL = "categorical"
MODEL_LINEAR = "linear"


def linear_bin(a: float, N: int = 15, offset: int = 1, R: float = 2.0):
    """
    create a bin of length N
    map val A to range R
    offset one hot bin by offset, commonly R/2
    """
    a = a + offset
    b = round(a / (R / (N - offset)))
    arr = np.zeros(N)
    b = clamp(b, 0, N - 1)
    arr[int(b)] = 1
    return arr


def clamp(n, min, max):
    if n <= min:
        return min
    if n >= max:
        return max
    return n


def get_data(root_dir, filename):
    print('load data from file ' + filename)
    # Use a context manager so the record file is closed after parsing
    with open(os.path.join(root_dir, filename)) as record_file:
        d = json.load(record_file)
    return [(d['user/angle']), root_dir, d['cam/image_array']]


numbers = re.compile(r'(\d+)')


def unzip_file(root, f):
    # Extract the archive next to itself; the context manager closes it
    with zipfile.ZipFile(os.path.join(root, f), 'r') as zip_ref:
        zip_ref.extractall(root)


def train(model_type: str, batch_size: int, slide_size: int, img_height: int, img_width: int, img_depth: int, horizon: int, drop: float):
    # env = cs.TrainingEnvironment()
    print(device_lib.list_local_devices())
    os.system('mkdir -p logs')

    # ### Loading the files ###
    # ** You need to copy all your files to the directory where you are running this notebook **
    # ** into a folder named "data" **
    data = []
    for root, dirs, files in os.walk('/opt/ml/input/data/train'):
        for f in files:
            if f.endswith('.zip'):
                unzip_file(root, f)
    for root, dirs, files in os.walk('/opt/ml/input/data/train'):
        data.extend(
            [get_data(root, f) for f in sorted(files, key=str.lower) if f.startswith('record') and f.endswith('.json')])

    # ### Loading throttle and angle ###
    angle = [d[0] for d in data]
    angle_array = np.array(angle)

    # ### Loading images ###
    if horizon > 0:
        images = np.array([img_to_array(load_img(os.path.join(d[1], d[2])).crop((0, horizon, img_width, img_height))) for d in data], 'f')
    else:
        images = np.array([img_to_array(load_img(os.path.join(d[1], d[2]))) for d in data], 'f')

    # slide images vs orders
    if slide_size > 0:
        images = images[:len(images) - slide_size]
        angle_array = angle_array[slide_size:]

    # ### Start training ###
    from datetime import datetime
    logdir = '/opt/ml/model/logs/' + datetime.now().strftime("%Y%m%d-%H%M%S")
    logs = callbacks.TensorBoard(log_dir=logdir, histogram_freq=0, write_graph=True, write_images=True)

    # Creates a file writer for the log directory.
    # file_writer = tf.summary.create_file_writer(logdir)
    # Using the file writer, log the reshaped image.
    # with file_writer.as_default():
    #     # Don't forget to reshape.
    #     imgs = np.reshape(images[0:25], (-1, img_height, img_width, img_depth))
    #     tf.summary.image("25 training data examples", imgs, max_outputs=25, step=0)

    model_filepath = '/opt/ml/model/model_other'
    if model_type == MODEL_CATEGORICAL:
        model_filepath = '/opt/ml/model/model_cat'
        angle_cat_array = np.array([linear_bin(float(a)) for a in angle_array])
        model = default_categorical(input_shape=(img_height - horizon, img_width, img_depth), drop=drop)
        loss = {'angle_out': 'categorical_crossentropy', }
        optimizer = 'adam'
    elif model_type == MODEL_LINEAR:
        model_filepath = '/opt/ml/model/model_lin'
        angle_cat_array = np.array([a for a in angle_array])
        model = default_linear(input_shape=(img_height - horizon, img_width, img_depth), drop=drop)
        loss = 'mse'
        optimizer = 'rmsprop'
    else:
        raise Exception("invalid model type")

    save_best = callbacks.ModelCheckpoint(model_filepath, monitor='val_loss', verbose=1,
                                          save_best_only=True, mode='min')
    early_stop = callbacks.EarlyStopping(monitor='val_loss',
                                         min_delta=.0005,
                                         patience=5,
                                         verbose=1,
                                         mode='auto')
    # categorical output of the angle
    callbacks_list = [save_best, early_stop, logs]
    model.compile(optimizer=optimizer,
                  loss=loss,)
    model.fit({'img_in': images}, {'angle_out': angle_cat_array, }, batch_size=batch_size,
              epochs=100, verbose=1, validation_split=0.2, shuffle=True, callbacks=callbacks_list)

    # Save the model in TensorFlow SavedModel format
    model.save("/opt/ml/model/tfModel", save_format="tf")

    def representative_dataset():
        for d in tf.data.Dataset.from_tensor_slices(images).batch(1).take(100):
            yield [tf.dtypes.cast(d, tf.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # full quantization for edgeTpu
    # https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer_quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8  # or tf.int8
    converter.inference_output_type = tf.uint8  # or tf.int8
    tflite_model = converter.convert()

    # Save the model.
    with open('/opt/ml/model/model_' + model_type + '_' + str(img_width) + 'x' + str(img_height) + 'h' + str(horizon) + '.tflite',
              'wb') as f:
        f.write(tflite_model)


def conv2d(filters, kernel, strides, layer_num, activation='relu'):
    """
    Helper function to create a standard valid-padded convolutional layer
    with square kernel and strides and unified naming convention
    :param filters: channel dimension of the layer
    :param kernel: creates (kernel, kernel) kernel matrix dimension
    :param strides: creates (strides, strides) stride
    :param layer_num: used in labelling the layer
    :param activation: activation, defaults to relu
    :return: tf.keras Convolution2D layer
    """
    return Convolution2D(filters=filters,
                         kernel_size=(kernel, kernel),
                         strides=(strides, strides),
                         activation=activation,
                         name='conv2d_' + str(layer_num))


def core_cnn_layers(img_in: Input, img_height: int, img_width: int, drop: float, l4_stride: int = 1):
    """
    Returns the core CNN layers that are shared among the different models,
    like linear, imu, behavioural
    :param img_width: image width
    :param img_height: image height
    :param img_in: input layer of network
    :param drop: dropout rate
    :param l4_stride: 4-th layer stride, default 1
    :return: stack of CNN layers
    """
    x = img_in
    # Integer division keeps the filter count an int (24 and 32 for 120x160)
    x = conv2d(img_height // 5, 5, 2, 1)(x)
    x = Dropout(drop)(x)
    x = conv2d(img_width // 5, 5, 2, 2)(x)
    x = Dropout(drop)(x)
    x = conv2d(64, 5, 2, 3)(x)
    x = Dropout(drop)(x)
    x = conv2d(64, 3, l4_stride, 4)(x)
    x = Dropout(drop)(x)
    x = conv2d(64, 3, 1, 5)(x)
    x = Dropout(drop)(x)
    x = Flatten(name='flattened')(x)
    return x


def default_linear(input_shape=(120, 160, 3), drop=0.2):
    img_in = Input(shape=input_shape, name='img_in')
    x = core_cnn_layers(img_in, img_width=input_shape[1], img_height=input_shape[0], drop=drop)
    x = Dense(100, activation='relu', name='dense_1')(x)
    x = Dropout(drop)(x)
    x = Dense(50, activation='relu', name='dense_2')(x)
    x = Dropout(drop)(x)
    angle_out = Dense(1, activation='linear', name='angle_out')(x)
    model = Model(inputs=[img_in], outputs=[angle_out], name='linear')
    return model


def default_categorical(input_shape=(120, 160, 3), drop=0.2):
    img_in = Input(shape=input_shape, name='img_in')
    x = core_cnn_layers(img_in, img_width=input_shape[1], img_height=input_shape[0], drop=drop, l4_stride=2)
    x = Dense(100, activation='relu', name="dense_1")(x)
    x = Dropout(drop)(x)
    x = Dense(50, activation='relu', name="dense_2")(x)
    x = Dropout(drop)(x)
    # Categorical output of the angle into 15 bins
    angle_out = Dense(15, activation='softmax', name='angle_out')(x)
    model = Model(inputs=[img_in], outputs=[angle_out], name='categorical')
    return model


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--slide_size", type=int, default=0)
    parser.add_argument("--img_height", type=int, default=120)
    parser.add_argument("--img_width", type=int, default=160)
    parser.add_argument("--img_depth", type=int, default=3)
    parser.add_argument("--horizon", type=int, default=0)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--drop", type=float, default=0.2)
    parser.add_argument("--model_type", type=str, default=MODEL_CATEGORICAL)
    args = parser.parse_args()
    params = vars(args)
    train(
        model_type=params["model_type"],
        batch_size=params["batch_size"],
        slide_size=params["slide_size"],
        img_height=params["img_height"],
        img_width=params["img_width"],
        img_depth=params["img_depth"],
        horizon=params["horizon"],
        drop=params["drop"],
    )
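
Review note: to make the one-hot encoding used by `linear_bin` concrete, the sketch below recomputes only the bin index in plain Python, using the same arithmetic and defaults as the function in this file. It also adds a hypothetical inverse, `bin_to_angle`, which is not part of this change set and is shown only as an assumption about how a predicted bin might be mapped back to a steering value.

```python
def bin_index(a, n=15, offset=1, r=2.0):
    """Index of the hot bin, using the same arithmetic as linear_bin."""
    b = round((a + offset) / (r / (n - offset)))
    # Same effect as clamp(b, 0, n - 1) in train.py
    return max(0, min(n - 1, b))


def bin_to_angle(index, n=15, offset=1, r=2.0):
    """Hypothetical inverse: value represented by a bin, back in [-1, 1]."""
    return index * (r / (n - offset)) - offset


for angle in (-1.0, 0.0, 1.0):
    idx = bin_index(angle)
    print(angle, idx, round(bin_to_angle(idx), 3))
```

With the defaults, an angle of -1 lands in the first bin, 0 in the middle bin and 1 in the last bin; out-of-range angles are clamped to the end bins.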


@ -1,126 +0,0 @@
#!/usr/bin/env python3
import container_support as cs
import os
import json
import re
import zipfile
from keras.preprocessing.image import load_img, img_to_array
import numpy as np
from keras.layers import Input, Dense, merge
from keras.models import Model
from keras.layers import Convolution2D, MaxPooling2D, Reshape, BatchNormalization
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import callbacks
from tensorflow.python.client import device_lib
def train():
env = cs.TrainingEnvironment()
print(device_lib.list_local_devices())
os.system('mkdir -p logs')
# ### Loading the files ###
# ** You need to copy all your files to the directory where you are running this notebook into a folder named "data" **
numbers = re.compile(r'(\d+)')
data = []
def get_data(root,f):
d = json.load(open(os.path.join(root,f)))
if ('pilot/throttle' in d):
return [d['user/mode'],d['user/throttle'],d['user/angle'],root,d['cam/image_array'],d['pilot/throttle'],d['pilot/angle']]
else:
return [d['user/mode'],d['user/throttle'],d['user/angle'],root,d['cam/image_array']]
def numericalSort(value):
parts = numbers.split(value)
parts[1::2] = map(int, parts[1::2])
return parts
def unzip_file(root,f):
zip_ref = zipfile.ZipFile(os.path.join(root,f), 'r')
zip_ref.extractall(root)
zip_ref.close()
for root, dirs, files in os.walk('/opt/ml/input/data/train'):
for f in files:
if f.endswith('.zip'):
unzip_file(root, f)
for root, dirs, files in os.walk('/opt/ml/input/data/train'):
data.extend([get_data(root,f) for f in sorted(files, key=numericalSort) if f.startswith('record') and f.endswith('.json')])
# Normalize / correct data
data = [d for d in data if d[1] > 0.1]
for d in data:
if d[1] < 0.2:
d[1] = 0.2
# ### Loading throttle and angle ###
angle = [d[2] for d in data]
throttle = [d[1] for d in data]
angle_array = np.array(angle)
throttle_array = np.array(throttle)
if (len(data[0]) > 5):
pilot_angle = [d[6] for d in data]
pilot_throttle = [d[5] for d in data]
pilot_angle_array = np.array(pilot_angle)
pilot_throttle_array = np.array(pilot_throttle)
else:
pilot_angle = []
pilot_throttle = []
# ### Loading images ###
images = np.array([img_to_array(load_img(os.path.join(d[3],d[4]))) for d in data],'f')
# slide images vs orders
if env.hyperparameters.get('with_slide', False):
images = images[:len(images)-2]
angle_array = angle_array[2:]
throttle_array = throttle_array[2:]
# ### Start training ###
def linear_bin(a):
a = a + 1
b = round(a / (2/14))
arr = np.zeros(15)
arr[int(b)] = 1
return arr
logs = callbacks.TensorBoard(log_dir='logs', histogram_freq=0, write_graph=True, write_images=True)
save_best = callbacks.ModelCheckpoint('/opt/ml/model/model_cat', monitor='angle_out_loss', verbose=1, save_best_only=True, mode='min')
early_stop = callbacks.EarlyStopping(monitor='angle_out_loss',
min_delta=.0005,
patience=10,
verbose=1,
mode='auto')
img_in = Input(shape=(120, 160, 3), name='img_in') # First layer, input layer, Shape comes from camera.py resolution, RGB
x = img_in
x = Convolution2D(24, (5,5), strides=(2,2), activation='relu')(x) # 24 features, 5 pixel x 5 pixel kernel (convolution, feature) window, 2wx2h stride, relu activation
x = Convolution2D(32, (5,5), strides=(2,2), activation='relu')(x) # 32 features, 5px5p kernel window, 2wx2h stride, relu activation
x = Convolution2D(64, (5,5), strides=(2,2), activation='relu')(x) # 64 features, 5px5p kernel window, 2wx2h stride, relu
x = Convolution2D(64, (3,3), strides=(2,2), activation='relu')(x) # 64 features, 3px3p kernel window, 2wx2h stride, relu
x = Convolution2D(64, (3,3), strides=(1,1), activation='relu')(x) # 64 features, 3px3p kernel window, 1wx1h stride, relu
# Possibly add MaxPooling (will make it less sensitive to position in image). Camera angle fixed, so may not to be needed
x = Flatten(name='flattened')(x) # Flatten to 1D (Fully connected)
x = Dense(100, activation='relu')(x) # Classify the data into 100 features, make all negatives 0
x = Dropout(.1)(x)
x = Dense(50, activation='relu')(x)
x = Dropout(.1)(x) # Randomly drop out 10% of the neurons (Prevent overfitting)
#categorical output of the angle
callbacks_list = [save_best, early_stop, logs]
angle_out = Dense(15, activation='softmax', name='angle_out')(x) # Connect every input with every output and output 15 hidden units. Use Softmax to give percentage. 15 categories and find best one based off percentage 0.0-1.0
#continous output of throttle
throttle_out = Dense(1, activation='relu', name='throttle_out')(x) # Reduce to 1 number, Positive number only
angle_cat_array = np.array([linear_bin(a) for a in angle_array])
model = Model(inputs=[img_in], outputs=[angle_out, throttle_out])
model.compile(optimizer='adam',
loss={'angle_out': 'categorical_crossentropy',
'throttle_out': 'mean_absolute_error'},
loss_weights={'angle_out': 0.9, 'throttle_out': .001})
model.fit({'img_in':images},{'angle_out': angle_cat_array, 'throttle_out': throttle_array}, batch_size=32, epochs=100, verbose=1, validation_split=0.2, shuffle=True, callbacks=callbacks_list)
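
Review note: for reference, the spatial size after a stack of valid-padded convolutions (the Keras default padding) follows out = floor((in - kernel) / stride) + 1 along each axis. The sketch below applies that formula to the five `Convolution2D` layers of the removed model for its 120x160 input. `conv_out` and `stack_out` are hypothetical helpers used only for this illustration.

```python
def conv_out(size, kernel, stride):
    """Output length of one valid-padded convolution along one axis."""
    return (size - kernel) // stride + 1


def stack_out(height, width, layers):
    """Apply (kernel, stride) pairs in order and return the final (h, w)."""
    for kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return height, width


# (kernel, stride) for the five Convolution2D layers of the removed model.
LAYERS = [(5, 2), (5, 2), (5, 2), (3, 2), (3, 1)]

h, w = stack_out(120, 160, LAYERS)
print(h, w, h * w * 64)  # the last layer has 64 filters
```

The same arithmetic is a quick way to check that a cropped or resized input is still large enough to pass through every layer before a training job is launched.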