What to define as entrypoint when initializing a pytorch estimator with a custom docker image for training on AWS Sagemaker?

So I created a docker image for training. In the dockerfile I have an entrypoint defined such that when docker run is executed, it will start running my python code.
To use this on AWS SageMaker, in my understanding, I need to create a PyTorch estimator in a Jupyter notebook in SageMaker. I tried something like this:
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.3.1',
                    image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                    train_instance_count=1,
                    train_instance_type='ml.p3.xlarge',
                    hyperparameters={})
estimator.fit({})
In the documentation I found that as image name I can specify the link to my Docker image on AWS ECR. When I try to execute this it keeps complaining:
[Errno 2] No such file or directory: 'train.py'
It complains immediately, so surely I am doing something completely wrong. I would expect that my Docker image should run first, and then it could find out that the entry point does not exist.
But besides this, why do I need to specify an entry point at all? Should it not be clear that the entry to my training is simply docker run?
For better understanding: the entrypoint Python file in my Docker image looks like this:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=16)
    parser.add_argument('--learning_rate', type=float, default=0.0001)

    # Data and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ['OUTPUT_DATA_DIR'])
    parser.add_argument('--train_data_path', type=str, default=os.environ['CHANNEL_TRAIN'])
    parser.add_argument('--valid_data_path', type=str, default=os.environ['CHANNEL_VALID'])

    # Start training
    ...
Later I would like to specify the hyperparameters and data channels. But for now I simply do not understand what to put as entry point. In the documentation it says that the entrypoint is required and it should be a local/global path to the entrypoint...

If you really want to use a completely separate, self-built Docker image, you should create an Amazon SageMaker algorithm (which is one of the options in the SageMaker menu). There you have to specify a link to your Docker image on Amazon ECR, as well as the input parameters, data channels, etc. When choosing this option, you should not use the PyTorch estimator but the Algorithm estimator. This way you indeed don't have to specify an entry point, because it simply runs the Docker image when training, and the default entrypoint can be defined in your Dockerfile.
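As an illustration of that bring-your-own-container route, here is a minimal sketch (not the author's exact setup) using the generic sagemaker.estimator.Estimator, which takes the ECR image directly and needs no entry_point, so the ENTRYPOINT baked into your Dockerfile is what runs. Parameter names follow the older v1 SDK used in the question; the image URI, instance type and channel path are placeholders:
import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

# No entry_point here: the container's own ENTRYPOINT is executed for training.
estimator = Estimator(image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p3.2xlarge',
                      hyperparameters={})

estimator.fit({'train': 's3://my-bucket/train-data'})  # example channel, placeholder path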
The PyTorch estimator can be used when you have your own model code but would like to run it in an off-the-shelf SageMaker PyTorch Docker image. This is why you have to specify, for example, the PyTorch framework version. In this case the entry point file should by default be placed next to where your Jupyter notebook is stored (just upload the file by clicking the upload button). The PyTorch estimator inherits all options from the Framework estimator, which has options for where to place the entry point and model code, for example source_dir.
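For completeness, a minimal sketch of that framework route (paths, versions and hyperparameters are placeholders): train.py sits in a local source_dir next to the notebook, optionally together with a requirements.txt, and runs inside the pre-built SageMaker PyTorch container:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py',            # file inside source_dir
                    source_dir='./src',                # local folder uploaded into the container
                    role=role,
                    framework_version='1.3.1',
                    py_version='py3',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    hyperparameters={'epochs': 5, 'batch_size': 16})

estimator.fit({'train': 's3://my-bucket/train-data'})  # channel name and path are placeholders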

Related

How to install python packages within Amazon Sagemaker Processing Job?

I am trying to create a Sklearn processing job in Amazon Sagemaker to perform some data transformation on my input data before I do model training.
I wrote a custom Python script preprocessing.py which does what I need. I use some Python packages in this script. Here is the Sagemaker example I followed.
When I try to submit the Processing Job I get an error -
............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that my processing job is unable to find this package and that I need to install it. My question is how I can accomplish this using the Sagemaker Processing Job API. Ideally there would be a way to define a requirements.txt in the API call, but I don't see such functionality in the docs.
I know I can create a custom image with the relevant packages and later use this image in the Processing Job, but this seems like too much work for something that should be built in.
Is there an easier/more elegant way to install the packages needed in a Sagemaker Processing Job?
One way would be to call pip from Python:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
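For the question's case, a self-contained sketch for the top of preprocessing.py (the PyPI package name snowflake-connector-python is, to my knowledge, what provides snowflake.connector, but verify it for your setup):
import subprocess
import sys

# Install the missing dependency with the same interpreter that runs this script
subprocess.check_call([sys.executable, "-m", "pip", "install",
                       "snowflake-connector-python"])

import snowflake.connector  # importable after the install above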
Another way would be to use an SKLearn Estimator (training job) instead, to do the same thing. You can provide a source_dir, which can include a requirements.txt file, and these requirements will be installed for you:
estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
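A hypothetical layout and call for that approach (names and paths are placeholders): foo.py in ./foo is the entry point and ./foo/requirements.txt lists the packages, which the training job installs before running the script:
# ./foo/foo.py              <- entry_point with the preprocessing logic
# ./foo/requirements.txt    <- e.g. snowflake-connector-python
estimator.fit({"train": "s3://my-bucket/input-data/"})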

How to use custom Cloud Builders with images from Google Artifact Repository

How do I use a custom builder image in Cloud Build that is stored in a repository in Artifact Registry (instead of Container Registry)?
I have set up a pipeline in Cloud Build where some Python code is executed using the official Python images. As I want to cache my Python dependencies, I wanted to create a custom Cloud Builder as shown in the official documentation here.
GCP clearly indicates to switch to Artifact Registry, as Container Registry will be replaced by it. Consequently, I have pushed my Docker image to Artifact Registry. I also gave my Cloud Build service account reader permissions on Artifact Registry.
Using the image in a Cloud Build step like this
steps:
- name: 'europe-west3-docker.pkg.dev/xxxx/yyyy:latest'
  id: install_dependencies
  entrypoint: pip
  args: ["install", "-r", "requirements.txt", "--user"]
throws the following error
Step #0 - "install_dependencies": Pulling image: europe-west3-docker.pkg.dev/xxxx/yyyy:latest
Step #0 - "install_dependencies": Error response from daemon: manifest for europe-west3-docker.pkg.dev/xxxx/yyyy:latest not found: manifest unknown: Requested entity was not found.
"xxxx" is the repository name and "yyyy" the name of my image. The tag "latest" exists.
I can pull the image locally and access the repository.
I could not find any documentation on how to integrate these images from Artifact Registry. There is only this official guide, where the image is built using the Docker image from Container Registry; however, this is presumably not future-proof.
It looks like you need to add your Project ID to your image name.
You can use the "$PROJECT_ID" Cloud Build default substitution variable.
So your updated image name would look something like this:
steps:
- name: 'europe-west3-docker.pkg.dev/$PROJECT_ID/xxxx/yyyy:latest'
For more details about substituting variable values in Cloud Build see:
https://cloud.google.com/build/docs/configuring-builds/substitute-variable-values

How to load an .RData file to AWS sagemaker notebook?

I just started using AWS Sagemaker and I have an xgboost model saved on my personal laptop using the save as .RData, saveRDS, and xgb.save commands. I have uploaded those files to my Sagemaker notebook instance where my other notebooks are. However, I am unable to load it into my environment and predict on test data using the following commands:
load("Model.RData")
model=xgb.load('model')
model <- readRDS("Model.rds")
When I predict, I get NAs as my prediction. These commands work fine in RStudio but not in the Sagemaker notebook. Please help.

Docker image different size when pushed to ECR than locally

I have a Docker image that is 1.46 GB on my local machine, but when it is pushed to AWS ECR (either from my local machine or via CircleCI deployment) it is only 537.05 MB. I'm pretty new to Docker and to AWS, so any help in figuring out why this may be would be appreciated!
I have a feeling that it has not fully uploaded to ECR for whatever reason, as I am trying to use this container for a Batch job, and for some reason the same command which works locally does not work when used in the job definition. The command is simply python app.py, but I have also tried the absolute path python /usr/local/src/app/app.py, both of which result in [Errno 2] No such file or directory.
Commands used in my Makefile deployment are as below:
docker build --force-rm=true -t $(EXTRACTOR_IMAGE_NAME) ./extractor
docker tag $(EXTRACTOR_IMAGE_NAME) $(EXTRACTOR_ECR_IMAGE_NAME)
$(shell aws ecr get-login --no-include-email)
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/$(EXTRACTOR_ECR_REPO)
Edit 1:
I think this might have to do with the size of the base image, which is python:2.7 in this case. The base image is 914 MB, plus the size of my ECR image 537.05 MB = 1451.05 MB, i.e. approx. 1.46 GB. Still not sure what the issue is with the Batch command though...
Edit 2:
I've been mounting code into my container using a volume, which is why this has been working locally. At build time I forgot to copy the code into the container, which I assume is the only reason why this is not working in Batch!
That could be due to how the Docker client acts before it pushes the image to ECR, as documented:
Beginning with Docker version 1.9, the Docker client compresses image layers before pushing them to a V2 Docker registry. The output of the docker images command shows the uncompressed image size, so it may return a larger image size than the image sizes shown in the AWS Management Console.
So when you pull an image you will notice that the image layers go through three stages:
Downloading
Extraction
Completion
Regarding the command python /usr/local/src/app/app.py: are you executing it while you are inside /usr/local/src/app/? You might need to ensure this first. Also, have you checked the command inside a container run from the image before you push it? The error seems to be code-related rather than a Docker issue.
We can read the following in the AWS ECR documentation:
Note
Beginning with Docker version 1.9, the Docker client compresses image layers before pushing them to a V2 Docker registry. The output of the docker images command shows the uncompressed image size, so it may return a larger image size than the image sizes shown in the AWS Management Console.
I suspect you'd get the sizes you expect if you used the CLI (docker images) instead of the ECR web console.

gcloud job can't access my files, whether they are in GCS or in my cloud shell

I'm trying to run my machine learning code for images using TensorFlow in Google Cloud ML. However, it seems the submitted job can't access my files in my Cloud Shell or in GCS. Even though it works fine on my local machine, I get the following error once I submit my job using the gcloud command from the Cloud Shell:
ERROR 2017-12-19 13:52:28 +0100 service IOError: [Errno 2] No such file or directory: '/home/user/pores-project-googleML/trainer/train.txt'
This file can definitely be found in the Cloud Shell, and I can check it when I type:
ls /home/user/pores-project-googleML/trainer/train.txt
I tried putting my file train.txt in GCS and accessing it from my code (by specifying the path gs://my_bucket/my_path), but once the job was submitted, I got a 'No such file or directory' error with the corresponding path.
To check where the job I submitted using gcloud is running, I added print(os.getcwd()) at the beginning of my Python code trainer/task.py, which printed /user_dir in the logs. I couldn't find this path using the Cloud Shell, nor in GCS. So my question is: how can I know on which machine my job is running? If it's in some container somewhere, how can I access my files from it using the Cloud Shell and in GCS?
Before I did all of this, I successfully completed the 'Image Classification using Flowers Dataset' tutorial.
The command I used to submit my job is:
gcloud ml-engine jobs submit training $JOB_NAME --job-dir $JOB_DIR --packages trainer-0.1.tar.gz --module-name $MAIN_TRAINER_MODULE --region us-central1
where:
TRAINER_PACKAGE_PATH=/home/use/pores-project-googleML/trainer
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://pores/AlexNet_CloudML/job_dir/"
JOB_NAME="census$(date +"%Y%m%d_%H%M%S")"
The regular Python IO library is not able to access files on GCS. Instead, you need to use the GCS Python client or the gsutil CLI to access GCS files.
Note that TensorFlow itself has native support for GCS (i.e., it can read GCS files directly).
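A minimal sketch of the first option (a rough illustration: the bucket and object names are placeholders, and it assumes the google-cloud-storage package is available in the training environment):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my_bucket')                # placeholder bucket name
blob = bucket.blob('my_path/train.txt')            # placeholder object path
text = blob.download_as_string().decode('utf-8')   # read the file contents into memory
print(text[:100])
Since TensorFlow reads GCS natively, you can also pass the gs://my_bucket/my_path/train.txt path straight to TensorFlow's file APIs (for example tf.gfile in the TF 1.x versions contemporary with this question) instead of downloading the file first.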