SageMaker Endpoint stuck at "Creating" - amazon-web-services

I'm trying to deploy a SageMaker endpoint and it stays in the "Creating" stage indefinitely. Below are my Dockerfile and training/serving script. The model trains without any issue; only the endpoint deployment gets stuck in the "Creating" stage.
Below is the folder structure:
|_ code
|   |_ train_serve.py
|_ Dockerfile
Below is the Dockerfile:
# ##########################################################
# Adapt your container (to work with SageMaker)
# # https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
# # https://hub.docker.com/r/huanjason/scikit-learn/dockerfile
ARG REGION=us-east-1
FROM python:3.7
RUN apt-get update && apt-get -y install gcc
RUN pip3 install \
    # numpy==1.16.2 \
    numpy \
    # scikit-learn==0.20.2 \
    scikit-learn \
    pandas \
    # scipy==1.2.1 \
    scipy \
    mlflow
RUN rm -rf /root/.cache
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
# Install sagemaker-training toolkit to enable SageMaker Python SDK
RUN pip3 install sagemaker-training
ENV PATH="/opt/ml/code:${PATH}"
# Copies the training code inside the container
COPY /code /opt/ml/code
# Defines train_serve.py as script entrypoint
ENV SAGEMAKER_PROGRAM train_serve.py
Below is train_serve.py, the script used for training and serving the model:
import os
import warnings
import sys
import json
import ast
import argparse
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from urllib.parse import urlparse
import logging
import pickle

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='kc_house_data_train.csv')
    parser.add_argument('--test-file', type=str, default='kc_house_data_test.csv')
    parser.add_argument('--features', type=str)  # we ask the user to explicitly name features
    parser.add_argument('--target', type=str)    # we ask the user to explicitly name the target
    args, _ = parser.parse_known_args()

    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Reading training and testing datasets
    logging.info('reading training and testing datasets')
    logging.info(f"{args.train} {args.train_file} {args.test} {args.test_file}")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    logging.info(args.features.split(','))
    logging.info(args.target)

    train_x = np.array(train_df[args.features.split(',')]).reshape(-1, 1)
    test_x = np.array(test_df[args.features.split(',')]).reshape(-1, 1)
    train_y = np.array(train_df[args.target]).reshape(-1, 1)
    test_y = np.array(test_df[args.target]).reshape(-1, 1)

    reg = linear_model.LinearRegression()
    reg.fit(train_x, train_y)
    predicted_price = reg.predict(test_x)
    (rmse, mae, r2) = eval_metrics(test_y, predicted_price)

    logging.info(f" Linear model: (features={args.features}, target={args.target})")
    logging.info(f" RMSE: {rmse}")
    logging.info(f" MAE: {mae}")
    logging.info(f" R2: {r2}")

    model_path = os.path.join(args.model_dir, "model.pkl")
    logging.info(f"saving to {model_path}")
    logging.info(args.model_dir)
    with open(model_path, 'wb') as path:
        pickle.dump(reg, path)


def model_fn(model_dir):
    with open(os.path.join(model_dir, "model.pkl"), "rb") as input_model:
        model = pickle.load(input_model)
    return model


def predict_fn(input_object, model):
    return model.predict(input_object)
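For context, the deployment call itself isn't shown above; with the SageMaker Python SDK and a custom image it generally looks something like the sketch below (image URI, role, S3 URIs, hyperparameters and instance type are placeholders, not values from this setup).

# Illustrative sketch only: typical SageMaker Python SDK flow for a custom container.
# Image URI, role, S3 URIs, hyperparameters and instance type are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='<account-id>.dkr.ecr.us-east-1.amazonaws.com/<repo>:<tag>',
    role='<sagemaker-execution-role-arn>',
    instance_count=1,
    instance_type='ml.m5.large',
    hyperparameters={'features': '<comma-separated-features>', 'target': '<target-column>'},
)
estimator.fit({'train': 's3://<bucket>/train/', 'test': 's3://<bucket>/test/'})

# deploy() is the call that creates the endpoint that then sits in "Creating"
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')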

One way to investigate this is to try using the same model via the AWS console as part of a Batch Transform job, as that flow seems to give better error messages and diagnostics than Inference Endpoint creation.
In my case, this made me realise that the IAM role associated with the model at its creation no longer existed. I'd overlooked this because the roles were CDK-managed and had been removed at some point, while the models were created dynamically via Step Functions pipelines.
Anyway, deploying with a non-existent role left the SageMaker endpoint in the "Creating" state for a few hours before failing with "Request to service failed. If failure persists after retry, contact customer support", and there were no CloudWatch logs. Re-creating the model with a valid role fixed the issue.
Apologies if the above does not apply to the OP, who reports the same problem with a different setup that I'm not familiar with. I'm just sharing my outcome with a similar problem that brought me to this page, in case it helps anyone in future.
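If you'd rather check for this particular failure mode from code, something like the following should work (a minimal sketch using boto3; the model name is a placeholder):

# Minimal sketch: check whether the execution role attached to a SageMaker model
# still exists. The model name is a placeholder.
import boto3
from botocore.exceptions import ClientError

sm = boto3.client('sagemaker')
iam = boto3.client('iam')

role_arn = sm.describe_model(ModelName='your-model-name')['ExecutionRoleArn']
role_name = role_arn.split('/')[-1]

try:
    iam.get_role(RoleName=role_name)
    print(f"Role {role_name} exists")
except ClientError as err:
    if err.response['Error']['Code'] == 'NoSuchEntity':
        print(f"Role {role_name} no longer exists - recreate the model with a valid role")
    else:
        raise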

Related

Issue with django-crontab jobs not working

Hello, I'm trying to use django-crontab in my Django project and it's not working; does anyone know anything about this? I'm using Linux CentOS 8. I want to schedule a task that adds some data to my database. Can someone help me?
The steps that I have taken are:
1) pip install django-crontab
2) add it to the installed apps
3) build my cron function:
from django.core.management.base import BaseCommand
from backups.models import Backups
from devices.models import Devices
from datetime import datetime
from jnpr.junos import Device
from jnpr.junos.exception import ConnectError
from lxml import etree
from django.http import HttpResponse
from django.core.files import File


class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        devices = Devices.objects.all()
        for x in devices:
            devid = Devices.objects.get(pk=x.id)
            ip = x.ip_address
            username = x.username
            password = x.password
            print(devid, ip, username, password)
            dev1 = Device(host=ip, user=username, passwd=password)
            try:
                dev1.open()
                stype = "success"
                dataset = dev1.rpc.get_config(options={'format': 'set'})
                datatext = dev1.rpc.get_config(options={'format': 'text'})
                result = etree.tostring(dataset, encoding='unicode')
                file_name = f'{ip}_{datetime.now().date()}.txt'
                print(file_name)
                with open(f'media/{file_name}', 'w') as f:
                    f.write(etree.tostring(dataset, encoding='unicode'))
                    f.write(etree.tostring(datatext, encoding='unicode'))
                backup = Backups(device_id=devid, host=ip, savetype=stype, time=datetime.now(), backuptext=file_name)
                print(backup)
                backup.save()
            except ConnectError as err:
                print("Cannot connect to device: {0}".format(err))
                print("----- Failed ----------")
                stype = "Cannot connect to device: {0}".format(err)
                backup = Backups(device_id=devid, host=ip, savetype=stype, time=datetime.now())
                backup.save()
4) add my cronjob to my settings.py file:
CRONJOBS = [ ('*/5 * * * *', 'django.core.management.call_command', ['backup-dev']), ]
5) python manage.py crontab add
6) python manage.py crontab show
Currently active jobs in crontab:
0662c1224789b131740fddef54f273c1 -> ('* * * * *', 'django.core.management.call_command', ['backup-dev'])
and it's still not working. Any ideas?
When I run the command "python manage.py backup-dev" directly, my task works perfectly.
I also tried adding the management command directly to the CentOS machine's crontab with
crontab -e
and still nothing. Any ideas?

How to deploy a Pre-Trained model using AWS SageMaker Notebook Instance?

I have a pre-trained model which I am loading in an AWS SageMaker notebook instance from an S3 bucket, and when I provide a test image from the S3 bucket for prediction it gives me accurate results. I want to deploy it so that I have an endpoint which I can integrate with an AWS Lambda function and AWS API Gateway, so that I can use the model in a real-time application.
Any idea how I can deploy the model from the AWS SageMaker notebook instance and get its endpoint?
Code inside the .ipynb file is given below for reference.
import boto3
import pandas as pd
import sagemaker
#from sagemaker import get_execution_role
from skimage.io import imread
from skimage.transform import resize
import numpy as np
from keras.models import load_model
import os
import time
import json
#role = get_execution_role()
role = sagemaker.get_execution_role()
bucketname = 'bucket' # bucket where the model is hosted
filename = 'test_model.h5' # name of the model
s3 = boto3.resource('s3')
image= s3.Bucket(bucketname).download_file(filename, 'test_model_new.h5')
model= 'test_model_new.h5'
model = load_model(model)
bucketname = 'bucket' # name of the bucket where the test image is hosted
filename = 'folder/image.png' # prefix
s3 = boto3.resource('s3')
file= s3.Bucket(bucketname).download_file(filename, 'image.png')
file_name='image.png'
test=np.array([resize(imread(file_name), (137, 310, 3))])
test_predict = model.predict(test)
print ((test_predict > 0.5).astype(np.int))
Here is the solution that worked for me; simply follow these steps.
1 - Load your model in SageMaker's Jupyter environment with the help of
from keras.models import load_model
model = load_model (<Your Model name goes here>) #In my case it's model.h5
2 - Now that the model is loaded, convert it into the protobuf (SavedModel) format required by AWS with the help of
def convert_h5_to_aws(loaded_model):
    from tensorflow.python.saved_model import builder
    from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
    from tensorflow.python.saved_model import tag_constants

    model_version = '1'
    export_dir = 'export/Servo/' + model_version

    # Build the Protocol Buffer SavedModel at 'export_dir'
    builder = builder.SavedModelBuilder(export_dir)

    # Create prediction signature to be used by TensorFlow Serving Predict API
    signature = predict_signature_def(
        inputs={"inputs": loaded_model.input}, outputs={"score": loaded_model.output})

    from keras import backend as K
    with K.get_session() as sess:
        # Save the meta graph and variables
        builder.add_meta_graph_and_variables(
            sess=sess, tags=[tag_constants.SERVING],
            signature_def_map={"serving_default": signature})
        builder.save()

convert_h5_to_aws(model)

# Package the exported SavedModel and upload it to S3
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)

import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
3 - And now you can deploy your model with the help of
!touch train.py
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  framework_version='1.15.2',
                                  entry_point='train.py')

%%time
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')
This will generate the endpoint which can be seen in the Inference section of the Amazon SageMaker and with the help of that endpoint you can now make predictions from the jupyter notebook as well as from web and mobile applications.
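Once the endpoint is live you can also call it outside the notebook with boto3; here is a rough sketch (the endpoint name and payload shape are placeholders for whatever your model expects):

# Rough sketch of invoking the deployed endpoint with boto3.
# The endpoint name and payload shape are placeholders.
import json
import boto3

runtime = boto3.client('sagemaker-runtime')
payload = {"instances": [[0.0] * 10]}  # replace with your model's expected input

response = runtime.invoke_endpoint(
    EndpointName='your-endpoint-name',
    ContentType='application/json',
    Body=json.dumps(payload),
)
print(json.loads(response['Body'].read()))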
This YouTube tutorial by Liam and the AWS blog by Priya helped me a lot.

Run RASA with flask

I want to run RASA with --enable-api from inside the Python code rather than from the command line. Below is my code, which is not working; let me know how I can do that. The issue is that once I hit the service, because the channel is 'cmdline', it drops back to the command line. I don't know how to resolve this.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import logging

import rasa_core
from rasa_core.agent import Agent
from rasa_core.policies.keras_policy import KerasPolicy
from rasa_core.policies.memoization import MemoizationPolicy
from rasa_core.interpreter import RasaNLUInterpreter
from rasa_core.utils import EndpointConfig
from rasa_core.run import serve_application
from rasa_core import config
from rasa_core.policies.fallback import FallbackPolicy

from flask import Flask
from flask_cors import CORS, cross_origin

app = Flask(__name__)
CORS(app)
logger = logging.getLogger(__name__)

@app.route("/conversations/default/respond", methods=['POST'])
def run_weather_bot(serve_forever=True):
    logging.basicConfig(level="ERROR")
    interpreter = RasaNLUInterpreter('C:\\xxxx_nlu\\models\\nlu\\default\\weathernlu')
    action_endpoint = EndpointConfig(url="http://xxx.xx.xx.xxx:5055/webhook")
    agent = Agent.load('C:\\xxxx_nlu\\models\\dialogue', interpreter=interpreter, action_endpoint=action_endpoint)
    rasa_core.run.serve_application(agent, channel='cmdline')
    return agent

if __name__ == '__main__':
    app.run("xxx.xx.xx.xxx", 5005, debug=True)
You're running the Rasa bot as a command-line application in your run_weather_bot function using the command below.
rasa_core.run.serve_application(agent,channel='cmdline')
As you can see, it is served as a command-line application.
I have made some changes to your code to hold a conversation with the Rasa chatbot. You can refer to the Agent documentation and the weather-bot article for how to connect a RASA Agent and how it handles the input message.
def rasa_agent():
    interpreter = RasaNLUInterpreter("Path for NLU")
    action_endpoint = EndpointConfig(url="Webhook URL")
    agent = Agent.load('Path to Dialogue', interpreter=interpreter, action_endpoint=action_endpoint)
    ## Next line runs rasa in the command line
    # rasa_core.run.serve_application(agent, channel='cmdline')
    return agent

@app.route("/conversations/default/respond", methods=['POST'])
def run_weather_bot(serve_forever=True):
    agent = rasa_agent()  # calling rasa agent
    ## Collect query from POST request
    ## Send query to agent
    ## Get response of bot
    output = {}  ## Append output
    return jsonify(output)
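A minimal sketch of what those placeholder steps could look like, assuming the old rasa_core Agent API (agent.handle_text) and a JSON request body like {"query": "..."} (both are assumptions, not part of the original answer):

# Sketch only: fills in the placeholder steps above.
# Assumes rasa_core's agent.handle_text and a JSON body like {"query": "..."}.
from flask import request, jsonify

@app.route("/conversations/default/respond", methods=['POST'])
def run_weather_bot(serve_forever=True):
    agent = rasa_agent()                          # load the rasa agent
    query = request.get_json().get("query", "")   # collect query from POST request
    responses = agent.handle_text(query)          # send query to agent, get bot responses
    return jsonify({"responses": responses})      # list of bot messages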

Python: Erratic joblib behaviour on Flask

I am trying to deploy a machine learning model on an AWS EC2 instance using Flask. These are sklearn fitted Random Forest models that are pickled using joblib. When I host Flask on localhost and load them into memory, everything runs smoothly. However, when I deploy it on the apache2 server using mod_wsgi, joblib works sometimes (i.e. the models load) and other times the server just hangs. There are no errors in the logs. Any ideas would be appreciated.
Here is the relevant code that I am using:
# In[49]:
from flask import Flask, jsonify, request, render_template
from datetime import datetime
from sklearn.externals import joblib
import pickle as pkl
import os

# In[50]:
app = Flask(__name__, template_folder="/home/ubuntu/flaskapp/")

# In[51]:
log = lambda msg: app.logger.info(msg, extra={'worker_id': "request.uuid"})

# Logger
import logging
handler = logging.FileHandler('/home/ubuntu/app.log')
handler.setLevel(logging.ERROR)
app.logger.addHandler(handler)

# In[52]:
@app.route('/')
def host_template():
    return render_template('Static_GUI.html')

# In[53]:
def load_models(path):
    model_arr = [0] * len(os.listdir(path))
    for filename in os.listdir(path):
        f = open(path + "/" + filename, 'rb')
        model_arr[int(filename[2:])] = joblib.load(f)
        print("Classifier ", filename[2:], " added.")
        f.close()
    return model_arr

# In[54]:
partition_limit = 30

# In[55]:
print("Dictionaries being loaded.")
dict_file_path = "/home/ubuntu/Dictionaries/VARR"
dictionaries = pkl.load(open(dict_file_path, "rb"))
print("Dictionaries Loaded.")

# In[56]:
print("Begin loading classifiers.")
model_path = "/home/ubuntu/RF_Models/"
classifier_arr = load_models(model_path)
print("Classifiers Loaded.")

if __name__ == '__main__':
    log("/home/ubuntu/print.log")
    print("Starting API")
    app.run(debug=True)
I was stuck with this for quite some time. Posting the answer in case someone runs into this problem. Using print statements and looking at logs, I narrowed the problem down to the joblib.load statement. I found this awesome blog: http://blog.rtwilson.com/how-to-fix-flask-wsgi-webapp-hanging-when-importing-a-module-such-as-numpy-or-matplotlib
The idea of using a global application group fixed the problem: it forces the app to run in the main Python interpreter, just as the top comment on that blog page mentions.
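For reference, this is typically done in the Apache configuration with mod_wsgi's WSGIApplicationGroup directive; a sketch (the daemon process name and paths are placeholders):

# Run the WSGI app in the main Python interpreter of the daemon process.
# Daemon process name and paths below are placeholders.
WSGIDaemonProcess flaskapp python-home=/home/ubuntu/venv
WSGIProcessGroup flaskapp
WSGIScriptAlias / /home/ubuntu/flaskapp/flaskapp.wsgi
WSGIApplicationGroup %{GLOBAL}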

Scheduled, timestamped sqlite3 .backup?

I'm running a small db on PythonAnywhere, and am trying to set up a scheduled .backup of my sqlite3 database. Is there any way, on the command line, to add a time/date stamp to the filename so that it doesn't overwrite the previous day's backup?
Here's the code I'm using, if it matters:
sqlite3 db.sqlite3
.backup dbbackup.sqlite3
.quit
Running every 24 hours. The previous day's backup gets overwritten, though. I'd love to just be able to save it as dbbackup.timestamp.sqlite3 or something, so I could have multiple backups available.
Thanks!
I suggest you handle this case with a management command and a cronjob.
This is an example of how to do it; save this file e.g. in yourapp/management/commands/dbackup.py, and don't forget to add the __init__.py files.
yourapp/management/__init__.py
yourapp/management/commands/__init__.py
yourapp/management/commands/dbackup.py
But first, add these lines below to your settings.py
USERNAME_SUPERUSER = 'yourname'
PASSWORD_SUPERUSER = 'yourpassword'
EMAIL_SUPERUSER = 'youremail@domain.com'
DATABASE_NAME = 'db.sqlite3'
The important project paths if you're deploying on PythonAnywhere:
/home/yourusername/yourproject/manage.py
/home/yourusername/yourproject/db.sqlite3
/home/yourusername/yourproject/yourproject/settings.py
/home/yourusername/yourproject/yourapp/management/commands/dbackup.py
Add the script below to yourapp/management/commands/dbackup.py; you can also customize this script as you need.
import os
import time

from django.conf import settings
from django.contrib.auth.models import User
from django.core.management.base import (BaseCommand, CommandError)

USERNAME_SUPERUSER = settings.USERNAME_SUPERUSER
PASSWORD_SUPERUSER = settings.PASSWORD_SUPERUSER
EMAIL_SUPERUSER = settings.EMAIL_SUPERUSER
DATABASE_NAME = settings.DATABASE_NAME  # eg: 'db.sqlite3'


class Command(BaseCommand):
    help = ('Command to deploy and backup the latest database.')

    def add_arguments(self, parser):
        parser.add_argument(
            '-b', '--backup', action='store_true',
            help='Just backup command confirmation.'
        )

    def success_info(self, info):
        return self.stdout.write(self.style.SUCCESS(info))

    def error_info(self, info):
        return self.stdout.write(self.style.ERROR(info))

    def handle(self, *args, **options):
        backup = options['backup']
        if backup == False:
            return self.print_help('manage.py', 'dbackup')

        # Removing media files, if you need to remove all media files
        # os.system('rm -rf media/images/')
        # self.success_info("[+] Removed media files at `media/images/`")

        # Backing up and removing the database `db.sqlite3`
        if os.path.isfile(DATABASE_NAME):
            # backup the latest database, eg to: `db.2017-02-03.sqlite3`
            backup_database = 'db.%s.sqlite3' % time.strftime('%Y-%m-%d')
            os.rename(DATABASE_NAME, backup_database)
            self.success_info("[+] Backup the database `%s` to %s" % (DATABASE_NAME, backup_database))
            # os.rename has already moved the old database out of the way,
            # so there is nothing left to delete here.

        # Removing all files migrations for `yourapp`
        def remove_migrations(path):
            exclude_files = ['__init__.py', '.gitignore']
            path = os.path.join(settings.BASE_DIR, path)
            filelist = [
                f for f in os.listdir(path)
                if f.endswith('.py')
                and f not in exclude_files
            ]
            for f in filelist:
                os.remove(path + f)
            self.success_info('[+] Removed files migrations for {}'.format(path))

        # do remove all files migrations
        remove_migrations('yourapp/migrations/')

        # Removing all `.pyc` files
        os.system('find . -name "*.pyc" -delete')
        self.success_info('[+] Removed all *.pyc files.')

        # Creating database migrations
        # These commands should re-generate the new database, eg: `db.sqlite3`
        os.system('python manage.py makemigrations')
        os.system('python manage.py migrate')
        self.success_info('[+] Created database migrations.')

        # Creating a superuser
        user = User.objects.create_superuser(
            username=USERNAME_SUPERUSER,
            password=PASSWORD_SUPERUSER,
            email=EMAIL_SUPERUSER
        )
        user.save()
        self.success_info('[+] Created a superuser for `{}`'.format(USERNAME_SUPERUSER))
Set up this command with crontab
$ sudo crontab -e
And add the following line;
# [minute] [hour] [day of month] [month] [day of week]
59 23 * * * source ~/path/to/yourenv/bin/activate && cd ~/path/to/yourenv/yourproject/ && ./manage.py dbackup -b
But if you deploy on PythonAnywhere, you just need to add a scheduled task:
Daily at [hour] : [minute] UTC, ... fill in hour=23 and minute=59
source /home/yourusername/.virtualenvs/yourenv/bin/activate && cd /home/yourusername/yourproject/ && ./manage.py dbackup -b
Update 1
I suggest you update the commands that execute manage.py, such as os.system('python manage.py makemigrations'), to use the call_command function;
from django.core.management import call_command
call_command('collectstatic', verbosity=3, interactive=False)
call_command('migrate', 'myapp', verbosity=3, interactive=False)
...which is equivalent to typing the following commands in a terminal:
$ ./manage.py collectstatic --noinput -v 3
$ ./manage.py migrate myapp --noinput -v 3
See running management commands from django docs.
Update 2
The previous approach is for when you need to re-deploy your project with a fresh database. But if you only want to back up the database by copying it to a timestamped name, you can use shutil.copyfile:
import os
import time
import shutil

from django.conf import settings
from django.core.management.base import (BaseCommand, CommandError)

DATABASE_NAME = settings.DATABASE_NAME  # eg: 'db.sqlite3'


class Command(BaseCommand):
    help = ('Command to deploy and backup the latest database.')

    def add_arguments(self, parser):
        parser.add_argument(
            '-b', '--backup', action='store_true',
            help='Just backup command confirmation.'
        )

    def success_info(self, info):
        return self.stdout.write(self.style.SUCCESS(info))

    def error_info(self, info):
        return self.stdout.write(self.style.ERROR(info))

    def handle(self, *args, **options):
        backup = options['backup']
        if backup == False:
            return self.print_help('manage.py', 'dbackup')

        if os.path.isfile(DATABASE_NAME):
            # backup the latest database, eg to: `db.2017-02-03.sqlite3`
            backup_database = 'db.%s.sqlite3' % time.strftime('%Y-%m-%d')
            shutil.copyfile(DATABASE_NAME, backup_database)
            self.success_info("[+] Backup the database `%s` to %s" % (DATABASE_NAME, backup_database))