How to install python packages within Amazon Sagemaker Processing Job? - amazon-web-services

I am trying to create a Sklearn processing job in Amazon SageMaker to perform some data transformation on my input data before I do model training.
I wrote a custom python script, preprocessing.py, which does the required transformations. I use some python packages in this script. Here is the SageMaker example I followed.
When I try to submit the Processing Job I get an error -
............................Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that my processing job is unable to find this package and I need to install it. My question is how can I accomplish this using Sagemaker Processing Job API? Ideally there should be a way to define a requirements.txt in the API call, but I don't see such functionality in the docs.
I know I can create a custom Image with relevant packages and later use this image in the Processing Job, but this seems too much work for something that should be built-in?
Is there an easier/more elegant way to install the packages needed in a SageMaker Processing Job?

One way would be to call pip from Python:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
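For the error in the question, a minimal sketch of this approach at the top of preprocessing.py might look like the following (assuming the missing module comes from the snowflake-connector-python distribution on PyPI):

import subprocess
import sys

# Install the missing dependency at runtime, before importing it.
subprocess.check_call([sys.executable, "-m", "pip", "install", "snowflake-connector-python"])

import snowflake.connector  # now resolves inside the processing container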
Another way would be to use an SKLearn Estimator (training job) instead to do the same thing. You can provide source_dir, which can include a requirements.txt file, and those requirements will be installed for you:
estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
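A rough sketch of how the pieces fit together with this workaround (the directory name and S3 URI are placeholders):

# ./foo/                    <- source_dir, uploaded to the training container
# ├── foo.py                <- entry_point; put the preprocessing logic here
# └── requirements.txt      <- e.g. one line: snowflake-connector-python
#
# The requirements are pip-installed in the container before foo.py runs.
estimator.fit({"train": "s3://my-bucket/input-data/"})  # placeholder channel/URI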

Related

Running Taurus BlazeMeter on AWS Lambda

I am trying to run a BlazeMeter Taurus script with a JMeter script inside via AWS Lambda. I'm hoping that there is a way to run bzt via a local installation in /tmp/bzt instead of looking for a bzt installation on the system, which doesn't really exist since it's Lambda.
This is my lambda_handler.py:
import subprocess
import json

def run_taurus_test(event, context):
    subprocess.call(['mkdir', '/tmp/bzt/'])
    subprocess.call(['pip', 'install', '--target', '/tmp/bzt/', 'bzt'])
    # subprocess.call('ls /tmp/bzt/bin'.split())
    subprocess.call(['/tmp/bzt/bin/bzt', 'tests/taurus_test.yaml'])
    return {
        'statusCode': 200,
        'body': json.dumps('Executing Taurus Test hopefully!')
    }
The taurus_test.yaml runs as expected when testing on my computer with bzt installed normally via pip, so I know the issue isn't with the test script. The same traceback as below appears if I uninstall bzt from my system and try to use a local installation targeted in a certain directory.
This is the traceback in the execution results:
Traceback (most recent call last):
File "/tmp/bzt/bin/bzt", line 5, in <module>
from bzt.cli import main
ModuleNotFoundError: No module named 'bzt'
It's /tmp/bzt/bin/bzt, the executable itself, that's failing, and I think that's because it's not using the local/targeted installation.
So I'm hoping there is a way to make bzt keep using the targeted installation in /tmp/bzt, instead of the executable there handing off to an installation that doesn't exist anywhere else. Feedback on whether AWS Fargate or EC2 would be better suited for this is also appreciated.
Depending on the size of the bzt package, the solutions are:
Use the recent Lambda container image feature; this way, what you run locally is what you get on Lambda.
Use Lambda layers (similar to Docker): the layer has the bzt module in its python directory, as described there.
When you package your Lambda, instead of uploading a single Python file, create a ZIP file containing both /path/to/zip_root/lambda_handler.py and the output of pip install --target /path/to/zip_root bzt.
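With the third option, bzt ends up next to lambda_handler.py inside the deployment package, so it can be imported directly instead of shelling out to /tmp/bzt/bin/bzt. A rough, untested sketch (it assumes bzt.cli.main() reads its arguments from sys.argv, and it keeps the question's test path):

import json
import sys

def run_taurus_test(event, context):
    # bzt was bundled with `pip install --target /path/to/zip_root bzt`,
    # so it sits next to this file and imports without any /tmp install.
    from bzt.cli import main

    sys.argv = ['bzt', 'tests/taurus_test.yaml']  # assumption: main() parses sys.argv
    exit_code = 0
    try:
        main()
    except SystemExit as exc:  # the CLI may finish via sys.exit()
        exit_code = exc.code or 0
    return {'statusCode': 200, 'body': json.dumps({'bzt_exit_code': exit_code})}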

Training keras model in AWS Sagemaker

I have a keras training script on my machine. I am experimenting with running my script in an AWS SageMaker container. For that I have used the code below.
from sagemaker.tensorflow import TensorFlow

est = TensorFlow(
    entry_point="caller.py",
    source_dir="./",
    role='role_arn',
    framework_version="2.3.1",
    py_version="py37",
    instance_type='ml.m5.large',
    instance_count=1,
    hyperparameters={'batch': 8, 'epochs': 10},
)
est.fit()
Here caller.py is my entry point. After executing the above code I get an error that keras is not installed. Here is the stack trace.
Traceback (most recent call last):
File "executor.py", line 14, in <module>
est.fit()
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 682, in fit
self.latest_training_job.wait(logs=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 1625, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3681, in logs_for_job
self._check_job_status(job_name, description, "TrainingJobStatus")
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3240, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-06-09-07-14-01-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 caller.py --batch 4 --epochs 10
ModuleNotFoundError: No module named 'keras'
Which instance comes with keras pre-installed?
Is there any way I can install a python package into the AWS container? Or is there any workaround for the issue?
Note: I have tried with my own container uploaded to ECR and successfully ran my code. I am looking for AWS's existing container capability.
Keras is now part of tensorflow, so you can just reformat your code to use tf.keras instead of keras. Since tensorflow version 2.3.0 they are in sync, so it should not be that difficult.
Your container is this one; as you can see from the list of packages, there is no Keras.
If you instead want to extend a pre-built container you can take a look here, but I don't recommend it in this specific use case, because for future code maintainability you should go for tf.keras anyway.
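A minimal sketch of that reformatting, assuming a plain sequential model (the layer sizes are placeholders):

import tensorflow as tf

# Before: `from keras.models import Sequential` fails in the pre-built container,
# because the standalone keras package is not installed. The tf.keras API that
# ships inside tensorflow is a drop-in replacement.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')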

Custom PyPI repo for Google Dataflow workers

I want to use a custom pypi repo for my Dataflow workers. Typically, to configure a custom pypi repo, you would edit /etc/pip.conf to look like this:
[global]
index-url = https://pypi.customer.com/
Since I can't run a startup script for Dataflow workers, my thought was to perform this operation at the head of my setup.py file, so that as the script executes, it would update /etc/pip.conf before attempting a pip install of the dependencies.
My setup.py looks like the following:
import setuptools

with open('/etc/pip.conf', 'w') as pip_conf:
    pip_conf.write("""
[global]
index-url = https://artifactory.mayo.edu/artifactory/api/pypi/pypi-remote/simple
""")

REQUIRED_PACKAGES = [
    'custom_package',
]

setuptools.setup(
    name='wordcount',
    version='0.0.1',
    description='demo package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages())
The odd thing is that my workers are hanging indefinitely. When I ssh into them, I see some Docker containers running, but I am not sure how to debug further.
Any suggestions on how I can hack the Dataflow workers to use a custom pypi url?
This is likely a good candidate for custom containers, where you can install everything exactly as you want rather than having to hack the worker startup sequence.
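For reference, a rough sketch (not from the original post) of pointing a Dataflow job at such a custom container image; the project, bucket, and image URI are placeholders, and the exact option names depend on your Apache Beam SDK version:

from apache_beam.options.pipeline_options import PipelineOptions

# The image is built beforehand with the dependencies pre-installed from the
# custom PyPI repo, so nothing has to be pip-installed at worker startup.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                                         # placeholder
    region='us-central1',                                         # placeholder
    temp_location='gs://my-bucket/tmp',                           # placeholder
    sdk_container_image='gcr.io/my-project/beam-custom:latest',   # placeholder image URI
)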

What to define as entrypoint when initializing a pytorch estimator with a custom docker image for training on AWS Sagemaker?

So I created a docker image for training. In the Dockerfile I have an entrypoint defined such that when docker run is executed, it starts running my python code.
To use this on AWS SageMaker, my understanding is that I need to create a PyTorch estimator in a Jupyter notebook in SageMaker. I tried something like this:
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.3.1',
                    image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                    train_instance_count=1,
                    train_instance_type='ml.p3.xlarge',
                    hyperparameters={})
estimator.fit({})
In the documentation I found that as the image name I can specify the link to my docker image on AWS ECR. When I try to execute this it keeps complaining
[Errno 2] No such file or directory: 'train.py'
It complains immediately, so surely I am doing something completely wrong. I would expect my docker image to run first, and only then could it find out that the entry point does not exist.
But besides this, why do I need to specify an entry point at all? Should it not be clear that the entry to my training is simply docker run?
For a better understanding, the entrypoint python file in my docker image looks like this:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=16)
    parser.add_argument('--learning_rate', type=float, default=0.0001)

    # Data and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ['OUTPUT_DATA_DIR'])
    parser.add_argument('--train_data_path', type=str, default=os.environ['CHANNEL_TRAIN'])
    parser.add_argument('--valid_data_path', type=str, default=os.environ['CHANNEL_VALID'])

    # Start training
    ...
Later I would like to specify the hyperparameters and data channels. But for now I simply do not understand what to put as the entry point. The documentation says the entry point is required and should be a local/global path to it...
If you really want to use a completely separate, self-built docker image, you should create an Amazon SageMaker algorithm (which is one of the options in the SageMaker menu). There you have to specify a link to your docker image on Amazon ECR as well as the input parameters, data channels, etc. When choosing this option, you should not use the PyTorch estimator but the Algorithm estimator. That way you indeed don't have to specify an entrypoint, because it simply runs the docker image when training and the default entrypoint can be defined in your Dockerfile.
The PyTorch estimator is used when you have your own model code but would like to run it in an off-the-shelf SageMaker PyTorch docker image. This is why you have to specify, for example, the PyTorch framework version. In this case the entrypoint file by default should be placed next to where your Jupyter notebook is stored (just upload the file by clicking on the upload button). The PyTorch estimator inherits all options from the Framework estimator, where you can find the options for where to place the entrypoint and model, for example source_dir.
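To illustrate the first option, a rough sketch with the generic Estimator class, which runs a fully custom image and takes no entry_point (parameter names follow the SDK v1 style used in the question; the image URI and channel are placeholders):

import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

# No entry_point: whatever ENTRYPOINT/CMD the Dockerfile defines is what runs.
estimator = Estimator(image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p3.xlarge',
                      hyperparameters={'epochs': 5, 'batch_size': 16})

estimator.fit({'train': 's3://my-bucket/train-data/'})  # placeholder channel/URI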

Sklearn on aws lambda

I want to use sklearn on AWS Lambda. sklearn has dependencies on scipy (173 MB) and numpy (75 MB). The combined size of all these packages exceeds the AWS Lambda disk space limit of 256 MB.
How can I use AWS lambda to use sklearn?
This guy gets it down to 40MB, though I have not tried it myself yet.
The relevant Github repo.
There are two ways to do this:
1) installing the modules dynamically
2) AWS Batch
1) Installing the modules dynamically
import subprocess
import sys

def lambdahandler(event, context):
    # install the numpy package into /tmp (the only writable path on Lambda)
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--target', '/tmp/pkgs', 'numpy'])
    sys.path.insert(0, '/tmp/pkgs')
    # ... numpy code ...
    # uninstall the numpy package (remove /tmp/pkgs) to free the disk space
    ## now install the scipy package the same way
    # ... execute scipy code ...
or vice versa, depending on your code.
2) Using AWS Batch
This is the best way, as you don't have any limitation regarding memory or disk space.
You just need to build a Docker image and list all the required packages and libraries in a requirements.txt file.
I wanted to do the same, and it was very difficult indeed. I ended up buying this layer that includes scikit-learn, pandas, numpy and scipy.
https://www.awslambdas.com/layers/3/aws-lambda-scikit-learn-numpy-scipy-python38-layer
There is another layer that includes xgboost as well.