Training keras model in AWS Sagemaker - amazon-web-services

I have keras training script on my machine. I am experimenting to run my script on AWS sagemaker container. For that I have used below code.
from sagemaker.tensorflow import TensorFlow
est = TensorFlow(
entry_point="caller.py",
source_dir="./",
role='role_arn',
framework_version="2.3.1",
py_version="py37",
instance_type='ml.m5.large',
instance_count=1,
hyperparameters={'batch': 8, 'epochs': 10},
)
est.fit()
here caller.py is my entry point. After executing the above code I am getting keras is not installed. Here is the stacktrace.
Traceback (most recent call last):
File "executor.py", line 14, in <module>
est.fit()
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 682, in fit
self.latest_training_job.wait(logs=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 1625, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3681, in logs_for_job
self._check_job_status(job_name, description, "TrainingJobStatus")
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3240, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-06-09-07-14-01-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 caller.py --batch 4 --epochs 10
ModuleNotFoundError: No module named 'keras'
Which instance has pre-installed keras?
Is there any way I can install the python package to the AWS container? or any workaround for the issue?
Note: I have tried with my own container uploading to ECR and successfully run my code. I am looking for AWS's existing container capability.

Keras is now part of tensorflow, so you can just reformat your code to use tf.keras instead of keras. Since version 2.3.0 of tensorflow they are in sync, so it should not be that difficult.
You container is this, as you can see from the list of the packages, there is no Keras.
If you instead want to extend a pre-built container you can take a look here but I don't recommend in this specific use-case, because also for future code maintainability you should go for tf.keras

Related

Running Taurus BlazeMeter on AWS Lambda

I am trying to run a BlazeMeter Taurus script with a JMeter script inside via AWS Lambda. I'm hoping that there is a way to run bzt via a local installation in /tmp/bzt instead of looking for a bzt installation on the system which doesn't really exist since its lambda.
This is my lambda_handler.py:
import subprocess
import json
def run_taurus_test(event, context):
subprocess.call(['mkdir', '/tmp/bzt/'])
subprocess.call(['pip', 'install', '--target', '/tmp/bzt/', 'bzt'])
# subprocess.call('ls /tmp/bzt/bin'.split())
subprocess.call(['/tmp/bzt/bin/bzt', 'tests/taurus_test.yaml'])
return {
'statusCode': 200,
'body': json.dumps('Executing Taurus Test hopefully!')
}
The taurus_test.yaml runs as expected when testing on my computer with bzt installed via pip normally, so I know the issue isn't with the test script. The same traceback as below appears if I uninstall bzt from my system and try use a local installation targeted in a certain directory.
This is the traceback in the execution results:
Traceback (most recent call last):
File "/tmp/bzt/bin/bzt", line 5, in <module>
from bzt.cli import main
ModuleNotFoundError: No module named 'bzt'
It's technically failing in /tmp/bzt/bin/bzt which is the executable that's failing, and I think it is because it's not using the local/targeted installation.
So, I'm hoping there is a way to tell bzt to use keep using the targeted installation in /tmp/bzt instead of calling the executable there and then trying to pass it on to an installation that doesn't exist elsewhere. Feedback if AWS Fargate or EC2 would be better suited for this is also appreciated.
Depending on the size of the bzt package, the solutions are:
Use Lambda Docker recent feature, and this way, what you run locally is what you get on Lambda.
Use Lambda layers (similar to Docker), this layer as the btz module in the python directory as described there
When you package your Lambda, instead of uploading a simple Python file, create a ZIP file containing both: /path/to/zip_root/lambda_handler.py and pip install --target /path/to/zip_root

How to install python packages within Amazon Sagemaker Processing Job?

I am trying to create a Sklearn processing job in Amazon Sagemekar to perform some data transformation of my input data before I do model training.
I wrote a custom python script preprocessing.py which does the needful. I use some python package in this script. Here is the Sagemaker example I followed.
When I try to submit the Processing Job I get an error -
............................Traceback (most recent call last):
File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'
I understand that my processing job is unable to find this package and I need to install it. My question is how can I accomplish this using Sagemaker Processing Job API? Ideally there should be a way to define a requirements.txt in the API call, but I don't see such functionality in the docs.
I know I can create a custom Image with relevant packages and later use this image in the Processing Job, but this seems too much work for something that should be built-in?
Is there an easier/elegant way to install packages needed in Sagemaker Processing Job ?
One way would be to call pip from Python:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
Another way would be to use an SKLearn Estimator (training job) instead, to do the same thing. You can provide the source_dir, which can include a requirements.txt file, and these requirements will be installed for you
estimator = SKLearn(
entry_point="foo.py",
source_dir="./foo", # no trailing slash! put requirements.txt here
framework_version="0.23-1",
role = ...,
instance_count = 1,
instance_type = "ml.m5.large"
)

googlecloudsdk.core.exceptions.RequiresAdminRightsError: You cannot perform this action because you do not have permission

I am trying to install google-cloud-sdk in ubuntu-18.04. I am following offical docs given here. When I run ./google-cloud-sdk/install.sh I get following error:-
Welcome to the Google Cloud SDK!
To help improve the quality of this product, we collect anonymized usage data
and anonymized stacktraces when crashes are encountered; additional information
is available at <https://cloud.google.com/sdk/usage-statistics>. This data is
handled in accordance with our privacy policy
<https://policies.google.com/privacy>. You may choose to opt in this
collection now (by choosing 'Y' at the below prompt), or at any time in the
future by running the following command:
gcloud config set disable_usage_reporting false
Do you want to help improve the Google Cloud SDK (y/N)? N
Traceback (most recent call last):
File "/home/vineet/./google-cloud-sdk/bin/bootstrapping/install.py", line 225, in <module>
main()
File "/home/vineet/./google-cloud-sdk/bin/bootstrapping/install.py", line 200, in main
Prompts(pargs.usage_reporting)
File "/home/vineet/./google-cloud-sdk/bin/bootstrapping/install.py", line 123, in Prompts
scope=properties.Scope.INSTALLATION)
File "/home/vineet/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2406, in PersistProperty
config.EnsureSDKWriteAccess()
File "/home/vineet/google-cloud-sdk/lib/googlecloudsdk/core/config.py", line 198, in EnsureSDKWriteAccess
raise exceptions.RequiresAdminRightsError(sdk_root)
googlecloudsdk.core.exceptions.RequiresAdminRightsError: You cannot perform this action because you do not have permission to modify the Google Cloud SDK installation directory [/home/vineet/google-cloud-sdk].
Re-run the command with sudo: sudo /home/vineet/google-cloud-sdk/bin/gcloud ...
I tried to search it on stackoverflow and github-issues but in vain.
Would appreciate any hint to solve it.
As stated on the error message.
Re-run the command with sudo: sudo /home/vineet/google-cloud-sdk/bin/gcloud ...
The install.sh script should be run using sudo.
There are also other alternatives to install the Google Cloud SDK in Ubuntu 18.04 just as installing the package with apt-get as explained on the documentation.

AWS DataPipeline Maching Learning AMI tensorflow issues

I'm running the AWS Machine Learning AMI on an EC2 instance. I've confirmed that from the terminal, both in python and jupyter can run
import tensorflow as tf
along with
python pytest.py
from the terminal (which contains the above tensorflow import), with no issues.
I'm now trying to automate my script using DataPipeline along with TaskRunner. The bash command in DataPipeline is again, just:
python pytest.py
However, Immediately get the following error:
Traceback (most recent call last): File "pytest.py", line 1, in
import tensorflow as tf File "/usr/lib/python2.7/dist-packages/tensorflow/init.py", line 24, in
from tensorflow.python import * File "/usr/lib/python2.7/dist-packages/tensorflow/python/init.py", line
72, in
raise ImportError(msg) ImportError: Traceback (most recent call last): File
"/usr/lib/python2.7/dist-packages/tensorflow/python/init.py", line
61, in
from tensorflow.python import pywrap_tensorflow File "/usr/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py",
line 28, in
_pywrap_tensorflow = swig_import_helper() File "/usr/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py",
line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description) ImportError: libcudart.so.7.5: cannot open shared object
file: No such file or directory
Failed to load the native TensorFlow runtime.
See
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#import_error
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
It seems like AWS DataPipeline (or TaskRunner?) uses a different enviornment setup, because again, I have no issues running the script through an ssh terminal to the instance. I found a few posts which suggested adding cuda to the LD_LIBRARY_PATH, but the AMI instance already has it:
echo $LD_LIBRARY_PATH
/home/ec2-user/src/torch/install/lib:/home/ec2-user/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/home/ec2-user/src/mxnet/mklml_lnx_2017.0.1.20161005/lib:
which clearly contains the cuda librarypath that tensorflow needs.

gsutil: ImportError: No module named google

As of Friday 11th of February, 2016, gsutil has suddenly stopped working. I run nightly backups using gsutil, and prior to executing I perform a gcloud components update.
$ gsutil --version
Traceback (most recent call last):
File "/home/IRUser/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 12, in <module>
import bootstrapping
File "/home/IRUser/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 9, in <module>
import setup
File "/home/IRUser/google-cloud-sdk/bin/bootstrapping/setup.py", line 41, in <module>
reload(google)
ImportError: No module named google
If I manually pip install google, gsutil works fine again. However, I question that this somehow wasn't performed by gcloud components update.
My question: Isn't gcloud components update supposed to take care of any such dependencies?
I'm on CentOS 7.
This issue has been reported https://code.google.com/p/google-cloud-sdk/issues/detail?id=538
"google" package was included in previous releases of cloud sdk, but it is no longer needed.
On python installations (which have protobuf installed) "google" package is auto-imported on the startup the reload of existing google package can fail.
By installing it "google" with pip you made reload stop complaining about the module, even though it is not used.
Alternatively you can apply patches suggested in the above issue log.