AWS DataPipeline Machine Learning AMI tensorflow issues - amazon-web-services

I'm running the AWS Machine Learning AMI on an EC2 instance. I've confirmed that, from the terminal, both python and jupyter can run
import tensorflow as tf
along with
python pytest.py
from the terminal (which contains the above tensorflow import), with no issues.
I'm now trying to automate my script using DataPipeline along with TaskRunner. The bash command in DataPipeline is again, just:
python pytest.py
However, I immediately get the following error:
Traceback (most recent call last):
File "pytest.py", line 1, in <module>
import tensorflow as tf
File "/usr/lib/python2.7/dist-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/usr/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 72, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 61, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
_pywrap_tensorflow = swig_import_helper()
File "/usr/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.5: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#import_error
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
It seems like AWS DataPipeline (or TaskRunner?) uses a different environment setup, because, again, I have no issues running the script through an ssh terminal to the instance. I found a few posts which suggested adding cuda to the LD_LIBRARY_PATH, but the AMI instance already has it:
echo $LD_LIBRARY_PATH
/home/ec2-user/src/torch/install/lib:/home/ec2-user/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/home/ec2-user/src/mxnet/mklml_lnx_2017.0.1.20161005/lib:
which clearly contains the cuda library path that tensorflow needs.
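One way to test the "different environment" suspicion is to inspect LD_LIBRARY_PATH from inside the process that DataPipeline actually launches, rather than from the ssh shell. The sketch below is only an illustration: the helper names are hypothetical, and the directory /usr/local/cuda/lib64 is taken from the echo output above.

```python
import os


def missing_library_dirs(required_dirs, env=None):
    """Return the required directories that are absent from LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    current = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    return [d for d in required_dirs if d not in current]


def env_with_library_dirs(required_dirs, env=None):
    """Return a copy of env whose LD_LIBRARY_PATH also lists required_dirs."""
    env = dict(os.environ if env is None else env)
    missing = missing_library_dirs(required_dirs, env)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    env["LD_LIBRARY_PATH"] = ":".join(parts + missing)
    return env
```

Calling missing_library_dirs(["/usr/local/cuda/lib64"]) at the top of a script run by the pipeline would show whether the cuda directory is visible there. Because the dynamic loader reads LD_LIBRARY_PATH when a process starts, the adjusted environment would have to be passed to a child process (for example through the env argument of subprocess.run) rather than set inside the already running interpreter.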

Related

Installation failed with standard install instructions

Following the instructions in the documentation, I attempt to create my new project, and get the following error:
EDIT: I should not post questions late at night. Added more detail to the terminal output. Prior to this, I verified pip was upgraded, djangocms-installer is installed and the virtualenv was installed.
(djangoenv) [ec2-user#web01 ~]$ djangocms jbi
Creating the project
Please wait while I install dependencies
If I am stuck for a long time, please check for connectivity / PyPi issues
Dependencies installed
Creating the project
The installation has failed.
*****************************************************************
Check documentation at https://djangocms-installer.readthedocs.io
*****************************************************************
Traceback (most recent call last):
File "/home/ec2-user/djangoenv/bin/djangocms", line 8, in <module>
sys.exit(execute())
File "/home/ec2-user/djangoenv/lib/python3.7/site-packages/djangocms_installer/main.py", line 44, in execute
django.setup_database(config_data)
File "/home/ec2-user/djangoenv/lib/python3.7/site-packages/djangocms_installer/django/__init__.py", line 353, in setup_database
output = subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
File "/usr/lib64/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/usr/lib64/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/ec2-user/djangoenv/bin/python', '-W', 'ignore', 'manage.py', 'migrate']' returned non-zero exit status 1.
I'm using Django inside a virtual environment created on an AWS EC2 instance running Amazon Linux 2. I'm OK with burning the instance down and using another distro if the issue is the distro.
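The traceback above only reports that the migrate command exited with status 1; the real error is in the output that check_output captured. A minimal sketch for surfacing it, assuming the same command is re-run by hand from the project directory (the helper names are hypothetical, and the command list mirrors the one in the traceback):

```python
import subprocess
import sys


def migrate_command(python=sys.executable):
    """Build the command shown in the traceback for a given interpreter."""
    return [python, "-W", "ignore", "manage.py", "migrate"]


def run_and_capture(command):
    """Run command and return (exit_code, combined stdout/stderr as text)."""
    try:
        output = subprocess.check_output(command, stderr=subprocess.STDOUT)
        return 0, output.decode(errors="replace")
    except subprocess.CalledProcessError as exc:
        # exc.output holds whatever the failing command printed.
        return exc.returncode, exc.output.decode(errors="replace")
```

Running run_and_capture(migrate_command("/home/ec2-user/djangoenv/bin/python")) from the directory that contains manage.py and printing the second element should show the underlying Django error.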

Training keras model in AWS Sagemaker

I have a keras training script on my machine. I am experimenting with running it in an AWS SageMaker container. For that I used the code below.
from sagemaker.tensorflow import TensorFlow

est = TensorFlow(
    entry_point="caller.py",
    source_dir="./",
    role='role_arn',
    framework_version="2.3.1",
    py_version="py37",
    instance_type='ml.m5.large',
    instance_count=1,
    hyperparameters={'batch': 8, 'epochs': 10},
)
est.fit()
Here caller.py is my entry point. After executing the above code I get an error saying keras is not installed. Here is the stack trace.
Traceback (most recent call last):
File "executor.py", line 14, in <module>
est.fit()
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 682, in fit
self.latest_training_job.wait(logs=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 1625, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3681, in logs_for_job
self._check_job_status(job_name, description, "TrainingJobStatus")
File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3240, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-06-09-07-14-01-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 caller.py --batch 4 --epochs 10
ModuleNotFoundError: No module named 'keras'
Which instance has pre-installed keras?
Is there any way I can install the python package in the AWS container, or any workaround for the issue?
Note: I have tried uploading my own container to ECR and ran my code successfully. I am looking for this capability in AWS's existing containers.
Keras is now part of tensorflow, so you can just reformat your code to use tf.keras instead of keras. Since version 2.3.0 of tensorflow they are in sync, so it should not be that difficult.
Your container is this; as you can see from the list of packages, it does not include Keras.
If you instead want to extend a pre-built container, you can take a look here, but I don't recommend it for this specific use case: for future code maintainability as well, you should go for tf.keras.
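Switching to tf.keras is mostly a matter of changing import lines. The helper below is a rough sketch, not an official migration tool: the function name is hypothetical, it only handles plain "import keras" and "from keras... import ..." lines, and anything else (aliases such as "import keras as k", multi-line imports) would still need editing by hand.

```python
import re

# "import keras" on a line of its own (optionally indented).
_IMPORT_RE = re.compile(r"^(\s*)import keras\s*$")
# "from keras import X" or "from keras.sub.module import X".
_FROM_RE = re.compile(r"^(\s*)from keras(\.[\w.]+)? import (.+)$")


def rewrite_keras_imports(source):
    """Rewrite standalone-keras import lines to their tensorflow.keras form."""
    rewritten = []
    for line in source.splitlines():
        match = _FROM_RE.match(line)
        if match:
            indent, submodule, names = match.groups()
            rewritten.append(
                f"{indent}from tensorflow.keras{submodule or ''} import {names}"
            )
            continue
        match = _IMPORT_RE.match(line)
        if match:
            rewritten.append(f"{match.group(1)}from tensorflow import keras")
            continue
        rewritten.append(line)
    return "\n".join(rewritten)
```

For example, "from keras.layers import Dense" becomes "from tensorflow.keras.layers import Dense", while unrelated lines are left unchanged.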

"ImportError: No module named idlelib" when running Google Dataflow worker

I have a python 2.7 script I run locally to launch an Apache Beam / Google Dataflow job (SDK 2.12.0). The job takes a csv file from a Google storage bucket, processes it and then creates an entity in Google Datastore for each row. The script ran fine for years, but now it is failing:
INFO:root:2019-05-15T22:07:11.481Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-05-15T21:47:13.370Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 773, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 280, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 410, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 827, in _import_module
return __import__(import_name)
ImportError: No module named idlelib
I believe this error is happening at the worker level (not locally). I don't reference idlelib anywhere in my script. To make sure it wasn't me, I installed updates for all google-cloud packages, apache-beam[gcp] etc. locally, just in case. I also tried importing idlelib in my script, but I get the same error. Any suggestions?
It has been fine for years and started failing from SDK 2.12.0 release.
What was the last release that this script succeeded on? 2.11?
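The traceback shows the failure inside _load_main_session, where dill re-imports modules recorded in the pickled main session. One way to look for the source of idlelib is to list the modules bound in the launcher script's global namespace just before the pipeline is submitted. This is a sketch with a hypothetical helper name, and it only covers names bound directly in the namespace; modules reached through other objects would not show up.

```python
import types


def modules_in_namespace(namespace):
    """Return {variable_name: module_name} for module objects in namespace."""
    return {
        name: value.__name__
        for name, value in namespace.items()
        if isinstance(value, types.ModuleType)
    }
```

Calling modules_in_namespace(globals()) in the launcher script and looking for idlelib (or anything unexpected) in the values may show which import pulls it into the session.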

Can not import a manual package into Google Cloud ML Engine

I created my own package to be used in the Google Cloud ML Engine job.
In order to use my package, I follow the instructions in the official Google Cloud documentation.
That is, I archived my package into a tar.gz and uploaded it to a Cloud Storage bucket.
Next, I started my job and got the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.5/tokenize.py", line 454, in open
buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-iprehs_0/setup.py
Please advise me on the procedure for importing custom packages, or share a link that shows how this is done.
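The error shows pip looking for setup.py in the unpacked build directory and not finding it, so one thing worth checking is whether the archive contains a setup.py where pip would look. The sketch below is only a diagnostic under an assumption: it treats a setup.py at the archive root or inside a single top-level directory (the usual source-distribution layout) as acceptable. The function name is hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def find_setup_py(archive_path):
    """Return member paths named setup.py at the root or one directory deep."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [
            name
            for name in archive.getnames()
            if PurePosixPath(name).name == "setup.py"
            and len(PurePosixPath(name).parts) <= 2
        ]
```

An empty result for the uploaded tar.gz would suggest the package was archived without a setup.py, or with it nested too deeply.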

AWS command line tools broken : (

I tried to install awscli after ebcli, and they both broke. Currently, if I type aws s3 ls, it just hangs with no response, and if I try to use eb, I get this error:
Traceback (most recent call last):
File "/usr/local/bin/eb", line 11, in <module>
load_entry_point('awsebcli==3.8.4', 'console_scripts', 'eb')()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 565, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2291, in load
return self.resolve()
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2297, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python2.7/dist-packages/ebcli/core/ebcore.py", line 43, in <module>
from . import ebglobals, base, io, hooks
File "/usr/local/lib/python2.7/dist-packages/ebcli/core/base.py", line 19, in <module>
from ebcli import __version__
ImportError: cannot import name __version__
I basically need to have command line tools for s3 and elastic beanstalk, but I apparently have no luck, and will be spending my entire day googling the universe, and combing through error codes to try and fix this : (
I'm on Ubuntu 14.04 on a Thinkpad.
It is quite common for different Python libraries to install over each other, causing problems like this.
A popular fix is to use the virtualenv tool to create isolated Python environments.
The AWS documentation for awsebcli has a page showing how: Install the EB CLI in a Virtual Environment
Alternatively, keep using the AWS Command-Line Interface (CLI) since it works across all AWS services, rather than using service-specific command sets like awsebcli (which pre-date the CLI).
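As a rough illustration of the isolation idea, the sketch below uses Python 3's standard venv module as a stand-in for the virtualenv tool mentioned above (the question itself uses Python 2.7, where virtualenv would be the equivalent). The function names are hypothetical, and the install command is only built, not run.

```python
import os
import venv


def create_isolated_env(env_dir, with_pip=True):
    """Create a virtual environment and return the path to its interpreter."""
    venv.create(env_dir, with_pip=with_pip)
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


def pip_install_command(env_python, package):
    """Build (but do not run) the command that installs package into the env."""
    return [env_python, "-m", "pip", "install", package]
```

Creating one environment for awsebcli and another for awscli, then running each install command with subprocess, would keep the two tools' dependencies from overwriting each other.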