Using tensorflow.contrib.data.Dataset in Cloud ML - google-cloud-platform

Recently I changed my data pipeline in TensorFlow from threading to the new Dataset API, which is very convenient when you want to validate your model each epoch.
I've noticed that the current runtime version of TensorFlow in Cloud ML is 1.2. Nevertheless, I tried to use a nightly build of TensorFlow 1.3, but the pip installation fails with:
AssertionError: tensorflow==1.3.0 .dist-info directory not found
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'tensorflow-1.3.0-cp27-none-linux_x86_64.whl']' returned non-zero exit status 2
Has anyone succeeded in using tensorflow.contrib.data.Dataset with the Cloud ML Engine?

This worked for me: create a setup.py file with the following content:
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['tensorflow==1.3.0']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Upgrading tf to 1.3')
More info on the setup.py file is available at: Packaging a Training Application
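For completeness, this setup.py is picked up automatically when the trainer package is staged at job submission. A sketch of the submit command, assuming a trainer/ package directory next to setup.py (the job name, bucket, and module name are placeholders):

gcloud ml-engine jobs submit training my_training_job \
    --package-path trainer/ \
    --module-name trainer.task \
    --staging-bucket gs://my-bucket \
    --region us-central1 \
    --runtime-version 1.2

gcloud builds a source distribution from setup.py and installs it, along with its install_requires (hence tensorflow==1.3.0), on the training workers.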

Related

GCP Composer Airflow - unable to install packages using PyPI

I have created a Composer environment with image version composer-2.0.13-airflow-2.2.5.
When I try to install packages using PyPI, it fails.
Details below:
Command:
gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
requirements.txt
---------------
google-api-core
google-auth
google-auth-oauthlib
google-cloud-bigquery
google-cloud-core
google-cloud-storage
google-crc32c
google-resumable-media
googleapis-common-protos
google-endpoints
joblib
json5
jsonschema
pandas
requests
requests-oauthlib
Error:
Karans-MacBook-Pro:composer_dags karanalang$ gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
Waiting for [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3] to be updated with [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239]...failed.
ERROR: (gcloud.composer.environments.update) Error updating [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3]: Operation [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239] failed: Failed to install PyPI packages. looker-sdk 22.4.0 has requirement attrs>=20.1.0; python_version >= "3.7", but you have attrs 17.4.0.
Check the Cloud Build log at https://console.cloud.google.com/cloud-build/builds/60ac972a-8f5e-4b4f-a4a7-d81049fb19a3?project=939354532596 for details. For detailed instructions see https://cloud.google.com/composer/docs/troubleshooting-package-installation
Please note:
I have an older Composer cluster (Composer version 1.16.8, Airflow version 1.10.15), where the above command works fine.
However, it is not working with the new cluster.
What needs to be done to debug/fix this?
Thanks in advance!
I was able to get this working using the following code:
path = "gs://dataproc-spark-configs/pip_install.sh"

CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()
with models.DAG(
    'Versa-Alarm-Insights-UsingComposer2',
    # Continue to run DAG twice per day
    default_args=default_dag_args,
    schedule_interval='0 0/12 * * *',
    catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="versa-composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG,
    )
The earlier command, which involved installing packages by reading from a file, was working in Composer 1 (Airflow 1.x), but fails with Composer 2.x (Airflow 2.x).
From the error, it is clear that you are running an old version of the attrs package.
Run one of the commands below and try again:
pip install attrs==20.3.0
or
pip install attrs==20.1.0
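Since you do not pip install onto Composer workers directly, the practical equivalent is to pin the newer attrs in the requirements file before re-running the update. A sketch of the one line to add at the top of requirements.txt, keeping the rest of the file unchanged:

attrs==20.3.0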

PyTorch model deployment in AI Platform

I'm deploying a PyTorch model on Google Cloud AI Platform and I'm getting the following error:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
Configuration:
setup.py
from setuptools import setup

REQUIRED_PACKAGES = ['torch']

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=REQUIRED_PACKAGES
)
Model version creation
MODEL_VERSION='v1'
RUNTIME_VERSION='1.15'
MODEL_CLASS='model.PyTorchIrisClassifier'
!gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
--origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI} \
--prediction-class={MODEL_CLASS}
You need to use PyTorch packages compiled to be compatible with Cloud AI Platform. Package information is here:
This bucket contains compiled packages for PyTorch that are compatible with Cloud AI Platform prediction. The files are mirrored from the official builds at https://download.pytorch.org/whl/cpu/torch_stable.html
From the documentation:
In order to deploy a PyTorch model on Cloud AI Platform Online Predictions, you must add one of these packages to the packageURIs field on the version you deploy. Pick the package matching your Python and PyTorch version. The package names follow this template:
Package name = torch-{TORCH_VERSION_NUMBER}-{PYTHON_VERSION}-linux_x86_64.whl
where PYTHON_VERSION = cp35-cp35m for Python 3 with runtime versions < 1.15, and cp37-cp37m for Python 3 with runtime versions >= 1.15.
For example, if I were to deploy a PyTorch model based on PyTorch 1.1.0 and Python 3, my gcloud command would look like:
gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME}
...
--package-uris=gs://{MY_PACKAGE_BUCKET}/my_package-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
In summary:
1) Remove torch from your install_requires dependencies in setup.py (a sketch follows).
2) Include the torch wheel when creating your model version, as in the command below the sketch.
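For step 1, the only change to the setup.py shown above is dropping torch from the requirements; a minimal sketch:

from setuptools import setup

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"]
    # no install_requires=['torch']: torch comes instead from the
    # pre-compiled wheel passed via --package-uris in step 2
)

And for step 2: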
!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/text_classification-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl \
--prediction-class=model_prediction.CustomModelPrediction

How to force Python versions to sync in a Datalab instance spun up from a GCP Dataproc cluster?

I've created a Dataproc cluster in GCP using image 1.2. I want to run Spark from a Datalab notebook. This works fine if I keep the Datalab notebook running Python 2.7 as its kernel, but if I want to use Python 3, I run into a minor version mismatch. I demonstrate the mismatch with the Datalab script below:
### Configuration
import sys, os
sys.path.insert(0, '/opt/panera/lib')

os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

import google.datalab.storage as storage
from io import BytesIO
from pyspark.sql import SparkSession  # needed for the builder below

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") \
    .getOrCreate()

sc = spark.sparkContext
### import libraries
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

### trivial example
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0]),
]
toyModel = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
print(toyModel)
The error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, pan-bdaas-prod-jrl6-w-3.c.big-data-prod.internal, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 124, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Other initialization scripts:
gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh
gs://dataproc-initialization-actions/datalab/datalab.sh
...and scripts that load some of our necessary libraries and utilities
The Python 3 kernel in Datalab is using Python 3.5 rather than Python 3.6.
You could try to set up a 3.6 environment inside of Datalab and then install a new kernelspec for it, but it is probably easier to just configure the Dataproc cluster to use Python 3.5.
The instructions for setting up your cluster to use 3.5 are here.
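A quick sanity check you can run in the notebook before launching a job (a sketch, assuming the spark/sc session from the snippet above): compare the interpreter version on the driver with the one the executors use.

import sys

# Python version on the driver (the Datalab kernel)
print("driver:", sys.version_info[:2])

# Python version on a worker: run a trivial task on an executor
# and have it report its own interpreter version
print("worker:", sc.parallelize([0]).map(lambda _: __import__('sys').version_info[:2]).first())

The two tuples must agree on the minor version (e.g. both (3, 5)), or PySpark raises the error above.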

Issue with importing dependencies while running Dataflow from Google Cloud Composer

I'm running Dataflow from Google Cloud Composer. The Dataflow script contains some non-standard dependencies like zeep and googleads,
which need to be installed on the Dataflow worker nodes, so I packaged them with setup.py. When I try to run this in a DAG, Composer validates the Dataflow files and complains about "No module named zeep, googleads". So I created a PythonVirtualenvOperator, installed all the required non-standard dependencies, and tried to run the Dataflow job again, but it still complained about importing zeep and googleads.
Here is my codebase:
PULL_DATA = PythonVirtualenvOperator(
    task_id=PROCESS_TASK_ID,
    python_callable=execute_dataflow,
    op_kwargs={
        'main': 'main.py',
        'project': PROJECT,
        'temp_location': 'gs://bucket/temp',
        'setup_file': 'setup.py',
        'max_num_workers': 2,
        'output': 'gs://bucket/output',
        'project_id': PROJECT_ID},
    requirements=['google-cloud-storage==1.10.0', 'zeep==3.2.0',
                  'argparse==1.4.0', 'google-cloud-kms==0.2.1',
                  'googleads==15.0.2', 'dill'],
    python_version='2.7',
    use_dill=True,
    system_site_packages=True,
    on_failure_callback=on_failure_handler,
    on_success_callback=on_success_handler,
    dag='my-dag')
And my Python callable code:
def execute_dataflow(**kwargs):
    import subprocess
    TEMPLATED_COMMAND = """
    python main.py \
        --runner DataflowRunner \
        --project {project} \
        --region us-central1 \
        --temp_location {temp_location} \
        --setup_file {setup_file} \
        --output {output} \
        --project_id {project_id}
    """.format(**kwargs)
    process = subprocess.Popen(['/bin/bash', '-c', TEMPLATED_COMMAND])
    process.wait()
    return process.returncode
My main.py file
import zeep
import googleads
{Apache-beam-code to construct dataflow pipeline}
Any suggestions?
My job has a requirements.txt. Rather than using the --setup_file option as yours does, it specifies the following:
--requirements_file prod_requirements.txt
This tells Dataflow to install the libraries in requirements.txt prior to running the job.
Reference: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
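Applied to the question's TEMPLATED_COMMAND, this amounts to swapping the --setup_file flag for --requirements_file, roughly (a sketch):

python main.py \
    --runner DataflowRunner \
    --project {project} \
    --region us-central1 \
    --temp_location {temp_location} \
    --requirements_file prod_requirements.txt \
    --output {output} \
    --project_id {project_id}

where prod_requirements.txt pins zeep, googleads, and the other non-standard dependencies, one per line.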
Using a sample Dataflow pipeline script that imports googleads and zeep, I set up a test Composer environment. The DAG is just like yours, and I get the same error.
Then I make a couple of changes to make sure the dependencies can be found on the worker machines.
In the DAG, I use a plain PythonOperator, not a PythonVirtualenvOperator.
I have my dataflow pipeline and setup file (main.py and setup.py) in a Google Cloud Storage bucket, so Composer can find them.
The setup file has a list of requirements where I need to have e.g. zeep and googleads. I adapted a sample setup file from here, changing this:
REQUIRED_PACKAGES = [
    'google-cloud-storage==1.10.0', 'zeep==3.2.0',
    'argparse==1.4.0', 'google-cloud-kms==0.2.1',
    'googleads==15.0.2', 'dill'
]

# build and CustomCommands are the command classes defined in the
# adapted sample setup file.
setuptools.setup(
    name='Imports test',
    version='1',
    description='Imports test workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)
My DAG is
with models.DAG(
        'composer_sample',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    PULL_DATA = PythonOperator(
        task_id='PULL_DATA',
        python_callable=execute_dataflow,
        op_kwargs={
            'main': '/home/airflow/gcs/data/main.py',
            'project': PROJECT,
            'temp_location': 'gs://dataflow-imports-test/temp',
            'setup_file': '/home/airflow/gcs/data/setup.py',
            'max_num_workers': 2,
            'output': 'gs://dataflow-imports-test/output',
            'project_id': PROJECT_ID})

    PULL_DATA
with no changes to the Python callable. However, with this configuration I still get the error.
As a next step, in the Google Cloud Platform (GCP) console, I go to "Composer" through the navigation menu and click on the environment's name. On the "PyPI packages" tab, I add zeep and googleads and click "Submit". It takes a while to update the environment, but it works.
After this step, my pipeline is able to import the dependencies and run successfully. I also tried running the DAG with the dependencies indicated in the GCP console but not in the requirements of setup.py, and the workflow breaks again, though in different places. So make sure to indicate them in both places.
You need to install the libraries in your Cloud Composer environment (check out this link). There is a way to do it within the console but I find these steps easier:
Open your environments page
Select the actual environment where your Composer is running
Navigate to the PyPI Packages tab
Click edit
Manually add each line of your requirements.txt
Save
You might get an error if the version you provided for a library is too old, so check the logs and update the numbers as needed.
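The same edit can also be scripted with gcloud instead of the console, one package per flag (a sketch; the version pins are illustrative):

gcloud composer environments update $AIRFLOW --location us-east1 \
    --update-pypi-package zeep==3.2.0 \
    --update-pypi-package googleads==15.0.2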

File not found error while testing python impyla

I am trying to set up a connection between Python and Impala. Based on the instructions here, I am trying to set up impyla.
I am on a vagrant ubuntu/xenial64 box with Python 2.7.12. After reading about some issues with the latest thrift, I downgraded to the specified version. After pip installing and setting up the environment variables, I tried to run the test, but it fails with a file not found error, as follows:
ubuntu@ubuntu-xenial:~/.local/lib/python2.7/site-packages/impala$ py.test --connect impyla
======================================== test session starts ========================================
platform linux2 -- Python 2.7.12, pytest-2.8.7, py-1.4.31, pluggy-0.3.1
rootdir: /home/ubuntu/.local/lib/python2.7/site-packages/impala, inifile:
=================================== no tests ran in 0.00 seconds ====================================
ERROR: file not found: impyla
ubuntu@ubuntu-xenial:~/.local/lib/python2.7/site-packages/impala$
I think I am missing something very fundamental here. I have no previous experience with Python tests, and I would like to stick with Python 2.7 because of some other dependencies.
PS:
Please note that the latest version given in the repo readme is 0.13.1, but with my pip install impyla it came out to be 0.14.0.
I am running the test from my site-packages/impala directory, as the readme says to be in the directory where impyla is.
There is no mention of any inifile in the readme (as far as I could see).
EDIT 1:
When I run it from the site-packages directory without --connect, I get the same error. But with the argument, it says that the argument is unrecognized. The output is as follows:
ubuntu@ubuntu-xenial:~/.local/lib/python2.7/site-packages$ py.test --connect impyla
usage: py.test [options] [file_or_dir] [file_or_dir] [...]
py.test: error: unrecognized arguments: --connect
inifile: None
rootdir: /home/ubuntu/.local/lib/python2.7/site-packages
ubuntu@ubuntu-xenial:~/.local/lib/python2.7/site-packages$
Any help on how to troubleshoot this would be appreciated.