I'm running Dataflow from Google Cloud Composer. The Dataflow script has some non-standard dependencies, such as zeep and googleads, which need to be installed on the Dataflow worker nodes, so I packaged them with setup.py.
When I try to run this in a DAG, Composer validates the Dataflow files and complains "No module named zeep / googleads". So I created a PythonVirtualenvOperator, installed all the required non-standard dependencies in it, and tried to run the Dataflow job, but it still complains about importing zeep and googleads.
Here is my codebase:
PULL_DATA = PythonVirtualenvOperator(
    task_id=PROCESS_TASK_ID,
    python_callable=execute_dataflow,
    op_kwargs={
        'main': 'main.py',
        'project': PROJECT,
        'temp_location': 'gs://bucket/temp',
        'setup_file': 'setup.py',
        'max_num_workers': 2,
        'output': 'gs://bucket/output',
        'project_id': PROJECT_ID},
    requirements=['google-cloud-storage==1.10.0', 'zeep==3.2.0',
                  'argparse==1.4.0', 'google-cloud-kms==0.2.1',
                  'googleads==15.0.2', 'dill'],
    python_version='2.7',
    use_dill=True,
    system_site_packages=True,
    on_failure_callback=on_failure_handler,
    on_success_callback=on_success_handler,
    dag='my-dag')
and my python callable code:
def execute_dataflow(**kwargs):
    import subprocess
    TEMPLATED_COMMAND = """
    python main.py \
        --runner DataflowRunner \
        --project {project} \
        --region us-central1 \
        --temp_location {temp_location} \
        --setup_file {setup_file} \
        --output {output} \
        --project_id {project_id}
    """.format(**kwargs)
    process = subprocess.Popen(['/bin/bash', '-c', TEMPLATED_COMMAND])
    process.wait()
    return process.returncode
My main.py file
import zeep
import googleads
{Apache-beam-code to construct dataflow pipeline}
Any suggestions?
My job has a requirements.txt. Rather than using the --setup_file option as yours does, it specifies the following:
--requirements_file prod_requirements.txt
This tells Dataflow to install the libraries listed in prod_requirements.txt on the workers before running the job.
Reference: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
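For illustration, this is roughly how the invocation from the question might look with that flag instead of --setup_file (a sketch only; it assumes prod_requirements.txt is staged next to main.py):
python main.py \
    --runner DataflowRunner \
    --project {project} \
    --region us-central1 \
    --temp_location {temp_location} \
    --requirements_file prod_requirements.txt \
    --output {output} \
    --project_id {project_id}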
Using a sample Dataflow pipeline script that imports googleads and zeep, I set up a test Composer environment. The DAG is just like yours, and I get the same error.
Then I make a couple of changes, to make sure the dependencies can be found on the worker machines.
In the DAG, I use a plain PythonOperator, not a PythonVirtualenvOperator.
I have my dataflow pipeline and setup file (main.py and setup.py) in a Google Cloud Storage bucket, so Composer can find them.
The setup file has a list of requirements where I need to have e.g. zeep and googleads. I adapted a sample setup file from here, changing this:
REQUIRED_PACKAGES = [
    'google-cloud-storage==1.10.0', 'zeep==3.2.0',
    'argparse==1.4.0', 'google-cloud-kms==0.2.1',
    'googleads==15.0.2', 'dill'
]

setuptools.setup(
    name='Imports test',
    version='1',
    description='Imports test workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)
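The snippet above references build and CustomCommands from the Beam sample it was adapted from. For completeness, here is a minimal sketch of that surrounding boilerplate (the CUSTOM_COMMANDS list and class bodies are illustrative, not the exact sample code):
import subprocess

import setuptools
from distutils.command.build import build as _build


class build(_build):
    # Extend the standard build command so the custom commands below also run.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# Shell commands to run on the workers at install time; empty here, illustrative only.
CUSTOM_COMMANDS = []


class CustomCommands(setuptools.Command):
    # Runs each entry of CUSTOM_COMMANDS during pip install on the worker.
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)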
My DAG is
with models.DAG(
        'composer_sample',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    PULL_DATA = PythonOperator(
        task_id='PULL_DATA',
        python_callable=execute_dataflow,
        op_kwargs={
            'main': '/home/airflow/gcs/data/main.py',
            'project': PROJECT,
            'temp_location': 'gs://dataflow-imports-test/temp',
            'setup_file': '/home/airflow/gcs/data/setup.py',
            'max_num_workers': 2,
            'output': 'gs://dataflow-imports-test/output',
            'project_id': PROJECT_ID})

    PULL_DATA
with no changes to the Python callable. However, with this configuration I still get the error.
Next step, in the Google Cloud Platform (GCP) console, I go to "Composer" through the navigation menu, and then click on the environment's name. On the tab "PyPI packages", I add zeep and googleads, and click "submit". It takes a while to update the environment, but it works.
After this step, my pipeline is able to import the dependencies and run successfully. I also tried running the DAG with the dependencies indicated on the GCP console but not in the requirements of setup.py, and the workflow breaks again, but in different places. So make sure to indicate them in both places.
You need to install the libraries in your Cloud Composer environment (check out this link). There is a way to do it within the console but I find these steps easier:
Open your environments page
Select the actual environment where your Composer is running
Navigate to the PyPI Packages tab
Click edit
Manually add each line of your requirements.txt
Save
You might get an error if the version you provided for a library is too old, so check the logs and update the numbers, as needed.
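If you prefer the command line, the same can be done with gcloud (environment name and location below are placeholders; exact flags may vary by gcloud version):
gcloud composer environments update my-environment \
    --location us-central1 \
    --update-pypi-packages-from-file requirements.txt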
Related
I want to use a custom pypi repo for my Dataflow workers. Typically, to configure a custom pypi repo, you would edit /etc/pip.conf to look like this:
[global]
index-url = https://pypi.customer.com/
Since I can't run a startup script for Dataflow workers, my thought was to perform this operation at the head of my setup.py file, so that as the script executes, it would update /etc/pip.conf before attempting a pip install of the dependencies.
My setup.py looks like the following:
import setuptools

with open('/etc/pip.conf', 'w') as pip_conf:
    pip_conf.write("""
[global]
index-url = https://artifactory.mayo.edu/artifactory/api/pypi/pypi-remote/simple
""")

REQUIRED_PACKAGES = [
    'custom_package',
]

setuptools.setup(
    name='wordcount',
    version='0.0.1',
    description='demo package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages())
The odd thing is that my workers are hanging indefinitely. When I ssh into them, I see some Docker containers running, but I am not sure how to debug further.
Any suggestions on how I can hack the Dataflow workers to use a custom pypi url?
This is likely a good candidate for custom containers, where you can install everything exactly as you want rather than having to hack the worker startup sequence.
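For reference, a custom container would bake the pip.conf (and ideally the pre-installed dependencies) into the worker image, and the pipeline would be pointed at that image. A rough sketch of the invocation, where the image URL is a placeholder and the exact flag name depends on your Beam SDK version:
python main.py \
    --runner DataflowRunner \
    --project my-project \
    --region us-central1 \
    --temp_location gs://my-bucket/temp \
    --setup_file ./setup.py \
    --sdk_container_image gcr.io/my-project/beam-custom-pypi:latest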
Does anyone know of a way to persist configurations done using "gcloud init" commands inside cloudshell, so they don't vanish each time you disconnect?
I figured out how to persist Python pip installs by using the --user flag.
example: pip install --user pandas
But, when I create a new configuration using gcloud init, use it for a bit, close cloudshell (or cloudshell times out on me), then reconnect later, the configurations are gone.
Not a big deal, I bounce between projects/etc so it's nice to have the configs saved so I can simply run
gcloud config configurations activate config-name
Thanks...Rich Murnane
Google Cloud Shell only persists data in your $HOME directory. Commands like gcloud init modify the environment variables and store configuration files in /tmp which is deleted when the VM is restarted. The VM is terminated after being idle for 20 minutes or 60 minutes depending on which document you read.
Google Cloud Shell is a Docker container. You can modify the docker image to customize to fit your needs. This method will allow you to install packages, tools, etc that are not located in your $HOME directory.
You can also store your files and configuration scripts on Google Cloud Storage. Modify .bashrc to download your cloud files and run your configuration script.
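For example, a hypothetical line in ~/.bashrc could pull a saved configuration script from a bucket and run it on every new session (bucket and script names are made up):
gsutil cp gs://my-config-bucket/bootstrap.sh "$HOME/bootstrap.sh" && source "$HOME/bootstrap.sh"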
Either method will allow you to create a persistent environment.
This StackOverflow answer covers in detail what gcloud init does and how to basically emulate the same thing via script or command line.
gcloud init details
This isn't exactly what I wanted, but since my account (userid) isn't changing, I'm simply going to run the command
gcloud config set project second-project-name
good enough, thanks...Rich
Google Cloud Composer uses Cloud Storage to store Apache Airflow DAGs. However, where are the operators stored? I am getting the error below:
Broken DAG: [/home/airflow/gcs/dags/example_pubsub_flow.py] cannot import name PubSubSubscriptionCreateOperator.
This operator was added in Airflow 1.10.0. As of today, Cloud Composer is still using Airflow 1.9.0, hence this operator is not available yet. You can add it as a plugin.
Apparently, according to this message in the Composer Google Group list, installing a contrib operator as a plugin does not require adding the plugin boilerplate.
It is enough to register the plugin via this command:
gcloud beta composer environments storage plugins import --environment dw --location us-central1 --source=custom_operators.py
See here for detail.
The drawback is that if your contrib operator uses other contrib modules, you will have to copy those as well and modify the way they are imported in Python, using:
from my_custom_operator import MyCustomOperator
instead of:
from airflow.contrib.operators.my_custom_operator import MyCustomOperator
I have a working Dataflow pipeline that first runs setup.py to install some local helper modules. I now want to use Cloud Composer/Apache Airflow to schedule the pipeline. I've created my DAG file and placed it in the designated Google Storage DAG folder along with my pipeline project. The folder structure looks like this:
{Composer-Bucket}/
    dags/
        --DAG.py
        Pipeline-Project/
            --Pipeline.py
            --setup.py
            Module1/
                --__init__.py
            Module2/
                --__init__.py
            Module3/
                --__init__.py
The part of my DAG that specifies the setup.py file looks like this:
resumeparserop = dataflow_operator.DataFlowPythonOperator(
    task_id="resumeparsertask",
    py_file="gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/Pipeline.py",
    dataflow_default_options={
        "project": {PROJECT-NAME},
        "setup_file": "gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py"})
However, when I look at the logs in the Airflow Web UI, I get the error:
RuntimeError: The file gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py cannot be found. It was specified in the --setup_file command line option.
I am not sure why it is unable to find the setup file. How can I run my Dataflow pipeline with the setup file/modules?
If you look at the code for DataflowPythonOperator, it looks like the main py_file can be a file inside a GCS bucket and is localized by the operator prior to executing the pipeline. However, I do not see anything like that for the dataflow_default_options. It appears that the options are simply copied and formatted.
Since the GCS dag folder is mounted on the Airflow instances using Cloud Storage FUSE, you should be able to access the file locally using the "dags_folder" configuration setting.
i.e. you could do something like this:
import os

from airflow import configuration
....

LOCAL_SETUP_FILE = os.path.join(
    configuration.get('core', 'dags_folder'), 'Pipeline-Project', 'setup.py')
You can then use the LOCAL_SETUP_FILE variable for the setup_file property in the dataflow_default_options.
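For example, the operator from the question might then look like this (a sketch that keeps the question's placeholders):
resumeparserop = dataflow_operator.DataFlowPythonOperator(
    task_id="resumeparsertask",
    py_file="gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/Pipeline.py",
    dataflow_default_options={
        "project": {PROJECT-NAME},
        "setup_file": LOCAL_SETUP_FILE})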
Do you run Composer and Dataflow with the same service account, or are they separate? In the latter case, have you checked whether Dataflow's service account has read access to the bucket and object?
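One quick way to check from Cloud Shell whether the Dataflow service account can read the bucket (using the question's placeholder for the bucket name):
gsutil iam get gs://{COMPOSER-BUCKET}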
I'm trying to run my machine learning code for images using TensorFlow in Google Cloud ML Engine. However, it seems the submitted job can't access my files in my Cloud Shell or in GCS. Even though it works fine on my local machine, I get the following error once I submit my job using the gcloud command from Cloud Shell:
ERROR 2017-12-19 13:52:28 +0100 service IOError: [Errno 2] No such file or directory: '/home/user/pores-project-googleML/trainer/train.txt'
This file definitely exists in Cloud Shell, and I can check it when I type:
ls /home/user/pores-project-googleML/trainer/train.txt
I tried putting my file train.txt in GCS and accessing it from my code (by specifying the path gs://my_bucket/my_path), but once the job was submitted, I got a 'No such file or directory' error with the corresponding path.
To check where the job I submitted using gcloud is running, I added print(os.getcwd()) at the beginning of my Python code trainer/task.py, which printed /user_dir in the logs. I couldn't find this path using Cloud Shell, not even in GCS. So my question is: how can I know on which machine my job is running? If it's in a certain container somewhere, how can I access my files from it using Cloud Shell and GCS?
Before I did all of this, I successfully completed the 'Image Classification using Flowers Dataset' tutorial.
The command I used to submit my job is:
gcloud ml-engine jobs submit training $JOB_NAME --job-dir $JOB_DIR --packages trainer-0.1.tar.gz --module-name $MAIN_TRAINER_MODULE --region us-central1
where:
TRAINER_PACKAGE_PATH=/home/user/pores-project-googleML/trainer
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://pores/AlexNet_CloudML/job_dir/"
JOB_NAME="census$(date +"%Y%m%d_%H%M%S")"
The regular Python IO library is not able to access files on GCS. Instead, you need to use the GCS Python client or the gsutil CLI to access GCS files.
Note that TensorFlow itself has native support of GCS (i.e., it can read GCS files directly).
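For instance, TensorFlow's file API can open gs:// paths directly; a minimal sketch, reusing the bucket path from the question:
import tensorflow as tf

# tf.gfile (tf.io.gfile in newer releases) understands gs:// paths natively.
with tf.gfile.GFile('gs://my_bucket/my_path/train.txt') as f:
    lines = f.read().splitlines()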