PGPy won't work in a GCP Dataflow pipeline - google-cloud-platform

I'm trying to use the PGPy library in a custom GCP Dataflow pipeline implemented with Apache Beam.
Everything works with DirectRunner, but when I deploy the job and execute it on DataflowRunner I get an error where PGPy is used:
ModuleNotFoundError: No module named 'pgpy'
I think I'm missing something with DataflowRunner.
Thank you

In order to manage pipeline dependencies, please refer to:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
My personal preference is to go straight to using setup.py, as it lets you deal with multiple-file dependencies, which tend to come into play once the pipeline gets more complex.
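For the original question, a minimal sketch of such a setup.py might look like this (the project name, version, and the unpinned pgpy requirement are assumptions, not something from the question):

# setup.py -- minimal sketch declaring the pipeline's PyPI dependencies
import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',   # placeholder project name
    version='0.0.1',
    install_requires=[
        'pgpy',                    # the module missing on the Dataflow workers
    ],
    packages=setuptools.find_packages(),
)

You would then point the pipeline at it when launching, e.g. python my_pipeline.py --runner DataflowRunner --setup_file ./setup.py (my_pipeline.py is a placeholder; the other required Dataflow options are omitted here).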

Related

How to import pipeline.gocd.yaml from github repo to build gocd pipeline

I'm quite new to GoCD. I have a pipeline.gocd.yaml in my git repo, in which I have defined my pipeline. Is there a way I can import this into my GoCD server (through the agent) to build the pipeline?
I can't seem to find a way. Any help will be much appreciated.
You can use the config repository plugin, which scans the repo for any *.gocd.yaml files and automatically creates the pipelines, groups, configuration, etc.
https://github.com/tomzo/gocd-yaml-config-plugin

Accessing Airflow REST API in AWS Managed Workflows?

I have Airflow running in AWS MWAA. I would like to access the REST API, and there are two ways to do this, but neither seems to work for me.
1. Overriding api.auth_backend. This used to work, but AWS MWAA now won't allow you to add this option; it is considered blocklisted and not allowed.
api.auth_backend = airflow.api.auth.backend.default
2. Using the MWAA CLI (Python). This doesn't work if any of the DAGs use packages that are in the requirements.txt file.
a. As an example, I have "paramiko" in requirements.txt because I have a task that uses SSHOperator. The MWAA CLI fails with "no module paramiko".
b. Also noted here, https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html
"Any command that parses a DAG (such as list_dags, backfill) will fail if the DAG uses plugins that depend on packages that are installed through requirements.txt."
We are using MWAA 2.0.2 and managed to use Airflow's REST API through the MWAA CLI, basically following the instructions and sample code in the Apache Airflow CLI command reference. You'll notice that not all REST API calls are supported, but many of them are (even when you have a requirements.txt in place).
Also have a look at the AWS sample code on GitHub.
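A minimal sketch of that CLI-over-HTTP flow in Python (the environment name and the example command are assumptions; consult the AWS documentation linked above for which commands your MWAA version supports):

# Call the MWAA CLI endpoint with a short-lived token (sketch).
import base64
import boto3
import requests

ENV_NAME = 'my-mwaa-environment'   # placeholder MWAA environment name

# Request a CLI token for the environment.
mwaa = boto3.client('mwaa')
token = mwaa.create_cli_token(Name=ENV_NAME)

# POST an Airflow CLI command to the environment's CLI endpoint.
resp = requests.post(
    'https://{}/aws_mwaa/cli'.format(token['WebServerHostname']),
    headers={
        'Authorization': 'Bearer {}'.format(token['CliToken']),
        'Content-Type': 'text/plain',
    },
    data='dags list',              # example Airflow 2.x CLI command
)

# stdout and stderr come back base64-encoded in the JSON response.
result = resp.json()
print(base64.b64decode(result['stdout']).decode('utf-8'))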

NotImplementedError in Google Cloud Dataflow and PubSub

I encountered an error when I was testing Dataflow locally on my computer. I intended to use the streaming service from Dataflow and got a "NotImplementedError". The detail of the error is like this:
I thought it might be caused by some package versions. The following is the list of dependencies in the setup.py file.
'google-api-core==1.4.1',
'google-auth==1.5.1',
'google-cloud-core==0.28.1',
'google-cloud-storage==1.10.0',
'google-resumable-media==0.3.1',
'googleapis-common-protos==1.5.3',
'librosa==0.6.2',
'wave==0.0.2',
'scipy==1.1.0',
'google-api-python-client==1.7.4',
'oauth2client==4.1.2',
'resampy==0.2.1',
'keen==0.5.1',
'google-cloud-bigquery==1.5.0',
'apache-beam[gcp]==2.5.0',
'google-cloud-dataflow==2.5.0',
'six==1.10.0',
'google-cloud-logging==1.7.0'
Could anyone help me solve this problem?
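For reference, those pins would normally sit in the install_requires list of setuptools.setup inside setup.py, roughly like this (the project name and version below are placeholders):

# setup.py -- sketch of where the pinned dependencies above live
import setuptools

setuptools.setup(
    name='streaming-audio-pipeline',   # placeholder name
    version='0.0.1',
    install_requires=[
        'apache-beam[gcp]==2.5.0',
        'google-cloud-storage==1.10.0',
        # ... the remaining pins from the list above ...
    ],
    packages=setuptools.find_packages(),
)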

How can I run a Dataflow pipeline with a setup file using Cloud Composer/Apache Airflow?

I have a working Dataflow pipeline that first runs setup.py to install some local helper modules. I now want to use Cloud Composer/Apache Airflow to schedule the pipeline. I've created my DAG file and placed it in the designated Google Storage DAG folder along with my pipeline project. The folder structure looks like this:
{Composer-Bucket}/
    dags/
        --DAG.py
        Pipeline-Project/
            --Pipeline.py
            --setup.py
            Module1/
                --__init__.py
            Module2/
                --__init__.py
            Module3/
                --__init__.py
The part of my DAG that specifies the setup.py file looks like this:
resumeparserop = dataflow_operator.DataFlowPythonOperator(
    task_id="resumeparsertask",
    py_file="gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/Pipeline.py",
    dataflow_default_options={
        "project": {PROJECT-NAME},
        "setup_file": "gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py"})
However, when I look at the logs in the Airflow Web UI, I get the error:
RuntimeError: The file gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py cannot be found. It was specified in the --setup_file command line option.
I am not sure why it is unable to find the setup file. How can I run my Dataflow pipeline with the setup file/modules?
If you look at the code for DataflowPythonOperator it looks like the main py_file can be a file inside of a GCS bucket and is localized by the operator prior to executing the pipeline. However, I do not see anything like that for the dataflow_default_options. It appears that the options are simply copied and formatted.
Since the GCS DAGs folder is mounted on the Airflow instances using Cloud Storage FUSE, you should be able to access the file locally via the "dags_folder" configuration value.
i.e. you could do something like this:
import os
from airflow import configuration
...
LOCAL_SETUP_FILE = os.path.join(
    configuration.get('core', 'dags_folder'), 'Pipeline-Project', 'setup.py')
You can then use the LOCAL_SETUP_FILE variable for the setup_file property in the dataflow_default_options.
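With that in place, the operator from the question could reference the local path rather than the GCS URI, roughly like this (the contrib import is assumed from the question's usage, and the bucket/project placeholders are kept as-is):

from airflow.contrib.operators import dataflow_operator

resumeparserop = dataflow_operator.DataFlowPythonOperator(
    task_id="resumeparsertask",
    py_file="gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/Pipeline.py",
    dataflow_default_options={
        "project": "{PROJECT-NAME}",       # placeholder from the question
        "setup_file": LOCAL_SETUP_FILE})   # local path on the Cloud Storage FUSE mount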
Do you run Composer and Dataflow with the same service account, or are they separate? In the latter case, have you checked whether Dataflow's service account has read access to the bucket and object?

nltk dependencies in dataflow

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually, on a local run of the script, I can just run
nltk.download('stopwords')
nltk.download('punkt')
and these files will be available to the script. How do I do this so that the files are also available to the worker scripts? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
According to this, Java allows the staging of resources via classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional python resources. I just need nltk to find its files.
You can probably use '--setup_file setup.py' to run these custom commands. https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies . Does this work in your case?
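A sketch of that approach, loosely based on the custom-commands pattern shown in the documentation linked above (the package name, the build_py hook, and the download target directory are assumptions, not something from the question):

# setup.py -- sketch: download nltk data on each worker when the package is installed
import subprocess
import setuptools
from setuptools.command.build_py import build_py as _build_py

# Commands to run before the normal build step on every worker.
CUSTOM_COMMANDS = [
    # /usr/share/nltk_data is one of nltk's default search locations.
    ['python', '-m', 'nltk.downloader', '-d', '/usr/share/nltk_data',
     'stopwords', 'punkt'],
]

class build_py(_build_py):
    """Run the custom download commands, then the normal build."""
    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)
        _build_py.run(self)

setuptools.setup(
    name='nltk-dataflow-pipeline',     # placeholder name
    version='0.0.1',
    install_requires=['nltk'],
    packages=setuptools.find_packages(),
    cmdclass={'build_py': build_py},
)

Launching with --setup_file ./setup.py stages this package so that each worker installs it, and therefore runs the downloads, once per worker, which is what the question is after.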