Advice/Guidance - composer/beam/dataflow on gcp - google-cloud-platform

I am trying to learn/try out Cloud Composer, Beam and Dataflow on GCP.
I have written functions to do some basic cleaning of data in Python, and used a DAG in Cloud Composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke, hand-written functionality. I am now trying to figure out how to use a Beam pipeline and Dataflow instead, and use Cloud Composer to kick off the Dataflow job.
The cleaning I am trying to do is to take a CSV input of col1,col2,col3,col4,col5 and combine the middle three columns to output a CSV of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a Beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a DAG?
Does anyone have any example code on GitHub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.

You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python from your DAG.
You have to create your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG in the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime

import pytz
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"

tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')

default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}

with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:

    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file='gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, with the modification you want in order to concatenate the middle columns:
dataflow-job.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime
import pytz

tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')

bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'

p = beam.Pipeline(options=PipelineOptions())

(p | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
   | beam.Map(lambda x: x.split(","))
   | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
   | beam.io.WriteToText(output))

p.run().wait_until_finish()
After running, the result will be stored in the GCS bucket.
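If you would rather plug in your own function than inline lambdas (your first question), you can pass any plain Python function to beam.Map. Here is a minimal sketch, assuming the same five-column input and a hypothetical merge_columns helper:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def merge_columns(line, delimiter=","):
    """Combine columns 2-4 of a CSV line into a single column."""
    cols = line.split(delimiter)
    return delimiter.join([cols[0], cols[1] + cols[2] + cols[3], cols[4]])

with beam.Pipeline(options=PipelineOptions()) as p:
    (p | 'Read from a File' >> beam.io.ReadFromText('gs://<bucket>/inputFile.txt', skip_header_lines=1)
       | 'Merge middle columns' >> beam.Map(merge_columns)  # your own function plugged into the pipeline
       | 'Write' >> beam.io.WriteToText('gs://<bucket>/output'))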

A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
import apache_beam as beam

def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
Many examples can be found in the Beam repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the DataFrames API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
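To make that concrete, here is a minimal sketch of the DataFrames-API route, assuming the file has a header row naming the columns col1..col5 (read as strings) and hypothetical bucket paths:
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # read_csv returns a deferred dataframe; column names come from the header row
    df = p | read_csv('gs://<bucket>/inputFile.csv')
    # concatenate the middle three columns, as you would with Pandas
    df['combinedcol234'] = df.col2 + df.col3 + df.col4
    # keep only the wanted columns and write the result back out as CSV
    df[['col1', 'combinedcol234', 'col5']].to_csv('gs://<bucket>/output.csv')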

Related

loop through multiple tables from source to s3 using glue (Python/Pyspark) through configuration file?

I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are present in a configuration file, which is a JSON file. It would be helpful to have code that can loop through multiple table names and ingest these tables into S3. The Glue script is written in Python (PySpark).
This is a sample of how the configuration file looks:
{
    "main_key": {
        "source_type": "rdbms",
        "source_schema": "DATABASE",
        "source_table": "DATABASE.Table_1"
    }
}
This assumes your Glue job can connect to the database and a Glue Connection has been added to it. Here's a sample extracted from my script that does something similar; you would need to update the JDBC URL format so that it works for your database (this one uses SQL Server) and fill in the implementation details for fetching the config file, looping through items, etc.
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Placeholders: fill in your own host, port and database name
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};databaseName={db_name}"
connection_details = {
    "user": 'db_user',
    "password": 'db_password',
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Implement this to fetch and parse your JSON configuration file from S3
tables_config = get_tables_config_from_s3_as_dict()
date_partition = datetime.today().strftime('%Y%m%d')
write_date_partition = f'year={date_partition[0:4]}/month={date_partition[4:6]}/day={date_partition[6:8]}'

for key, value in tables_config.items():
    table = value['source_table']
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_details)
    write_path = f's3a://bucket-name/{table}/{write_date_partition}'
    df.write.parquet(write_path)
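The config-fetching helper is left unimplemented above; a minimal sketch of what get_tables_config_from_s3_as_dict might look like, assuming a hypothetical bucket and key for the JSON file:
import json
import boto3

def get_tables_config_from_s3_as_dict():
    """Download the JSON configuration file from S3 and return it as a dict."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-config-bucket", Key="config/tables_config.json")  # hypothetical location
    return json.loads(obj["Body"].read())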
Just write a normal for loop to loop through your DB configuration, then follow the Spark JDBC documentation to connect to each of them in sequence.

AWS Glue job to unzip a file from S3 and write it back to S3

I'm very new to AWS Glue, and I want to use AWS Glue to unzip a huge file present in an S3 bucket and write the contents back to S3.
I couldn't find anything while trying to google this requirement.
My questions are:
How to add a zip file as data source to AWS Glue?
How to write it back to same S3 location?
I am using AWS Glue Studio. Any help will be highly appreciated.
If you are still looking for a solution, you're able to unzip a file and write it back with an AWS Glue job by using boto3 and Python's zipfile library.
A thing to consider is the size of the zip that you want to process. I've used the following script with a 6 GB (zipped) / 30 GB (unzipped) file and it works fine, but it might fail if the file is too heavy for the worker to buffer.
import sys
import io
from zipfile import ZipFile

import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

s3 = boto3.client("s3")

bucket = "wayfair-datasource"  # your s3 bucket name
prefix = "files/location/"  # the prefix for the objects that you want to unzip
unzip_prefix = "files/unzipped_location/"  # the location where you want to store your unzipped files

# Get a list of all the resources in the specified prefix
objects = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix
)["Contents"]

# The following will get the unzipped files so the job doesn't try to unzip a file
# that is already unzipped on every run
unzipped_objects = s3.list_objects(
    Bucket=bucket,
    Prefix=unzip_prefix
).get("Contents", [])  # may be empty on the first run

# Get a list containing the keys of the objects to unzip
object_keys = [o["Key"] for o in objects if o["Key"].endswith(".zip")]
# Get the keys for the unzipped objects
unzipped_object_keys = [o["Key"] for o in unzipped_objects]

for key in object_keys:
    obj = s3.get_object(
        Bucket=bucket,
        Key=key
    )
    objbuffer = io.BytesIO(obj["Body"].read())

    # using context manager so you don't have to worry about manually closing the file
    with ZipFile(objbuffer) as zip:
        filenames = zip.namelist()
        # iterate over every file inside the zip
        for filename in filenames:
            with zip.open(filename) as file:
                filepath = unzip_prefix + filename
                if filepath not in unzipped_object_keys:
                    s3.upload_fileobj(file, bucket, filepath)

job.commit()
I couldn't find anything while trying to google this requirement.
You couldn't find anything about this because this is not what Glue does. Glue can read gzip (not zip) files natively. If you have zip files, then you have to convert them yourself in S3; Glue will not do it.
To convert the files, you can download them, re-pack them in gzip format (or any other format that Glue supports), and re-upload them.
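A minimal sketch of that conversion, assuming hypothetical bucket and key names and files small enough to buffer in memory:
import gzip
import io
from zipfile import ZipFile

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"           # hypothetical bucket
zip_key = "raw/archive.zip"    # hypothetical zip object

# Download the zip into memory, re-compress each member as gzip and upload it back
zipped = io.BytesIO(s3.get_object(Bucket=bucket, Key=zip_key)["Body"].read())
with ZipFile(zipped) as archive:
    for name in archive.namelist():
        gz_buffer = io.BytesIO()
        with archive.open(name) as member, gzip.GzipFile(fileobj=gz_buffer, mode="wb") as gz:
            gz.write(member.read())
        gz_buffer.seek(0)
        s3.upload_fileobj(gz_buffer, bucket, f"converted/{name}.gz")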

Error when creating Google Dataflow template file

I'm trying to schedule a Dataflow job that ends after a set amount of time using a template. I'm able to do this successfully when using the command line, but when I try to do it with Google Cloud Scheduler I run into an error when I create my template.
The error is
File "pipelin_stream.py", line 37, in <module>
    main()
File "pipelin_stream.py", line 34, in main
    result.cancel()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1638, in cancel
    raise IOError('Failed to get the Dataflow job id.')
IOError: Failed to get the Dataflow job id.
The command I'm using to make the template is
python pipelin_stream.py \
--runner DataflowRunner \
--project $PROJECT \
--temp_location $BUCKET/tmp \
--staging_location $BUCKET/staging \
--template_location $BUCKET/templates/time_template_test \
--streaming
And the pipeline file I have is this
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import sys

PROJECT = 'projectID'
schema = 'ex1:DATE, ex2:STRING'
TOPIC = "projects/topic-name/topics/scraping-test"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions(region='us-central1', service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('tablename'.format(PROJECT), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    result.wait_until_finish(duration=3000)
    result.cancel()  # If the pipeline has not finished, you can cancel it

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Does anyone have an idea why I might be getting this error?
The error is raised by the cancel function after the waiting time, and it appears to be harmless.
To prove it, I managed to reproduce your exact issue from my virtual machine with Python 3.5. The template is created in the path given by --template_location and can be used to run jobs. Note that I needed to apply some changes to your code to get it to actually work in Dataflow.
In case it is of any use to you, I ended up using this pipeline code:
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import datetime

# Fill these values in order to have them by default
# Note that the table in BQ needs to have the column names message_body and publish_time
Table = 'projectid:datasetid.tableid'
schema = 'message_body:STRING, publish_time:TIMESTAMP'  # matches the dict yielded by AddTimestamps below
TOPIC = "projects/<projectid>/topics/<topicname>"

class AddTimestamps(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        """Processes each incoming element by extracting the Pub/Sub
        message and its publish timestamp into a dictionary. `publish_time`
        defaults to the publish timestamp returned by the Pub/Sub server. It
        is bound to each element by Beam at runtime.
        """
        yield {
            "message_body": element.decode("utf-8"),
            "publish_time": datetime.datetime.utcfromtimestamp(
                float(publish_time)
            ).strftime("%Y-%m-%d %H:%M:%S.%f"),
        }

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic", default=TOPIC)
    parser.add_argument("--output_table", default=Table)
    args, beam_args = parser.parse_known_args(argv)

    # save_main_session needs to be set to true due to modules being used among the code (mostly datetime)
    # Uncomment the service account email to specify a custom service account
    p = beam.Pipeline(argv=beam_args, options=PipelineOptions(save_main_session=True,
                      region='us-central1'))  # , service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=args.input_topic).with_output_types(bytes)
     | "Add timestamps to messages" >> beam.ParDo(AddTimestamps())
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(args.output_table, schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()

    # Warning: cancel does not work properly in a template
    result.wait_until_finish(duration=3000)
    result.cancel()  # Cancel the streaming pipeline after a while to avoid consuming more resources

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Afterwards I ran commands:
# Fill accordingly
PROJECT="MYPROJECT-ID"
BUCKET="MYBUCKET"
TEMPLATE_NAME="TRIAL"
# create the template
python3 -m templates.template-pubsub-bigquery \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/$TEMPLATE_NAME \
--streaming
to create the pipeline (which yields the error you mentioned but still creates the template).
And
# Fill job-name and gcs location accordingly
# Uncomment and fill the parameters should you want to use your own
gcloud dataflow jobs run <job-name> \
--gcs-location "gs://<MYBUCKET>/dataflow/templates/mytemplate"
# --parameters input_topic="", output_table=""
to run the pipeline.
As I said, the template was properly created and the pipeline worked properly.
Edit
Indeed, the cancel function does not work properly in the template. It seems to be an issue with it needing the job ID, which of course does not exist at template creation time, so the call fails with the error you saw.
I found this other post that handles extracting the job ID in the pipeline. I tried some tweaks to make it work within the template code itself, but I think it is not necessary. Given that you want to schedule the execution, I would go for the easier option: execute the streaming pipeline template at a certain time (e.g. 9:01 GMT) and cancel the pipeline at another time (e.g. 9:05 GMT) with this script:
import logging
import os
import re

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def retrieve_job_id():
    # Fill as needed
    project = '<project-id>'
    job_prefix = "<job-name>"
    location = '<location>'

    logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))
    try:
        credentials = GoogleCredentials.get_application_default()
        dataflow = build('dataflow', 'v1b3', credentials=credentials)

        result = dataflow.projects().locations().jobs().list(
            projectId=project,
            location=location,
        ).execute()

        job_id = "none"
        for job in result['jobs']:
            if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
                job_id = job['id']
                break
        logging.info("Job ID: {}".format(job_id))
        return job_id
    except Exception as e:
        logging.info("Error retrieving Job ID")
        raise KeyError(e)

os.system('gcloud dataflow jobs cancel {}'.format(retrieve_job_id()))
This script assumes you are running the pipeline with the same job name each time; it takes the latest appearance of that name and cancels it. I tried it several times and it works fine.
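If you would rather start the template from Python as well (for example from a Cloud Function triggered by Cloud Scheduler), here is a minimal sketch using the same discovery-based Dataflow client; the project, bucket and job names are placeholders:
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def launch_template():
    # Fill as needed
    project = '<project-id>'
    location = 'us-central1'
    template_path = 'gs://<MYBUCKET>/templates/TRIAL'

    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials)

    # Launch a job from the template stored in GCS
    response = dataflow.projects().locations().templates().launch(
        projectId=project,
        location=location,
        gcsPath=template_path,
        body={
            "jobName": "<job-name>",
            # "parameters": {"input_topic": "...", "output_table": "..."},
        },
    ).execute()
    return response['job']['id']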

Accessing Templated Runtime Parameters in Google Cloud Dataflow - Python

I am experimenting with creating my own Templates for Google Cloud Dataflow, so that jobs can be executed from the GUI, making it easier for others to execute. I have followed the tutorials, created my own class of PipelineOptions, and populated it with the parser.add_value_provider_argument() method. When I then try to pass these arguments into the pipeline, using my_options.argname.get(), I get an error telling me the item is not called from a runtime context. I don't understand this. The args aren't part of defining the pipeline graph itself; they are just parameters such as input filename, output tablename, etc.
Below is the code. It works if I hardcode the input filename, output tablename, writeDisposition, and delimiter. If I replace these with their my_options.argname.get() equivalent, it fails. In the snippet shown, I have hardcoded everything except the outputBQTable name, where I use my_options.outputBQTable.get(). This fails with the following message.
apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: outputBQTable, type: str, default_value: 'dataflow_csv_reader_testing.names').get() not called from a runtime context
I appreciate any guidance on how to get this to work.
import apache_beam
from apache_beam.io.gcp.gcsio import GcsIO
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.value_provider import RuntimeValueProvider
import csv
import argparse

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--inputGCS', type=str,
            default='gs://mybucket/df-python-csv-test/test-dict.csv',
            help='Input gcs csv file, full path and filename')
        parser.add_value_provider_argument('--delimiter', type=str,
            default=',',
            help='Character used as delimiter in csv file, default is ,')
        parser.add_value_provider_argument('--outputBQTable', type=str,
            default='dataflow_csv_reader_testing.names',
            help='Output BQ Dataset.Table to write to')
        parser.add_value_provider_argument('--writeDisposition', type=str,
            default='WRITE_APPEND',
            help='BQ write disposition, WRITE_TRUNCATE or WRITE_APPEND or WRITE_EMPTY')

def main():
    optlist = PipelineOptions()
    my_options = optlist.view_as(MyOptions)
    p = apache_beam.Pipeline(options=optlist)
    (p
     | 'create' >> apache_beam.Create(['gs://mybucket/df-python-csv-test/test-dict.csv'])
     | 'read gcs csv dict' >> apache_beam.FlatMap(lambda file: csv.DictReader(apache_beam.io.gcp.gcsio.GcsIO().open(file, 'r'), delimiter='|'))
     | 'write bq record' >> apache_beam.io.Write(apache_beam.io.BigQuerySink(my_options.outputBQTable.get(), write_disposition='WRITE_TRUNCATE'))
    )
    p.run()

if __name__ == '__main__':
    main()
You cannot use my_options.outputBQTable.get() when specifying the pipeline. The BigQuery sink already knows how to use runtime-provided arguments, so I think you can just pass my_options.outputBQTable.
From what I gather from the documentation, you should only use options.runtime_argument.get() in the process methods of the DoFns you pass to ParDo steps.
Note: I tested with 2.8.0 of the Apache Beam SDK and so I used WriteToBigQuery instead of BigQuerySink.
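To illustrate that suggestion, here is a minimal sketch assuming Beam 2.8.0+, the MyOptions class from the question, and an output table that already exists (the step labels are only illustrative):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    optlist = PipelineOptions()
    my_options = optlist.view_as(MyOptions)  # MyOptions as defined in the question

    with beam.Pipeline(options=optlist) as p:
        (p
         | 'read' >> beam.io.ReadFromText(my_options.inputGCS)  # ReadFromText accepts a ValueProvider directly
         | 'write' >> beam.io.WriteToBigQuery(
               my_options.outputBQTable,  # pass the ValueProvider itself, without calling .get()
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # assumes the table already exists
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))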
This is a feature yet to be developed for the Python SDK.
The related open issue can be found at the Apache Beam project page.
Until the above issue is solved, the workaround for now would be to use the Java SDK.

Looking for a boto3 python example of injecting an aws pig step into an already running emr?

I'm looking for a good boto3 example of an already running AWS EMR cluster into which I wish to inject a Pig step. Previously, I used the boto 2.42 version of:
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep

# AWS_ACCESS_KEY = '' # REQUIRED
# AWS_SECRET_KEY = '' # REQUIRED
# conn = EmrConnection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
# loop next element on bucket_compare list
pig_file = 's3://elasticmapreduce/samples/pig-apache/do-reports2.pig'
INPUT = 's3://elasticmapreduce/samples/pig-apache/input/access_log_1'
OUTPUT = ''  # REQUIRED, S3 bucket for job output

pig_args = ['-p', 'INPUT=%s' % INPUT,
            '-p', 'OUTPUT=%s' % OUTPUT]
pig_step = PigStep('Process Reports', pig_file, pig_args=pig_args)
steps = [InstallPigStep(), pig_step]

conn.run_jobflow(name='prs-dev-test', steps=steps,
                 hadoop_version='2.7.2-amzn-2', ami_version='latest',
                 num_instances=2, keep_alive=False)
The main problem now is that boto3 doesn't provide from boto.emr.connection import EmrConnection, nor from boto.emr.step import InstallPigStep, PigStep, and I can't find an equivalent set of modules.
After a bit of checking, I've found a very simple way to inject Pig script commands from within Python using the awscli and subprocess modules. One can import awscli and subprocess, and then encapsulate and inject the desired Pig steps into an already running EMR with:
import awscli  # not strictly needed; the command below shells out to the aws CLI
import subprocess

cmd = 'aws emr add-steps --cluster-id j-GU07FE0VTHNG --steps Type=PIG,Name="AggPigProgram",ActionOnFailure=CONTINUE,Args=[-f,s3://dev-end2end-test/pig_scripts/AggRuleBag.pig,-p,INPUT=s3://dev-end2end-test/input_location,-p,OUTPUT=s3://end2end-test/output_location]'

push = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
out, _ = push.communicate()  # wait for the CLI call to finish before checking the return code
print(push.returncode)
Of course, you'll have to find your JobFlowId (cluster ID) using something like:
aws emr list-clusters --active
using the same subprocess and Popen approach as above. Of course, you can add monitoring to your heart's delight instead of just a print statement.
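If you would rather do that lookup from Python too, here is a minimal sketch using boto3 (the name filter is only an illustration):
import boto3

emr = boto3.client("emr")

# List clusters that are currently up (starting, running or waiting)
clusters = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])["Clusters"]

# Pick the cluster you want, e.g. by name
cluster_id = next(c["Id"] for c in clusters if c["Name"] == "prs-dev-test")
print(cluster_id)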
Here is how to add a new step to an existing EMR cluster job flow for a Pig job using boto3.
Note: your script, log file, input and output directories should have the complete path, in the format
's3://<bucket>/<directory>/<file_or_key>'
import boto3

emrcon = boto3.client("emr")
cluster_id1 = cluster_status_file_content  # Retrieved from S3, where it was recorded on creation

step_id = emrcon.add_job_flow_steps(
    JobFlowId=str(cluster_id1),
    Steps=[{
        'Name': str(pig_job_name),
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig', "-l", str(pig_log_file_full_path),
                     "-f", str(pig_job_run_script_full_path),
                     "-p", "INPUT=" + str(pig_input_dir_full_path),
                     "-p", "OUTPUT=" + str(pig_output_dir_full_path)]
        }
    }]
)
You can then monitor the step from the Steps tab of the EMR console.
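If you prefer to monitor from code rather than the console, here is a minimal sketch that polls the step status with boto3, reusing emrcon, step_id and cluster_id1 from the snippet above:
import time

# add_job_flow_steps returns the IDs of the submitted steps
submitted_step_id = step_id['StepIds'][0]

while True:
    status = emrcon.describe_step(ClusterId=str(cluster_id1), StepId=submitted_step_id)
    state = status['Step']['Status']['State']
    print("Step state:", state)
    if state in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(30)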