Error when creating Google Dataflow template file - google-cloud-platform

I'm trying to use a template to schedule a Dataflow job that ends after a set amount of time. I can do this successfully from the command line, but when I try to do it with Google Cloud Scheduler, I get an error while creating the template.
The error is
File "pipelin_stream.py", line 37, in <module>
main()
File "pipelin_stream.py", line 34, in main
result.cancel()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1638, in cancel
raise IOError('Failed to get the Dataflow job id.')
IOError: Failed to get the Dataflow job id.
The command I'm using to make the template is
python pipelin_stream.py \
--runner Dataflowrunner \
--project $PROJECT \
--temp_location $BUCKET/tmp \
--staging_location $BUCKET/staging \
--template_location $BUCKET/templates/time_template_test \
--streaming
And the pipeline file I have is this
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import sys
PROJECT = 'projectID'
schema = 'ex1:DATE, ex2:STRING'
TOPIC = "projects/topic-name/topics/scraping-test"
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)
    p = beam.Pipeline(options=PipelineOptions(region='us-central1', service_account_email='email'))
    (p
        | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
        | 'Decode' >> beam.Map(lambda x:x.decode('utf-8'))
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('tablename'.format(PROJECT), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    result.wait_until_finish(duration=3000)
    result.cancel() # If the pipeline has not finished, you can cancel it
if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Does anyone have an idea why I might be getting this error?

The error is raised by the cancel function after the waiting time, and it appears to be harmless.
To prove it, I reproduced your exact issue from my virtual machine with Python 3.5. The template is created in the path given by --template_location and can be used to run jobs. Note that I needed to make some changes to your code to get it to actually work in Dataflow.
In case it is of any use to you, I ended up using this pipeline code
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import datetime
# Fill in these values to use them as defaults
# Note that the table in BQ needs to have the column names message_body and publish_time
Table = 'projectid:datasetid.tableid'
schema = 'ex1:STRING, ex2:TIMESTAMP'
TOPIC = "projects/<projectid>/topics/<topicname>"
class AddTimestamps(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        """Processes each incoming element by extracting the Pub/Sub
        message and its publish timestamp into a dictionary. `publish_time`
        defaults to the publish timestamp returned by the Pub/Sub server. It
        is bound to each element by Beam at runtime.
        """
        yield {
            "message_body": element.decode("utf-8"),
            "publish_time": datetime.datetime.utcfromtimestamp(
                float(publish_time)
            ).strftime("%Y-%m-%d %H:%M:%S.%f"),
        }
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic", default=TOPIC)
    parser.add_argument("--output_table", default=Table)
    args, beam_args = parser.parse_known_args(argv)
    # save_main_session needs to be True because the code relies on modules imported at module level (mostly datetime)
    # Uncomment the service account email to specify a custom service account
    p = beam.Pipeline(argv=beam_args,options=PipelineOptions(save_main_session=True,
        region='us-central1'))#, service_account_email='email'))
    (p
        | 'ReadData' >> beam.io.ReadFromPubSub(topic=args.input_topic).with_output_types(bytes)
        | "Add timestamps to messages" >> beam.ParDo(AddTimestamps())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(args.output_table, schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    # Warning: cancel does not work properly in a template
    result.wait_until_finish(duration=3000)
    result.cancel() # Cancel the streaming pipeline after a while to avoid consuming more resources
if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Afterwards I ran these commands:
# Fill accordingly
PROJECT="MYPROJECT-ID"
BUCKET="MYBUCKET"
TEMPLATE_NAME="TRIAL"
# create the template
python3 -m templates.template-pubsub-bigquery \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/$TEMPLATE_NAME \
--streaming
to create the template (this yields the error you mentioned, but the template is still created), and
# Fill in the job name and GCS location accordingly
# Uncomment and fill in the parameters should you want to use your own
gcloud dataflow jobs run <job-name> \
--gcs-location "gs://<MYBUCKET>/dataflow/templates/mytemplate"
# --parameters input_topic="",output_table=""
to run the pipeline.
As I said, the template was created properly and the pipeline worked.
Edit
Indeed, the cancel function does not work properly in a template. It seems to need the job ID at template creation time, which of course does not exist yet, and as a result the call is omitted from the template.
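One way to avoid the traceback at template creation is to skip the wait/cancel step whenever the script is only staging a template. The following is a minimal sketch, assuming the template build is requested through the `--template_location` flag as in the commands in this thread; the helper name `is_template_build` is hypothetical and not part of Beam.

```python
import argparse


def is_template_build(argv):
    """Return True when the command line asks Beam to stage a template.

    Assumption: a template build is signalled by --template_location,
    as in the commands shown in this thread.
    """
    # allow_abbrev=False so that unrelated flags such as --temp_location
    # can never be mistaken for an abbreviation of --template_location.
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--template_location", default=None)
    known, _unknown = parser.parse_known_args(argv)
    return known.template_location is not None


# Intended use inside main(), after `result = p.run()`:
#     if not is_template_build(sys.argv[1:]):
#         result.wait_until_finish(duration=3000)
#         result.cancel()
```

This only silences the error while the template is being built. Jobs started from the template still have to be cancelled from outside the pipeline.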
I found this other post that handles extracting the job ID in the pipeline. I tried some tweaks to make it work within the template code itself, but I don't think that is necessary. Given that you want to schedule the executions, I would go for the easier option: start the streaming pipeline template at a certain time (e.g. 9:01 GMT) and cancel the pipeline with this script
import logging, re,os
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
def retrieve_job_id():
    #Fill as needed
    project = '<project-id>'
    job_prefix = "<job-name>"
    location = '<location>'
    logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))
    try:
        credentials = GoogleCredentials.get_application_default()
        dataflow = build('dataflow', 'v1b3', credentials=credentials)
        result = dataflow.projects().locations().jobs().list(
            projectId=project,
            location=location,
        ).execute()
        job_id = "none"
        for job in result['jobs']:
            if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
                job_id = job['id']
                break
        logging.info("Job ID: {}".format(job_id))
        return job_id
    except Exception as e:
        logging.info("Error retrieving Job ID")
        raise KeyError(e)
os.system('gcloud dataflow jobs cancel {}'.format(retrieve_job_id()))
at another time (e.g. 9:05 GMT). This script assumes you are running the script with the same job name each time and takes the latest appearance of the name and cancels it. I tried it several times and it works fine.
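The selection step can also be written as a small pure function, which makes the "latest job with this name" assumption explicit and easy to test without calling the API. This is only a sketch: it assumes each entry of the `jobs` list carries the `id`, `name`, `createTime` and `currentState` fields of the Dataflow `v1b3` Job resource, and the function name and the restriction to running jobs are illustrative choices, not part of the script above.

```python
def latest_running_job_id(jobs, job_prefix):
    """Return the id of the newest running job whose name starts with job_prefix.

    Returns None when no job matches.
    """
    candidates = [
        job for job in jobs
        if job.get("name", "").startswith(job_prefix)
        and job.get("currentState") == "JOB_STATE_RUNNING"
    ]
    if not candidates:
        return None
    # createTime is an RFC 3339 UTC string; assuming all entries use the same
    # format, lexicographic order matches chronological order.
    newest = max(candidates, key=lambda job: job.get("createTime", ""))
    return newest["id"]
```

Returning `None` instead of the string `"none"` lets the caller skip the `gcloud dataflow jobs cancel` call when nothing is running.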

Related

Advice/Guidance - composer/beam/dataflow on gcp

I am trying to learn/try out cloud composer/beam/dataflow on gcp.
I have written functions to do some basic cleaning of data in python, and used a DAG in cloud composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke functionality. I am now trying to figure out how to use a Beam pipeline and Dataflow instead, with Cloud Composer kicking off the Dataflow job.
The cleaning I am trying to do, is take a csv input of col1,col2,col3,col4,col5 and combine the middle 3 columns to output a csv of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a dag?
Does anyone have any example code on git hub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python.
You have to instantiate your cloud composer environment;
Add the dataflow job file in a bucket;
Add the input file to a bucket;
Add the following dag in the DAGs directory of the composer environment:
composer_dataflow_dag.py:
import datetime
import pytz
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago
bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"
tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')
default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}
with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1), # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file=f'gs://<bucket name>/dataflow-job.py',
        options={"runner":"DataflowRunner","project":project_id,"region":"us-central1" ,"temp_location":"gs://<bucket name>/", "staging_location":"gs://<bucket name>/"},
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, with the modification you want, which concatenates the data of some columns:
dataflow-job.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import os
from datetime import datetime
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')
bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'
p = beam.Pipeline(options=PipelineOptions())
( p | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
    | beam.Map(lambda x:x.split(","))
    | beam.Map(lambda x:f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
    | beam.io.WriteToText(output) )
p.run().wait_until_finish()
After the job runs, the result is stored in the GCS bucket.
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
'''
import apache_beam as beam

def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
'''
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/. The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
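For the merge itself, an ordinary Python function is enough, and it can then be passed to `beam.Map`. Below is a minimal sketch that works on one CSV line at a time, assuming exactly five columns and plain concatenation of the middle three; the function name `merge_middle_columns` is made up for illustration. It uses the standard `csv` module so that quoted fields containing commas are handled, which a bare `split(",")` would not do.

```python
import csv
import io


def merge_middle_columns(line):
    """Turn 'c1,c2,c3,c4,c5' into 'c1,<c2 c3 c4 concatenated>,c5'."""
    row = next(csv.reader([line]))
    if len(row) != 5:
        raise ValueError("expected 5 columns, got {}".format(len(row)))
    merged = [row[0], "".join(row[1:4]), row[4]]
    out = io.StringIO()
    # lineterminator="" because WriteToText adds the newline itself.
    csv.writer(out, lineterminator="").writerow(merged)
    return out.getvalue()
```

In a pipeline this would sit between the read and write steps, e.g. `... | beam.Map(merge_middle_columns) | beam.io.WriteToText(output)`.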

How can I get list of all cloud SQL ( GCP ) instances which are stopped in python, I am using google cloud api for this purpose

from googleapiclient import discovery
PROJECT = 'gcp-test-1234'
sql_client = discovery.build('sqladmin', 'v1beta4')
resp = sql_client.instances().list(project=PROJECT).execute()
print(resp)
But in the response, I get the state "RUNNABLE" for stopped instances. How can I verify programmatically whether an instance is running or stopped?
I have also checked gcloud sql instances describe gcp-test-1234-test-db, which reports the state as "STOPPED".
How can I achieve this programmatically using Python?
In the REST API, RUNNABLE in the state field means that the instance is running, or has been stopped by the owner, as stated here.
You need to read from the activationPolicy field, where ALWAYS means your instance is running and NEVER means it is stopped. Something like the following will work:
from pprint import pprint
from googleapiclient import discovery
service = discovery.build('sqladmin', 'v1beta4')
project = 'gcp-test-1234'
instance = 'gcp-test-1234-test-db'
request = service.instances().get(project=project,instance=instance)
response = request.execute()
pprint(response['settings']['activationPolicy'])
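To answer the original question (listing all stopped instances), the same rule can be applied to every entry of an `instances().list()` response. This is a sketch under the assumption that the response is a dict with an `items` list whose entries carry `name`, `state` and `settings.activationPolicy`; the two helper names are hypothetical.

```python
def effective_state(instance):
    """Treat activationPolicy NEVER as STOPPED, otherwise report the API state."""
    policy = instance.get("settings", {}).get("activationPolicy")
    if policy == "NEVER":
        return "STOPPED"
    return instance.get("state")


def stopped_instance_names(list_response):
    """Names of the stopped instances in an instances().list() response dict."""
    return [
        inst["name"]
        for inst in list_response.get("items", [])
        if effective_state(inst) == "STOPPED"
    ]
```

With the client from the snippet above, this would be called as `stopped_instance_names(service.instances().list(project=project).execute())`.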
Another option would be to use the Cloud SDK command directly from your Python file:
import os
os.system("gcloud sql instances describe gcp-test-1234-test-db | grep state | awk {'print $2'}")
Or with subprocess:
import subprocess
subprocess.run("gcloud sql instances describe gcp-test-1234-test-db | grep state | awk {'print $2'}", shell=True)
Note that when you run gcloud sql instances describe your-instance --log-http on a stopped instance, the API response contains "state": "RUNNABLE", yet the gcloud command shows the status STOPPED. This is because, when the state is RUNNABLE, the command derives the status it displays from the activationPolicy of the API response rather than from the state field.
If you want to check the piece of code that translates the activationPolicy to the status, you can see it in the SDK. The gcloud tool is written in Python:
cat $(gcloud info --format "value(config.paths.sdk_root)")/lib/googlecloudsdk/api_lib/sql/instances.py|grep "class DatabaseInstancePresentation(object)" -A 17
You'll see the following:
class DatabaseInstancePresentation(object):
    """Represents a DatabaseInstance message that is modified for user visibility."""
    def __init__(self, orig):
        for field in orig.all_fields():
            if field.name == 'state':
                if orig.settings and orig.settings.activationPolicy == messages.Settings.ActivationPolicyValueValuesEnum.NEVER:
                    self.state = 'STOPPED'
                else:
                    self.state = orig.state
            else:
                value = getattr(orig, field.name)
                if value is not None and not (isinstance(value, list) and not value):
                    if field.name in ['currentDiskSize', 'maxDiskSize']:
                        setattr(self, field.name, six.text_type(value))
                    else:
                        setattr(self, field.name, value)

Spark-BigTable - HBase client not closed in Pyspark?

I'm trying to execute a PySpark statement that writes to Bigtable within a Python for loop, which leads to the following error (the job is submitted using Dataproc). Is some client not being closed properly (as suggested here) and, if so, is there a way to close it in PySpark?
Note that manually re-executing the script each time with a new Dataproc job works fine, so the job itself is correct.
Thanks for your support!
Pyspark script
from pyspark import SparkContext
from pyspark.sql import SQLContext
import json
sc = SparkContext()
sqlc = SQLContext(sc)
def create_df(n_start,n_stop):
    # Data
    row_1 = ['a']+['{}'.format(i) for i in range(n_start,n_stop)]
    row_2 = ['b']+['{}'.format(i) for i in range(n_start,n_stop)]
    # Spark schema
    ls = [row_1,row_2]
    schema = ['col0'] + ['col{}'.format(i) for i in range(n_start,n_stop)]
    # Catalog
    first_col = {"col0":{"cf":"rowkey", "col":"key", "type":"string"}}
    other_cols = {"col{}".format(i):{"cf":"cf", "col":"col{}".format(i), "type":"string"} for i in range(n_start,n_stop)}
    first_col.update(other_cols)
    columns = first_col
    d_catalogue = {}
    d_catalogue["table"] = {"namespace":"default", "name":"testtable"}
    d_catalogue["rowkey"] = "key"
    d_catalogue["columns"] = columns
    catalog = json.dumps(d_catalogue)
    # Dataframe
    df = sc.parallelize(ls, numSlices=1000).toDF(schema=schema)
    return df,catalog
for i in range(0,2):
    N_step = 100
    N_start = 1
    N_stop = N_start+N_step
    data_source_format = "org.apache.spark.sql.execution.datasources.hbase"
    df,catalog = create_df(N_start,N_stop)
    df.write\
        .options(catalog=catalog,newTable= "5")\
        .format(data_source_format)\
        .save()
    N_start += N_step
    N_stop += N_step
Dataproc job
gcloud dataproc jobs submit pyspark <my_script>.py \
--cluster $SPARK_CLUSTER \
--jars <path_to_jar>/bigtable-dataproc-spark-shc-assembly-0.1.jar \
--region=us-east1
Error
...
ERROR com.google.bigtable.repackaged.io.grpc.internal.ManagedChannelOrphanWrapper: *~*~*~ Channel ManagedChannelImpl{logId=41, target=bigtable.googleapis.com:443} was not shutdown properly!!! ~*~*~*
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
...
If you are not using the latest version, try updating to it. This looks similar to this issue, which was fixed recently. I would expect the error message to still show up; if the job now finishes despite it, the support team is presumably still working on the message, and hopefully it will be fixed in the next release.

Dataflow stops streaming to BigQuery without errors

We started using Dataflow to read from Pub/Sub and stream to BigQuery.
The Dataflow job should run 24/7, because Pub/Sub is constantly updated with analytics data from multiple websites around the world.
The code looks like this:
from __future__ import absolute_import
import argparse
import json
import logging
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
logger = logging.getLogger()
TABLE_IDS = {
    'table_1': 0,
    'table_2': 1,
    'table_3': 2,
    'table_4': 3,
    'table_5': 4,
    'table_6': 5,
    'table_7': 6,
    'table_8': 7,
    'table_9': 8,
    'table_10': 9,
    'table_11': 10,
    'table_12': 11,
    'table_13': 12
}
def separate_by_table(element, num):
    return TABLE_IDS[element.get('meta_type')]
class ExtractingDoFn(beam.DoFn):
    def process(self, element):
        yield json.loads(element)
def run(argv=None):
    """Main entry point; defines and runs the wordcount pipeline."""
    logger.info('STARTED!')
    parser = argparse.ArgumentParser()
    parser.add_argument('--topic',
                        dest='topic',
                        default='projects/PROJECT_NAME/topics/TOPICNAME',
                        help='Gloud topic in form "projects/<project>/topics/<topic>"')
    parser.add_argument('--table',
                        dest='table',
                        default='PROJECTNAME:DATASET_NAME.event_%s',
                        help='Gloud topic in form "PROJECT:DATASET.TABLE"')
    known_args, pipeline_args = parser.parse_known_args(argv)
    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)
    lines = p | ReadFromPubSub(known_args.topic)
    datas = lines | beam.ParDo(ExtractingDoFn())
    by_table = datas | beam.Partition(separate_by_table, 13)
    # Create a stream for each table
    for table, id in TABLE_IDS.items():
        by_table[id] | 'write to %s' % table >> WriteToBigQuery(known_args.table % table)
    result = p.run()
    result.wait_until_finish()
if __name__ == '__main__':
    logger.setLevel(logging.INFO)
    run()
It works fine, but after some time (2-3 days) it stops streaming for some reason.
When I check the job status, the logs section contains no errors (you know, the ones marked with a red "!" in Dataflow's job details). If I cancel the job and run it again, it starts working again as usual.
If I check Stackdriver for additional logs, here's all Errors that happened:
Here's some warnings that occur periodically while job executes:
Details of one of them:
{
insertId: "397122810208336921:865794:0:479132535"
jsonPayload: {
exception: "java.lang.IllegalStateException: Cannot be called on unstarted operation.
at com.google.cloud.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.getElementsSent(RemoteGrpcPortWriteOperation.java:111)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.updateProgress(BeamFnMapTaskExecutor.java:293)
at com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor$SingularProcessBundleProgressTracker.periodicProgressUpdate(BeamFnMapTaskExecutor.java:280)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"
job: "2018-11-30_10_35_19-13557985235326353911"
logger: "com.google.cloud.dataflow.worker.fn.control.BeamFnMapTaskExecutor"
message: "Progress updating failed 4 times. Following exception safely handled."
stage: "S0"
thread: "62"
work: "c-8756541438010208464"
worker: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
}
labels: {
compute.googleapis.com/resource_id: "397122810208336921"
compute.googleapis.com/resource_name: "beamapp-vitar-1130183512--11301035-mdna-harness-lft7"
compute.googleapis.com/resource_type: "instance"
dataflow.googleapis.com/job_id: "2018-11-30_10_35_19-13557985235326353911"
dataflow.googleapis.com/job_name: "beamapp-vitar-1130183512-742054"
dataflow.googleapis.com/region: "europe-west1"
}
logName: "projects/PROJECTNAME/logs/dataflow.googleapis.com%2Fharness"
receiveTimestamp: "2018-12-03T20:33:00.444208704Z"
resource: {
labels: {
job_id: "2018-11-30_10_35_19-13557985235326353911"
job_name: "beamapp-vitar-1130183512-742054"
project_id: PROJECTNAME
region: "europe-west1"
step_id: ""
}
type: "dataflow_step"
}
severity: "WARNING"
timestamp: "2018-12-03T20:32:59.442Z"
}
Here's the moment when it seems to start having problems:
Additional info messages that may help:
According to these messages, we don't run out of memory/processing power etc. The job is run with these parameters:
python -m start --streaming True --runner DataflowRunner --project PROJECTNAME --temp_location gs://BUCKETNAME/tmp/ --region europe-west1 --disk_size_gb 30 --machine_type n1-standard-1 --use_public_ips false --num_workers 1 --max_num_workers 1 --autoscaling_algorithm NONE
What could be the problem here?
This isn't really an answer, more a help in identifying the cause: so far, all streaming Dataflow jobs I've launched using the Python SDK have stopped that way after some days, whether they use BigQuery as a sink or not. So the cause seems rather to be the general fact that streaming jobs with the Python SDK are still in beta.
My personal solution: use the Dataflow templates to stream from Pub/Sub to BigQuery (thus avoiding the Python SDK), then schedule queries in BigQuery to process the data periodically. Unfortunately that might not be appropriate for your use case.
In my company we are experiencing the very same problem described by the OP, with a similar use case.
Unfortunately the problem is real and apparently occurs at random.
As a workaround, we are considering rewriting our pipeline using the Java SDK.
I had a similar issue and found that the warning logs contained Python stack traces, hidden in the Java logs, that reported errors.
These errors were continually retried by the workers, causing them to crash and completely freeze the pipeline. I initially thought the number of workers was too low, so I scaled it up, but the pipeline just took longer to freeze.
I ran the pipeline locally, exported the Pub/Sub messages as text, and found that they contained dirty data (messages that did not match the BQ table schema). As I had no exception handling, that seemed to be what caused the pipeline to freeze.
Adding a function that only accepts a record whose first key matches the expected column of the BQ schema fixed my issue, and the Dataflow job has been running without problems since.
def bad_records(row):
    if 'key1' in row:
        yield row
    else:
        print('bad row',row)
|'exclude bad records' >> beam.ParDo(bad_records)
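The same idea can be pushed one step earlier, so that messages which are not valid JSON are rejected as well instead of raising inside the worker. The following is only a sketch: the column names in `EXPECTED_COLUMNS` are placeholders, and `parse_and_validate` is a hypothetical helper, not something from the pipeline above.

```python
import json

# Placeholder column names; replace with the columns of the BigQuery table.
EXPECTED_COLUMNS = frozenset({"key1", "key2"})


def parse_and_validate(message, expected=EXPECTED_COLUMNS):
    """Return the decoded row, or None if the message is unusable.

    A message is rejected when it is not valid JSON, is not a JSON object,
    or lacks any of the expected columns.
    """
    try:
        row = json.loads(message)
    except (TypeError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        return None
    if not isinstance(row, dict) or not expected.issubset(row):
        return None
    return row
```

Inside a DoFn, the row would be yielded only when the result is not `None`; rejected messages could be logged or written to a separate sink for inspection.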

Looking for a boto3 python example of injecting a aws pig step into an already running emr?

I'm looking for a good boto3 example of injecting a Pig step into an AWS EMR cluster that is already running. Previously, I used boto 2.42 like this:
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep
# AWS_ACCESS_KEY = '' # REQUIRED
# AWS_SECRET_KEY = '' # REQUIRED
# conn = EmrConnection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
# loop next element on bucket_compare list
pig_file = 's3://elasticmapreduce/samples/pig-apache/do-reports2.pig'
INPUT = 's3://elasticmapreduce/samples/pig-apache/input/access_log_1'
OUTPUT = '' # REQUIRED, S3 bucket for job output
pig_args = ['-p', 'INPUT=%s' % INPUT,
            '-p', 'OUTPUT=%s' % OUTPUT]
pig_step = PigStep('Process Reports', pig_file, pig_args=pig_args)
steps = [InstallPigStep(), pig_step]
conn.run_jobflow(name='prs-dev-test', steps=steps,
                 hadoop_version='2.7.2-amzn-2', ami_version='latest',
                 num_instances=2, keep_alive=False)
The main problem now is that boto3 has neither from boto.emr.connection import EmrConnection nor from boto.emr.step import InstallPigStep, PigStep, and I can't find an equivalent set of modules.
After a bit of checking, I've found a very simple way to inject Pig script commands from within Python using the awscli and subprocess modules. One can import awscli and subprocess, and then encapsulate and inject the desired Pig steps into an already running EMR cluster with:
import awscli
import subprocess
cmd='aws emr add-steps --cluster-id j-GU07FE0VTHNG --steps Type=PIG,Name="AggPigProgram",ActionOnFailure=CONTINUE,Args=[-f,s3://dev-end2end-test/pig_scripts/AggRuleBag.pig,-p,INPUT=s3://dev-end2end-test/input_location,-p,OUTPUT=s3://end2end-test/output_location]'
push=subprocess.Popen(cmd, shell=True, stdout = subprocess.PIPE)
out, _ = push.communicate()  # wait for the CLI to finish; returncode is None until then
print(push.returncode)
Of course, you'll have to find your JobFlowID using something like:
aws emr list-clusters --active
using the same subprocess and push command as above. Of course, you can add monitoring to your heart's delight instead of just a print statement.
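Picking the cluster ID out of the listing can be done with a small helper once the JSON has been parsed. This is a sketch that assumes the output of `aws emr list-clusters --active` has been loaded with `json.loads` into a dict with a top-level `Clusters` list whose entries have `Id` and `Name`; the function name `find_cluster_id` is made up for illustration.

```python
def find_cluster_id(list_clusters_response, cluster_name):
    """Return the Id of the first listed cluster with the given Name, or None."""
    for cluster in list_clusters_response.get("Clusters", []):
        if cluster.get("Name") == cluster_name:
            return cluster.get("Id")
    return None
```

The returned value is what goes after `--cluster-id` in the `add-steps` command.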
Here is how to add a new step for a Pig job to an existing EMR cluster job flow using boto3.
Note: your script log file, input and output directories should have
the complete path in the format
's3://<bucket>/<directory>/<file_or_key>'
import boto3

emrcon = boto3.client("emr")
cluster_id1 = cluster_status_file_content #Retrieved from S3, where it was recorded on creation
step_id = emrcon.add_job_flow_steps(JobFlowId=str(cluster_id1),
    Steps=[{
        'Name': str(pig_job_name),
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig', "-l", str(pig_log_file_full_path), "-f", str(pig_job_run_script_full_path), "-p", "INPUT=" + str(pig_input_dir_full_path),
                     "-p", "OUTPUT=" + str(pig_output_dir_full_path) ]
        }
    }]
)
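If several Pig steps are submitted, the step definition can be factored into a helper that only builds the dictionary, which keeps it testable without AWS credentials. This is a sketch that follows the structure of the call above; `build_pig_step` and its parameter names are hypothetical.

```python
def build_pig_step(name, script, input_dir, output_dir, log_file=None):
    """Build one entry for the Steps list passed to add_job_flow_steps."""
    args = ["pig"]
    if log_file:
        args += ["-l", log_file]
    args += ["-f", script, "-p", "INPUT=" + input_dir, "-p", "OUTPUT=" + output_dir]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }
```

It would be used as `emrcon.add_job_flow_steps(JobFlowId=cluster_id1, Steps=[build_pig_step(...)])`; the response of that call is a dict whose `StepIds` list holds the IDs of the new steps.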