Creating an AWS Data Pipeline EMR cluster using ShellCommandActivity

When I create an AWS EMR cluster, I can do so through the simple wizard on the AWS Management Console. Once it's created I can test it, and when I'm happy with its configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the cluster.
I need to create an EMR cluster as part of my AWS Data Pipeline process. Rather than configuring the EmrCluster and then running whatever EmrActivity I want, I'm wondering if I could just copy the CLI command I exported during my testing and paste it inside a ShellCommandActivity, which would create the cluster. From there I could use either an EmrActivity to do some processing or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR cluster using a CLI command that's run through a ShellCommandActivity? And if I do so, will I be able to run an EmrActivity against that EMR cluster? I just think it would be easier to create the cluster this way, because I can use the AWS Management Console to create and test it before exporting the CLI command, rather than going through the process of properly constructing the cluster through the AWS Data Pipeline wizard/JSON process. In other words, the actual EMR wizard on the AWS Management Console is way easier than the Data Pipeline wizard for creating the cluster, especially when it comes to choosing my security groups and various configurations.
Update:
I just verified that I can in fact run a CLI command through the ShellCommandActivity to create my EMR cluster through the Data Pipeline, but is this possibly a code smell or bad practice? Are there any downsides to creating an EMR cluster on the Data Pipeline this way rather than doing it through the predefined EmrCluster object?

It's possible, but a little complicated:
The next activity, or the script itself, has to wait until the cluster has been created. Make sure that activity does not time out.
Data Pipeline does not know about the cluster, so you need to specify a workerGroup instead of runsOn in the EmrActivity. You also need to install Task Runner on the cluster.
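To illustrate the waiting part, here is a minimal Python sketch of a polling loop with a timeout. The `get_state` callable is hypothetical (for example, a thin wrapper around `aws emr describe-cluster`), and the timeout and poll interval are placeholder values, not recommendations. The AWS CLI also ships a waiter, `aws emr wait cluster-running`, which may be simpler if its built-in limits suit you.

```python
import time

# Cluster states as reported by the EMR DescribeCluster API.
READY_STATES = {"RUNNING", "WAITING"}
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(get_state, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll get_state() until the cluster is usable, has failed, or we time out.

    get_state is a hypothetical callable that returns the current state string.
    """
    waited = 0
    while True:
        state = get_state()
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        if waited >= timeout_s:
            raise TimeoutError(f"cluster still {state} after {waited}s")
        sleep(poll_s)
        waited += poll_s
```

Whatever timeout you pick here should be shorter than the timeout of the activity that runs the script, otherwise the pipeline gives up first.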

Related

How to run PySpark on AWS EMR with AWS Lambda

How may I make my PySpark code to run with AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
You need a transient cluster for this case, which will auto-terminate once your job is completed or the timeout is reached, whichever occurs first.
You can access this link on how to initialise the same.
What are the processes available to create an EMR cluster:
Using boto3 / AWS CLI / Java SDK
Using CloudFormation
Using Data Pipeline
Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
No. It isn’t mandatory to use Lambda to create an auto-terminating cluster.
You just need to specify the --auto-terminate flag when creating the cluster with the CLI (with boto3 or the Java SDK, the equivalent is setting KeepJobFlowAliveWhenNoSteps to false). In this case you need to submit the job along with the cluster config.
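As a rough sketch of what "submit the job along with the cluster config" can look like with boto3, the function below only builds the request dictionary; you would pass it as `boto3.client("emr").run_job_flow(**request)`. The release label, instance types, instance count, role names and S3 paths are all placeholder assumptions and not values from this thread.

```python
def build_transient_cluster_request(name, script_s3_uri, log_s3_uri):
    """Build run_job_flow arguments for a cluster that runs one Spark step and exits."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",          # placeholder release
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # placeholder instance type
            "SlaveInstanceType": "m5.xlarge",   # placeholder instance type
            "InstanceCount": 3,                 # placeholder size
            # False means: terminate once there are no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-pyspark-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }
```

The same dictionary works whether the call is made from a Lambda handler, a cron job or a laptop, which is why Lambda is optional here.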
Note:
It's not possible to create an auto-terminating cluster using CloudFormation. By design, CloudFormation assumes that the resources being created will be permanent to some extent.
If you REALLY had to do it this way, you could make an AWS API call to delete the CF stack upon finishing your EMR tasks.
How may I make my PySpark code to run with AWS EMR from AWS Lambda?
You can design your Lambda to submit the Spark job. You can find an example here.
In my use case I have one parameterised Lambda which invokes CF to create the cluster, submits the job and terminates the cluster.

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default to speed up the process?
To set the number of reducers, you can use the mapreduce.job.reduces property, as below:
s3-dist-cp -Dmapreduce.job.reduces=10 --src hdfs://path/to/data/ --dest s3://path/to/s3/
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
You can call S3DistCp by adding it as a step in your existing EMR cluster. Steps can be added to a cluster at launch or to a running cluster using the console, AWS CLI, or API.
So you control the number of workers during EMR cluster creation, or you can resize an existing cluster. You can check the exact steps in the EMR docs.
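If you add S3DistCp as a step through the API rather than typing the command on the master node, the step could be described roughly like this. It is a sketch only: the step name and failure action are arbitrary choices, and the -D option is simply carried over from the command shown earlier in this answer.

```python
def build_s3distcp_step(src, dest, reducers=None):
    """Build an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    args = ["s3-dist-cp"]
    if reducers is not None:
        # Same property as in the command-line example above.
        args.append(f"-Dmapreduce.job.reduces={reducers}")
    args += ["--src", src, "--dest", dest]
    return {
        "Name": "s3-dist-cp",            # arbitrary step name
        "ActionOnFailure": "CONTINUE",   # arbitrary choice for this sketch
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }
```

With boto3, a step like this can be passed to `add_job_flow_steps(JobFlowId=..., Steps=[step])` for a running cluster, or included in the `Steps` list at launch.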

How to run glue script from Glue Dev Endpoint

I have a Glue script (test.py) written in, say, an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (I could also store it in an S3 bucket). A Glue endpoint is basically an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I'm more interested to know if I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
For development and testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so you have access to the data catalog, crawlers, etc., and also to the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a Job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer here and to "setting up zeppelin on windows" for help with setting up a local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you set up the zeppelin notebook, you can copy the script(test.py) to the zeppelin notebook, and run from the zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility: you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards

Security-Configuration Field For AWS Data Pipeline EmrCluster

I created an AWS EMR Cluster through the regular EMR Cluster wizard on the AWS Management Console and I was able to select a security-configuration e.g., when you export the CLI command it's --security-configuration 'mySecurityConfigurationValue'.
I now need to create a similar EMR through the AWS Data Pipeline but I don't see any options where I can specify this security-configuration field.
The only similar fields I see are EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, AdditionalSlaveSecurityGroups, AdditionalMasterSecurityGroups, and SubnetId. I already have all of those filled out in my Pipeline configuration but I just need to also specify the security-configuration. Any thoughts?
Unfortunately, DataPipeline does not support the Security Configurations feature (as well as other features that were introduced in the EMR 5.x versions like using a custom AMI).
One solution for this is to:
1. Replace the EmrCluster in your pipeline with an EC2 resource
2. Use a ShellCommandActivity on the EC2 resource to run the aws emr create-cluster CLI command
3. Use a bootstrap step to install TaskRunner on the cluster
4. Replace all the runsOn properties in your pipeline with workerGroup so the tasks run on the EMR cluster you created in step 2
5. Add a final ShellCommandActivity at the end of the pipeline to terminate the cluster using the CLI
Now since you are spinning up your cluster using the CLI you have access to all kinds of features like security configurations, custom AMI, instance fleets, etc. and you can still orchestrate the tasks using DataPipeline.
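The steps above could be laid out in a pipeline definition roughly as follows. This is a sketch, not a tested pipeline: the object ids, the `terminateAfter` value and the `processing_command` parameter are made up for illustration, the schedule and Default objects are omitted, and the Task Runner bootstrap (step 3) is assumed to live inside the create-cluster command you exported. How the cluster ID gets from the create activity to the terminate activity is left open here.

```python
import json


def build_pipeline_definition(create_cmd, processing_command, terminate_cmd, worker_group):
    """Return a Data Pipeline definition dict following the five steps above."""
    return {
        "objects": [
            {   # Step 1: a plain EC2 resource replaces the EmrCluster object.
                "id": "LauncherInstance",
                "name": "LauncherInstance",
                "type": "Ec2Resource",
                "terminateAfter": "2 Hours",  # placeholder value
            },
            {   # Step 2: run the exported `aws emr create-cluster ...` command.
                "id": "CreateCluster",
                "name": "CreateCluster",
                "type": "ShellCommandActivity",
                "command": create_cmd,
                "runsOn": {"ref": "LauncherInstance"},
            },
            {   # Step 4: workerGroup instead of runsOn, so the Task Runner
                # installed on the EMR cluster picks this activity up.
                "id": "ProcessOnCluster",
                "name": "ProcessOnCluster",
                "type": "ShellCommandActivity",
                "command": processing_command,
                "workerGroup": worker_group,
                "dependsOn": {"ref": "CreateCluster"},
            },
            {   # Step 5: terminate the cluster from the launcher instance.
                "id": "TerminateCluster",
                "name": "TerminateCluster",
                "type": "ShellCommandActivity",
                "command": terminate_cmd,
                "runsOn": {"ref": "LauncherInstance"},
                "dependsOn": {"ref": "ProcessOnCluster"},
            },
        ]
    }


if __name__ == "__main__":
    definition = build_pipeline_definition(
        "aws emr create-cluster ...",            # paste the exported command here
        "echo processing",                       # hypothetical work
        "aws emr terminate-clusters --cluster-ids <cluster-id>",
        "emr-workers",                           # hypothetical worker group name
    )
    print(json.dumps(definition, indent=2))
```

The worker group name has to match the one Task Runner is started with on the cluster, otherwise the processing activity will sit waiting for a runner.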

Using AWS Data pipeline - EMR vs EC2

I would like to use AWS Data Pipeline to execute an ETL process.
Suppose that my process has a small input file and I would like to use a custom jar or Python script to make data transformations. I don't see any reason to use an EMR cluster for this simple data step, so I would like to execute it on a single EC2 instance.
Looking at the EmrActivity object in AWS Data Pipeline, I only see the option to run on an EMR cluster.
Is there way to run a computation step inside a EC2 instance?
Is it the best solution for this use case?
Or is it better to set up a small EMR cluster (with a single node) and execute a Hadoop job?
If you don't need the EMR cluster or the Hadoop framework and your execution can easily run on a single instance, then you can use the ShellCommandActivity associated with an Ec2Resource (an instance) to perform the work. A simple example is at http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-getting-started.html
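For a rough idea of the shape, a ShellCommandActivity tied to an Ec2Resource might be defined like this. The ids, the `terminateAfter` value and the example command are assumptions for illustration, and schedule and Default objects are left out.

```python
import json


def build_single_instance_pipeline(command):
    """Return a minimal definition: one EC2 instance running one shell command."""
    return {
        "objects": [
            {
                "id": "WorkerInstance",
                "name": "WorkerInstance",
                "type": "Ec2Resource",
                "terminateAfter": "1 Hour",  # placeholder value
            },
            {
                "id": "TransformData",
                "name": "TransformData",
                "type": "ShellCommandActivity",
                "command": command,
                # runsOn points the activity at the instance defined above.
                "runsOn": {"ref": "WorkerInstance"},
            },
        ]
    }


if __name__ == "__main__":
    # transform.py is a hypothetical script name.
    print(json.dumps(build_single_instance_pipeline("python transform.py"), indent=2))
```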