Using AWS Data Pipeline - EMR vs EC2

I would like to use AWS Data Pipeline to execute an ETL process.
Suppose that my process has a small input file and I would like to use a custom JAR or Python script to make the data transformations. I don't see any reason to use an EMR cluster for such a simple data step, so I would like to execute it on a single EC2 instance.
Looking at the AWS Data Pipeline EmrActivity object, I only see the option to run on an EMR cluster.
Is there a way to run a computation step on a single EC2 instance?
Is that the best solution for this use case?
Or is it better to set up a small EMR cluster (with a single node) and execute a Hadoop job?

If you don't need the EMR cluster or the Hadoop framework and your job can easily run on a single instance, then you can use a ShellCommandActivity associated with an Ec2Resource (a single instance) to perform the work. A simple example is at http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-getting-started.html
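A minimal sketch of such a pipeline, assuming the default Data Pipeline roles; the object names and the transform script path are hypothetical:

# Hypothetical sketch: run a transform script on one EC2 instance
# instead of an EMR cluster.
cat > pipeline.json <<'EOF'
{
  "objects": [
    { "id": "Default", "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "scheduleType": "ondemand" },
    { "id": "SingleInstance", "name": "SingleInstance",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "1 Hour" },
    { "id": "TransformStep", "name": "TransformStep",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "SingleInstance" },
      "command": "python /home/ec2-user/transform.py" }
  ]
}
EOF
aws datapipeline create-pipeline --name etl-on-ec2 --unique-id etl-on-ec2
aws datapipeline put-pipeline-definition --pipeline-id <pipeline-id> --pipeline-definition file://pipeline.json
aws datapipeline activate-pipeline --pipeline-id <pipeline-id>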

Related

Turn on/off AWS EMR clusters

How can I turn EMR clusters on and off? The console only offers the option to terminate a cluster permanently. What if I do not need the cluster at night and do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which allows the EMR cluster to read data directly from S3. This way, there is no need to copy any data from S3 to HDFS.
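For example, with EMRFS a job addresses its input by s3:// URI directly; a quick check from the master node (the bucket and path are placeholders):

hadoop fs -ls s3://my-bucket/input/   # reads straight from S3, no copy to HDFS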
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM/CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via the AWS CLI, and schedule such jobs to run in the morning and in the evening.
In my experience, resizing works well for task nodes, while resizing core nodes requires an HDFS sync that only works if you aren't running any tasks on the cluster.
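A rough sketch of such a scheduled resize via the AWS CLI (the cluster and instance-group IDs are placeholders):

# Find the task instance group, then resize it; schedule these calls
# for the morning and the evening.
aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX
aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=5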

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default to speed up the process?
To set the number of reducers, you can use the property mapreduce.job.reduces, similar to the following:
s3-dist-cp -Dmapreduce.job.reduces=10 --src hdfs:///path/to/data/ --dest s3://path/to/s3/
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
You can call S3DistCp by adding it as a step in your existing EMR cluster. Steps can be added to a cluster at launch or to a running cluster using the console, AWS CLI, or API.
So you control the number of workers when you create the EMR cluster, or you can resize an existing cluster. You can check the exact steps in the EMR docs.
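As a hedged sketch, adding S3DistCp as a step to a running release-label cluster via the AWS CLI could look like this (the cluster ID and paths are placeholders):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
    'Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=command-runner.jar,Args=[s3-dist-cp,-Dmapreduce.job.reduces=10,--src,hdfs:///path/to/data/,--dest,s3://path/to/s3/]'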

Creating an AWS Data Pipeline EMR cluster using ShellCommandActivity

When I create an AWS EMR cluster I can do so through the simple wizard in the AWS Management Console. Once it's completed I can test it out, and when I'm happy with its configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the cluster.
I need to create an EMR cluster as part of my AWS Data Pipeline process. Rather than configuring the EmrCluster and then running whatever EmrActivity I want, I'm wondering if I could just take the CLI command I exported during my testing and paste it inside a ShellCommandActivity that creates the cluster. From there I could use either an EmrActivity to do some processing, or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR cluster using a CLI command run through a ShellCommandActivity? And if I do so, will I be able to run an EmrActivity against that cluster? I just think it would be easier to create the cluster this way, because I can use the AWS Management Console to create and test it before exporting the CLI command, rather than going through the process of properly constructing it through the Data Pipeline wizard/JSON process. That is, the actual EMR wizard in the AWS Management Console is much easier than the Data Pipeline wizard for creating the cluster, especially when it comes to choosing security groups and various configurations.
Update:
I just verified that I can in fact run a CLI command through the ShellCommandActivity to create my EMR cluster through the Data Pipeline, but is this possibly a code smell or bad practice? Are there any downsides to creating an EMR cluster on the Data Pipeline this way rather than through the predefined EmrCluster object?
It's possible, but a little complicated:
The following action, or the script itself, would have to wait for the cluster to be created. Make sure the action does not time out.
Data Pipeline does not know about the cluster, hence you need to specify a workerGroup instead of runsOn in the EmrActivity. You also need to install Task Runner on the cluster.
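Very roughly, the script run by the ShellCommandActivity could look like the sketch below; the cluster settings are placeholders, and in practice you would paste the create-cluster command exported from the console:

#!/bin/bash
# Create the cluster and capture its ID.
CLUSTER_ID=$(aws emr create-cluster \
    --name "pipeline-cluster" \
    --release-label emr-5.0.0 \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --query ClusterId --output text)

# Block until the cluster is up, so the activity does not finish
# (or time out) before the cluster actually exists.
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"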

Use same EC2 instance for all AWS Data Pipeline activities

I am using AWS Data Pipeline to import some CSV data from S3 to Redshift. I also added a ShellCommandActivity to remove all the S3 files after the copy activity completes. I attached a picture of the whole process.
Everything works fine, but each activity starts its own EC2 instance. Is it possible for the ShellCommandActivity to reuse the same EC2 instance as the RedshiftCopyActivity, after the copy command completes?
Thank you!
Unless you can do all the activities in shell or via the CLI, it is not possible to do everything on the same instance.
One suggestion I can give is to move on to newer technologies. AWS Data Pipeline is outdated (4 years old). You should use AWS Lambda, which will cost you a fraction of what you are paying, and you can load the files into Redshift as soon as they are uploaded to S3. Cleanup is automatic, and Lambda is much more powerful than AWS Data Pipeline. The tutorial A Zero-Administration Amazon Redshift Database Loader is the one you want. Yes, there is some learning curve, but as the title suggests, it is a zero-administration loader.
In order for the ShellCommandActivity to run on the same EC2 instance, I edited the ShellCommandActivity using Architect, and for its Runs On option I chose the same Ec2Instance. The ShellCommandActivity then gets mapped automatically to the same Ec2Instance as the RedshiftCopyActivity. Now the whole process looks like this:
Thank you!
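In pipeline-definition terms, the fix amounts to pointing both activities' runsOn at the same Ec2Resource. A hypothetical fragment (the object names are made up, and the data nodes and Redshift connection details are omitted):

# Hypothetical fragment: both activities reference the same Ec2Resource,
# so one instance serves the whole run.
cat > fragment.json <<'EOF'
{
  "objects": [
    { "id": "SharedInstance", "type": "Ec2Resource",
      "terminateAfter": "2 Hours" },
    { "id": "CopyToRedshift", "type": "RedshiftCopyActivity",
      "runsOn": { "ref": "SharedInstance" } },
    { "id": "CleanupS3", "type": "ShellCommandActivity",
      "runsOn": { "ref": "SharedInstance" },
      "dependsOn": { "ref": "CopyToRedshift" },
      "command": "aws s3 rm s3://my-bucket/input/ --recursive" }
  ]
}
EOF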

Automate AWS instance start and stop

I'm running an instance in AWS and it runs non-stop every day. It is an Ubuntu EC2 instance running Apache, the Mirth Connect tool, and a LAMP stack. I want to run this instance only during a particular time window each day. I would prefer not to use any additional AWS services such as CloudWatch. Is there a way to achieve this?
The main purpose is to use Mirth Connect to fetch data from a MySQL database.
There are 3 solutions.
AWS Data Pipeline - You can schedule the instance start/stop just like cron. It will cost you one hour of a t1.micro instance for every start/stop.
AWS Lambda - Define a Lambda function that gets triggered at a predefined time. Your Lambda function can start/stop instances. Your cost will be minimal, or even $0.
Shell script - Write a shell script and run it as a cron job or on demand. The script will use AWS CLI commands to start and stop the instance (see the sketch below).
I used Data Pipeline for a long time before moving to Lambda. With Data Pipeline it is trivial: just paste the AWS CLI commands to stop and start the instances. Lambda is more involved.
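For options 1 and 3 the underlying AWS CLI calls are the same; a sketch of the cron variant (the instance ID and the times are placeholders):

# /etc/cron.d/ec2-schedule -- start at 08:00, stop at 20:00, weekdays only.
0 8  * * 1-5  ec2user  aws ec2 start-instances --instance-ids i-0123456789abcdef0
0 20 * * 1-5  ec2user  aws ec2 stop-instances  --instance-ids i-0123456789abcdef0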
I guess for that you'll need another machine that is on 24x7, on which you can run a cron job written in Python using boto, or in any other language such as bash.
I don't see how you can start an instance that is in the stopped state without using another machine.
Or you can have a simple Raspberry Pi running at home which does the on/off work for you using the AWS CLI or simple Python. How about that? ;)