Turn on/off AWS EMR clusters - amazon-web-services

How can I turn EMR clusters on and off? The only option I can find is to terminate a cluster permanently. What if I don't need the cluster at night and don't want to create a new one every morning?

You can't do this. Stopping an EMR cluster is not supported; you simply terminate it when you don't need it.
To protect your data, keep it in S3 and access it through EMRFS, which lets an EMR cluster read and write S3 directly. This way, there is no need to copy any data from S3 to HDFS, and nothing is lost when the cluster is terminated.

You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM or CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via the AWS CLI, and schedule it to run in the morning and in the evening.
In my experience, resizing works well on task nodes, while resizing core nodes requires an HDFS sync that works only if no tasks are running on the cluster.
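A scheduled resize job like the one described above could be sketched as follows. This is a minimal illustration, not a definitive implementation: it only builds the `aws emr modify-instance-groups` command line, and the cluster ID, instance group ID, and node counts are hypothetical placeholders. It assumes the group being resized is a task instance group.

```python
from __future__ import annotations

import shlex


def resize_command(cluster_id: str, instance_group_id: str, count: int) -> list[str]:
    """Build the AWS CLI call that sets the target size of one instance group."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [
        "aws", "emr", "modify-instance-groups",
        "--cluster-id", cluster_id,
        "--instance-groups",
        f"InstanceGroupId={instance_group_id},InstanceCount={count}",
    ]


# Hypothetical identifiers; look up real ones with `aws emr list-instance-groups`.
CLUSTER_ID = "j-EXAMPLE"
TASK_GROUP_ID = "ig-EXAMPLE"

evening = resize_command(CLUSTER_ID, TASK_GROUP_ID, 0)  # shrink task nodes overnight
morning = resize_command(CLUSTER_ID, TASK_GROUP_ID, 4)  # restore capacity (assumed size)

# Print the commands; a scheduler (cron, for example) could run them instead.
print(shlex.join(evening))
print(shlex.join(morning))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)` on a machine where the AWS CLI is installed and configured. The master and core nodes keep running in this setup, so it reduces cost but does not bring it to zero.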

Related

Is it a good practice to have an AWS EMR standing cluster always running structured streaming?

I have a Spark Structured Streaming job which takes data as input from AWS MSK (Kafka) and writes to AWS S3. Is it a good idea to have a standing AWS EMR cluster always running this job? Or are there better ways to manage this infrastructure?
Please let me know if you need further details.
You need some pool of workers that is always consuming and writing, so a standing cluster of some kind is unavoidable.
Your other options include running Spark on EKS instead of YARN on EMR, or dropping Spark and using the Kafka Connect S3 Sink on an EC2/EKS cluster instead.

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default in order to speed up the process?
To set the number of reducers, you can use the property mapreduce.job.reduces, as below:
s3-dist-cp -Dmapreduce.job.reduces=10 --src hdfs://path/to/data/ --dest s3://path/to/s3/
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon EMR cluster. It also copies in the other direction, from HDFS to Amazon S3.
You can call S3DistCp by adding it as a step in your existing EMR cluster. Steps can be added to a cluster at launch or to a running cluster using the console, AWS CLI, or API.
So you control the number of workers when you create the EMR cluster, or you can resize an existing cluster. You can check the exact steps in the EMR docs.
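Adding S3DistCp as a step to a running cluster, with the reducer count from the command above, might look like the following sketch. It only builds the `aws emr add-steps` command line. The cluster ID, the step name, and the `CONTINUE` failure action are illustrative assumptions, and `command-runner.jar` assumes an EMR release of 4.x or later.

```python
from __future__ import annotations

import json


def s3distcp_add_steps(cluster_id: str, src: str, dest: str, reducers: int) -> list[str]:
    """Build an `aws emr add-steps` call that runs s3-dist-cp as a cluster step."""
    step = {
        "Type": "CUSTOM_JAR",
        "Name": "S3DistCp copy",        # hypothetical step name
        "Jar": "command-runner.jar",    # assumes EMR release 4.x or later
        "ActionOnFailure": "CONTINUE",  # assumed; choose what suits your job
        "Args": [
            "s3-dist-cp",
            f"-Dmapreduce.job.reduces={reducers}",
            "--src", src,
            "--dest", dest,
        ],
    }
    # The --steps option accepts a JSON list of step definitions.
    return [
        "aws", "emr", "add-steps",
        "--cluster-id", cluster_id,
        "--steps", json.dumps([step]),
    ]


cmd = s3distcp_add_steps(
    "j-EXAMPLE",                 # hypothetical cluster ID
    "hdfs://path/to/data/",
    "s3://path/to/s3/",
    reducers=10,
)
print(cmd)
```

The step definition is passed as JSON rather than in the CLI shorthand syntax so that the `=` inside the `-D` argument needs no extra quoting.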

Want a clear big picture of AWS Glue

I want a clear big picture of AWS Glue with regard to the following aspects.
How does AWS Glue prepare and provision its infrastructure? It's serverless, but how does it manage that behind the scenes?
How does it use Apache Spark and Hadoop to run so many ETL jobs at a time, for hundreds of AWS Glue customers in every region?
Thanks
AWS Glue uses EMR underneath. When a new job starts, it spawns a new cluster with the required number of executors (depending on the configured DPUs). However, to improve cold start time, they keep a buffer of already provisioned EMR clusters for the most common numbers of DPUs. To manage all this, they have a set of automated services that monitor the state of each cluster, start new ones, and so on.

Creating an AWS Data Pipeline EMR cluster using ShellCommandActivity

When I create an AWS EMR cluster I can do so through the simple wizard on the AWS Management Console. Once it's completed I can test it out, and when I'm happy with its configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the cluster.
I need to create an EMR cluster as part of my AWS Data Pipeline process. Rather than configuring the EmrCluster and then running whatever EmrActivity I want, I'm wondering if I could just copy the CLI command I exported during my testing and paste it inside a ShellCommandActivity, which would create the cluster. From there I could use either an EmrActivity to do some processing or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR cluster using a CLI command that's run through a ShellCommandActivity? And if I do so, will I be able to run an EmrActivity against that EMR cluster? I think it would be easier to create the cluster this way because I can use the AWS Management Console to create and test it before exporting the CLI command, rather than constructing it through the AWS Data Pipeline wizard/JSON process. That is, the EMR wizard on the AWS Management Console is much easier to use than the Data Pipeline wizard for creating the cluster, especially when it comes to choosing my security groups and various configurations.
Update:
I just verified that I can in fact run a CLI command through the ShellCommandActivity to create my EMR cluster through the Data Pipeline, but is this possibly a code smell or bad practice? Are there any downsides to creating an EMR cluster in Data Pipeline this way rather than through the predefined EmrCluster object?
It's possible, but a little complicated:
Either the script itself or the action that follows it has to wait for the cluster to be created. Make sure that action does not time out.
The data pipeline does not know about the cluster, so you need to specify a workerGroup instead of runsOn in the EmrActivity. You also need to install Task Runner on the cluster.
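The waiting step in the first point could be sketched like this. It assumes the exported `aws emr create-cluster` command prints its default JSON output, which contains a `ClusterId` field, and that the script then blocks on the CLI waiter `aws emr wait cluster-running`. The sample output string and cluster ID are hypothetical.

```python
from __future__ import annotations

import json


def cluster_id_from_create_output(output: str) -> str:
    """Extract the cluster ID from the JSON printed by `aws emr create-cluster`."""
    return json.loads(output)["ClusterId"]


def wait_command(cluster_id: str) -> list[str]:
    """Build the CLI call that blocks until the cluster reaches a running state."""
    return ["aws", "emr", "wait", "cluster-running", "--cluster-id", cluster_id]


# Hypothetical output of the exported create-cluster command.
sample_output = '{"ClusterId": "j-EXAMPLE"}'

wait_cmd = wait_command(cluster_id_from_create_output(sample_output))
print(" ".join(wait_cmd))
```

The waiter stops polling after a bounded number of attempts and exits with a non-zero status, so the activity's own timeout should be at least as generous as the expected cluster start-up time.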

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster, or at least of the namenode, and then reuse it whenever I want to create other clusters. But after some searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot? Or is there an alternative that would help with my use case?
Appreciate any kind of help.
Thanks
It is not possible to create a snapshot of an EMR cluster node, and older EMR releases do not accept a custom AMI when running a cluster (support for custom Amazon Linux AMIs arrived in EMR 5.7.0). However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your own bootstrap scripts and use them every time you launch a new cluster. This gives you functionality similar to what you are looking for.
For more information using bootstrap actions on EMR please visit: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
Let us know if you need any further assistance.
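As a rough sketch of the bootstrap-action approach, the snippet below builds an `aws emr create-cluster` command that runs a custom script on every node at launch. The cluster name, release label, instance settings, script location, and action name are all hypothetical placeholders, and `--use-default-roles` assumes the default EMR roles already exist in the account.

```python
from __future__ import annotations


def create_cluster_command(
    name: str,
    release_label: str,
    bootstrap_script: str,
    instance_type: str,
    instance_count: int,
) -> list[str]:
    """Build a create-cluster call that attaches one custom bootstrap action."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        # The script must be stored in S3; it runs on each node at launch.
        "--bootstrap-actions", f"Path={bootstrap_script},Name=install-tools",
    ]


create_cmd = create_cluster_command(
    name="on-demand-cluster",                              # hypothetical
    release_label="emr-6.15.0",                            # placeholder release
    bootstrap_script="s3://my-bucket/bootstrap/setup.sh",  # hypothetical path
    instance_type="m5.xlarge",                             # placeholder
    instance_count=3,                                      # placeholder
)
print(" ".join(create_cmd))
```

Keeping the command in a script or template gives the "predefined configuration" effect of a snapshot: every new cluster is launched with the same settings and the same installed software.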