Architect a Cloudera CDH cluster on AWS: instances and storage - amazon-web-services

I have some doubts about deploying CDH on AWS. I have read the reference architecture doc and other material on the Cloudera Engineering Blog, but I need some more guidance.
1) Is CDH deployable only on certain kinds of instances, or can I deploy it on any AWS instance type?
2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster, I understand it is better to base the cluster on local-storage instances. For a cluster of 2 PB, I think d2.8xlarge would be the best choice for the datanodes.
About the Master Nodes:
- If I deploy only 3 Master Nodes, is it better to run them on local-storage instances too, or on EBS-backed instances so I can react quickly to a possible Master Node failure?
- Are there any best practices for the master node instance type (EBS or local storage)?
About the Data Nodes:
- If a data node fails, does CDH have an automated mechanism to spin up a new instance and join it to the cluster so the cluster recovers without downtime, or do we have to build a script from scratch for that?
About the Edge Nodes:
- Are there any best practices for the instance type (EBS or local storage)?
3) If I want to back up the cluster to S3:
- When I distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?
- If the data is compressed (e.g. snappy, gzip) and I distcp it to S3, is the space occupied on S3 the same, or does distcp decompress the data during the copy?
If I have a cluster based on EBS-backed instances, is it possible to snapshot the disks and bring a datanode back with its EBS disks rebuilt from the snapshots?
4) If the Data Nodes are deployed as r4.8xlarge and I need more horsepower, is it possible to scale the cluster up from r4.8xlarge to r4.16xlarge on the fly, attaching and detaching the disks in a few minutes?
Thanks a lot for the clarifications; I hope these questions will help other users as well.

1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro size instance would work for much of anything. A type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
2) You should stick with EBS for the root volume of your instances. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time, but it doesn't look like you can write directly to Glacier through the s3a connector. I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3 for sure. You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance; this isn't necessarily a great way to back up datanodes compared with the distcp route, because each datanode is unique and its data changes as the cluster runs.
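If you do go the snapshot route for a specific volume, a minimal AWS CLI sketch looks something like this (all IDs, the availability zone, and the device name are placeholders):
# Snapshot an existing data volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "datanode backup"
# Later: rebuild a volume from that snapshot in the desired AZ...
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
# ...and attach it to the replacement datanode instance
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf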
4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks, but you do have to stop the instance to change its type.
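As a rough sketch of that sequence with the AWS CLI (the instance ID is a placeholder; the EBS volumes stay attached across the stop/start):
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
# Change the type while the instance is stopped
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"r4.16xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0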

Point 3 only:
You need to distcp to S3 and then transition that data to Glacier via S3 lifecycle settings.
distcp doesn't do anything to the data itself (compression, etc.).
See the Hortonworks doc on distcp and S3 and read its warnings/caveats. In particular, incremental distcp isn't checksum-based, and "atomic" distcp isn't atomic, it's just really slow distcp.
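A hedged sketch of that flow, with the bucket name, paths, and the 30-day threshold all placeholders (s3a credentials are assumed to be configured already):
# Copy HDFS data to S3 over the s3a connector
hadoop distcp hdfs://namenode:8020/data/warehouse s3a://my-backup-bucket/warehouse
# Add a lifecycle rule that transitions those objects to Glacier after 30 days
aws s3api put-bucket-lifecycle-configuration --bucket my-backup-bucket --lifecycle-configuration '{
  "Rules": [{
    "ID": "archive-to-glacier",
    "Filter": {"Prefix": "warehouse/"},
    "Status": "Enabled",
    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
  }]
}'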

Related

How to take a backup of EC2 instance in AWS and move to a low cost alternative?

We have an EC2 instance running in AWS. Our ML algorithms and data live on that instance, and we have also hosted a web-based interface on the same machine.
Now there is no new development happening on that EC2 instance. We would like to terminate our AWS subscription for a short period of time (for cost reduction and to explore new cloud services). Most importantly, we want to be in a position where we can purchase a new EC2 instance with a fresh AWS subscription, use the backup we take now, and resume all operations (web backend, SMS services for our app hosted in AWS, etc.).
What is the best way to do it? Is temporary termination of AWS subscription advisable?
There is no concept of an "AWS Subscription". AWS is charged on-demand, which means you only pay when you use resources.
If you temporarily do not want the Amazon EC2 instance, you could:
Stop the instance, which is like turning off the power. You will not be charged for the instance, but you will still pay for the disk storage attached to the instance. You can simply Start the instance again when you wish to use it. You will only be charged while the instance is running. OR
Create an image of the instance, then terminate the instance. This will create an Amazon Machine Image (AMI), which contains a copy of the disks. You can then launch a new Amazon EC2 instance from the AMI when you wish to use it again. This is a lower-cost option compared to simply stopping the instance, but it takes more effort to stop/start.
It is quite common for companies to stop Amazon EC2 instances at night or over the weekend to reduce costs while they are not needed.
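A minimal AWS CLI sketch of both options above (the instance ID and image name are placeholders):
# Option 1: stop now, start again later (EBS storage is still billed while stopped)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
# Option 2: capture an AMI, then terminate the instance
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "ml-server-backup" --description "Pre-termination backup"
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0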
EDIT: Just thought of a third option. Will test it and be back. Not worth it; it would involve creating an image from the EC2 instance, converting that image to a VM image, and storing the VM image in S3. There may be some advantages to this, but I do not see them.
I think you have two options, both of them very reasonably priced.
If you can separate the data from the operating system, then your best option would be to use an S3 bucket as a file system within the EC2 instance. Your EC2 instance would use this bucket to store all your "ML algorithms and data" and, possibly, even your "web-based interface". Whenever you decide that you no longer need the processing capacity of the EC2, you would unmount the S3 bucket file system from the EC2 instance and terminate that instance. After configuring an appropriate lifecycle rule for the S3 bucket, it would transition to Glacier, or even Glacier Deep Archive [you must consider the different options for long-term storage].
In the future, whenever you want to work with your data again, you would move your data from Glacier back to S3, create a new EC2 instance, install your applications, mount your S3 bucket as a file system, and you would have access to all your data. I think this is the least expensive option with the shortest recovery time objective. To implement this option, look at my answer to this question; everything you need to use an S3 bucket as a regular folder inside the EC2 instance is there.
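As a rough illustration of that first option, s3fs (mentioned again further down this page) is one tool that can present a bucket as a folder; the bucket name and mount point below are placeholders, and bucket access is assumed to come from an attached IAM role:
# s3fs-fuse is packaged in EPEL on most distros; adjust for yours
sudo yum install -y s3fs-fuse
sudo mkdir -p /mnt/ml-data
# Mount the bucket using the instance profile for credentials
sudo s3fs my-ml-bucket /mnt/ml-data -o iam_role=auto -o allow_other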
The second option provides an integrated solution, meaning the operating system and the data stay together, and allows you to restore everything as it was the day you stopped processing your data. It's made up of the following cycle:
Shut down your EC2 instance and make a note of all the specs [you will need them further down].
Export your instance to a virtual image, vmdk for example, and store it in your S3 bucket. Something like this:
aws ec2 create-instance-export-task --instance-id i-0d54b0682aa3998a0 \
  --target-environment vmware \
  --export-to-s3-task DiskImageFormat=VMDK,ContainerFormat=ova,S3Bucket=sm-vm-backup,S3Prefix=vms
Configure an appropriate lifecycle rule for the S3 bucket so that it transitions to Glacier, or even Glacier Deep Archive.
Terminate the EC2 instance.
In the future you will need to do the inverse, so you will first need to restore the archived S3 object [make sure you can live with the time AWS needs to do this].
Import the virtual image as an EC2 AMI, something like this [this is not complete - you will need some more options that you saved above]:
aws ec2 import-image --disk-containers \
  Format=ova,UserBucket="{S3Bucket=sm-vm-backup,S3Key=vmsexport-i-0a1c382e740f8b0ee.ova}"
Create an EC2 instance based on the image and you're back in business.
Obviously you should do some trial runs, and even automate the entire process if it is something that will be done frequently. I have a feeling, based on what you said, that the first option is the better one, provided you can easily install whatever applications you use.
I'm assuming that you launched an EC2 instance from a base Amazon Machine Image and then added your own software and models to it, as opposed to launching an EC2 instance from an AWS Marketplace offering.
The simplest thing to do is to create an Amazon Machine Image (AMI) from your running EC2 instance. That will capture the current state of the instance and persist it in your AWS account. Then you can terminate the instance. Later, when you want to recreate it, launch a new instance, selecting the saved AMI instead of a standard AMI.
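Relaunching from the saved AMI later is then a single call; a hedged sketch (the AMI ID, instance type, and key name are placeholders):
# Launch a new instance from the AMI you saved earlier
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type t3.medium --key-name my-key --count 1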
An alternative is to avoid the need to capture machine state at all, by using standard DevOps practices to revision-control everything you need to recreate the state of a running machine.
Note that there are costs associated with an AMI, though they are minimal ($0.05 per GB-month of data stored, for example).
I had contacted AWS customer care regarding this issue. Given below is the response I received. Please add your comments on which option might be good for me.
Note: I acknowledge the AWS customer care team for their help.
I understand that you require some information on cost saving for your Instance since you will not be utilizing the service for a while.
To assist you with this I would recommend checking out the Instance Stop/Start link here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html
When you stop an Instance, you do not lose any data & you are not charged for the resources any further. However please keep in mind that you will still be charged for any EBS Storage Volumes attached to the stopped Instance(s).
I also recommend checking out the below links on how you can reduce your costs:
https://aws.amazon.com/premiumsupport/knowledge-center/reduce-aws-bill/
https://aws.amazon.com/blogs/compute/10-things-you-can-do-today-to-reduce-aws-costs/
That being said, please note that as I am in the billing department, for the best assistance with the various plans you will require the assistance of our Sales Team. The Sales Team will be able to assist with ways to save while maintaining your configurations. You will be able to reach the Sales Team here: https://aws.amazon.com/websites/contact-us/
Once you have completed the details in the link, a member of the team will be in touch with you at their soonest.

Multi region EC2 & RDS replication from Region A to various other regions

Our current setup, consisting of 2x EC2 instances and an RDS (Read/Write) database, is in the Mumbai region. I would like to copy everything (2x EC2 & RDS (R/W)) across to Sydney, and to other regions as well.
Ideally I would like to replicate the contents in those instances as well.
Does anyone know a quick and easy way of doing this?
Edit 25/01/2019:
However I would like to copy everything, including whatever is inside the instances (2x EC2s and the RDS).
Edit 29/01/2019:
The purpose is to "scale/expand out". I want to have the same infrastructure replicated 1-to-1 (exactly/identically) across various regions.
It is simple!
- For EC2: create an AMI of those instances, then right-click the AMI you've just created and choose "Copy AMI" to copy it to the destination region.
For RDS
If you just want to copy the data to another region, take a snapshot and copy that snapshot to the destination region.
If you want the RDS data to replicate to another region continuously, create a cross-region read replica from your RDS instance (a CLI sketch of all three operations follows below).
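A hedged CLI sketch of those operations, assuming Mumbai (ap-south-1) as the source and Sydney (ap-southeast-2) as the target; every identifier below is a placeholder:
# Copy the EC2 AMI to Sydney
aws ec2 copy-image --source-region ap-south-1 --source-image-id ami-0123456789abcdef0 --name "app-server-sydney" --region ap-southeast-2
# One-off copy of an RDS snapshot to Sydney
aws rds copy-db-snapshot --source-db-snapshot-identifier arn:aws:rds:ap-south-1:123456789012:snapshot:mydb-snap --target-db-snapshot-identifier mydb-snap-sydney --region ap-southeast-2
# Or continuous replication via a cross-region read replica
aws rds create-db-instance-read-replica --db-instance-identifier mydb-sydney-replica --source-db-instance-identifier arn:aws:rds:ap-south-1:123456789012:db:mydb --region ap-southeast-2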
The option for replicating your environment depends on how much downtime you can tolerate.
If you are okay with downtime:
1. Copy the AMI of the EC2 instance and a snapshot of the RDS instance to the other regions.
2. Bring up your new environment.
This is fine for non-critical workloads.
If this is a critical application:
1. Copy the AMI of the EC2 instance (I am assuming these are your web/app instances). For real-time replication use rsync or robocopy, or a solution like CloudEndure.
2. Create a new RDS instance in Sydney.
3. Use the DMS migration tool to create a source-and-target relationship.
4. Once in sync, cut off the relationship and bring up the new environment in Sydney.
As suggested by previous answers for EC2 you can create AMIs and then move the AMI to a different region.
For RDS, you can create read replicas (and read replicas of read replicas, but beware of latency); read replicas are mainly used to improve the read performance of your app.
You can also use a Multi-AZ deployment, which acts as a disaster recovery standby. However, note that the Multi-AZ standby is only used in case of a failover. Moreover, Multi-AZ involves synchronous data copy while read replicas are asynchronous, so read replicas can exhibit eventual-consistency behavior.
But the real question here is - What are you trying to achieve?
Are you trying to "scale out" your infrastructure to support huge traffic to your application? Or are you simply trying to setup disaster recovery (DR)?
If your answer is DR, then the approach is pretty straightforward with Multi-AZ and EC2 instance snapshots. But if the answer is scaling out and performance, you really need to be thinking of better strategies, such as using CloudFront (CDN) if it is a web app, using ElastiCache in-memory caching for frequently read data, or RDS read replicas, and using Elastic Load Balancers with dynamic/step scale-out/scale-in. Other methods would be to evaluate the type of RDS storage subsystem used (Provisioned IOPS vs. General Purpose SSD), check whether there are any NAT "instance" bottlenecks in your VPC, and so on.
It may be tempting to spin up all these redundant copies of EC2 AMIs or RDS read replicas with a click of a button, but you really need to be thinking about the cost you are going to incur on a monthly basis for completely un-used resources.

AWS Auto-Scaling

I'm trying AWS auto-scaling for the first time. As far as I understand, it creates instances if, for example, my CPU utilization reaches a critical level that I define.
So I am curious: after I launch my instance I spend a fair amount of time configuring it and copying the data. If AWS auto-scales my instance, how will it configure the new instances and move the data to them?
You can't store any data that you want to keep on an instance that is part of an autoscaling group (well you can, but you will lose it).
There are (at least) two ways to answer your question:
Create a 'golden image', in other words spin-up an instance, configure it, install the software etc and then save it as an AMI (amazon machine image). Then tell the autoscaling group to use that AMI each time an instance starts - it will be pre-configured when it starts.
Put a script on the instance that tells the instance how to configure itself when it starts up (in the user data). So basically each time an instance scales up, it runs the script and does all the steps it needs to configure itself (a minimal sketch follows below).
As for your data, best practice would be to store any data you want to keep in a database or object store that is not on the instance - so something like RDS, DynamoDB or even S3 objects.
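For the user-data approach, here is a minimal bootstrap sketch; the packages, bucket, and paths are placeholders for whatever your application actually needs:
#!/bin/bash
# Runs on first boot of every instance the Auto Scaling group launches
yum install -y nginx awscli
# Pull application code/config from somewhere off-instance (placeholder bucket)
aws s3 cp s3://my-app-bucket/releases/current.tar.gz /tmp/app.tar.gz
tar -xzf /tmp/app.tar.gz -C /var/www/
systemctl enable nginx
systemctl start nginx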
You could also use AWS EFS, store your shared data/scripts there, and automatically mount it every time a new EC2 instance is created via /etc/fstab.
Once you have configured the EFS to be mounted on the EC2 Instance (/etc/fstab), you should create a new AMI, and use this new AMI to create a new Launch Configuration and AutoScaling Group, so that the new Instances automatically mount your EFS and are able to consume that shared data.
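A hedged sketch of that fstab-based mount, with a placeholder file system ID and region (the NFS options are the ones the EFS docs commonly suggest):
# /etc/fstab entry so the EFS share mounts on every boot
echo "fs-0123abcd.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev 0 0" >> /etc/fstab
mkdir -p /mnt/efs
mount -a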
https://aws.amazon.com/efs/faq/
Q. What use cases is Amazon EFS intended for?
Amazon EFS is designed to provide performance for a broad spectrum of workloads and applications, including Big Data and analytics, media processing workflows, content management, web serving, and home directories.
Q. When should I use Amazon EFS vs. Amazon Simple Storage Service (S3) vs. Amazon Elastic Block Store (EBS)?
Amazon Web Services (AWS) offers cloud storage services to support a wide range of storage workloads. Amazon EFS is a file storage service for use with Amazon EC2. Amazon EFS provides a file system interface, file system access semantics (such as strong consistency and file locking), and concurrently-accessible storage for up to thousands of Amazon EC2 instances. Amazon EBS is a block level storage service for use with Amazon EC2. Amazon EBS can deliver performance for workloads that require the lowest-latency access to data from a single EC2 instance. Amazon S3 is an object storage service. Amazon S3 makes data available through an Internet API that can be accessed anywhere.
https://docs.aws.amazon.com/efs/latest/ug/mount-fs-auto-mount-onreboot.html
You can use the file fstab to automatically mount your Amazon EFS file system whenever the Amazon EC2 instance it is mounted on reboots. There are two ways to set up automatic mounting. You can update the /etc/fstab file in your EC2 instance after you connect to the instance for the first time, or you can configure automatic mounting of your EFS file system when you create your EC2 instance.
I recommend using a shared data container if the data is updated and the updated data is needed by all instances that might spin up.
If it is database data, or you could store the needed data in a database, I would consider using RDS.
If it is static data only used to configure the instances, like dumps or configuration files which are not updated by running instances, then I would recommend pulling them from CloudFlare or S3 if it is not possible to pull them from a repository.
Good luck

Amazon EC2 spot instance fleet using AMI and common files

I would like to launch multiple Amazon EC2 spot instances (fleet?) using a custom AMI (docker?) for performing a deep-learning training task. I would like all the instances to share a common set of files for the purposes of training the model.
The idea here is to not lose training history, keeping a backup in EBS (or a network drive?) when a spot instance is terminated by AWS due to price/demand. The task state can be updated in a file and then resumed when instances are available again.
Is it possible to launch all instances and let them work cooperatively to complete the training task? What kind of a setup could accomplish this?
Firstly, you might be interested in the Deep Learning AMI from the AWS Marketplace, which comes fully-configured with popular Deep Learning tools.
If the software you are using wishes to save its data to a local file system (as opposed to Amazon S3), then you could use Amazon Elastic File System (EFS) to share a file system amongst multiple Amazon EC2 instances (including Spot instances). Amazon EFS is similar to a NAS and can be used simultaneously across multiple instances.
The EFS volume could be mounted via a User Data script, together with a setup script to load and run your desired application (which can be easier than making a new AMI).
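A hedged sketch of such a User Data script; the EFS ID, region, paths, and training command are all placeholders:
#!/bin/bash
# Mount the shared EFS volume that holds checkpoints
mkdir -p /mnt/training
mount -t nfs4 -o nfsvers=4.1 fs-0123abcd.efs.us-east-1.amazonaws.com:/ /mnt/training
# Resume training from the latest checkpoint if one exists, otherwise start fresh
cd /opt/my-training-code
if ls /mnt/training/checkpoints/*.ckpt >/dev/null 2>&1; then
    python train.py --resume-from /mnt/training/checkpoints/
else
    python train.py --checkpoint-dir /mnt/training/checkpoints/
fi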

How to load EBS Volume by ID via .ebextensions

I'm trying to mount the same volume for a Beanstalk build but can't figure out how to make it work with the volume-id.
I can attach a new volume, and I can attach one based on a snapshot ID but neither are what I'm after.
My current .ebextension
commands:
  01umount:
    command: "umount /dev/sdh"
    ignoreErrors: true
  02mkfs:
    command: "mkfs -t ext3 /dev/sdh"
  03mkdir:
    command: "mkdir -p /media/volume1"
    ignoreErrors: true
  04mount:
    command: "mount /dev/sdh /media/volume1"

option_settings:
  - namespace: aws:autoscaling:launchconfiguration
    option_name: BlockDeviceMappings
    value: /dev/sdh=:20
Which of course will mount a new volume, not attach an existing one. Perhaps snapshot is what I want and I just don't understand the terminology here?
I need the same data that was on the volume to be on each EC2 instance that comes up when autoscaling kicks in... A snapshot would surely just be the data that existed at the point the snapshot was created?
Any ideas or better approaches?
Elastic Block Store (EBS) allows you to create, snapshot/clone, and destroy virtual hard drives for EC2 instances. These drives ("volumes") can be attached to and detached from EC2 instances, but they are not a "share" or shared volume... so attaching a volume by ID becomes a non-useful idea after the first instance launched.
EBS volumes are hard drives. The analogy is imprecise (because they're on a SAN) but much the same way as you can't physically install the same hard drive in multiple servers, you can't attach an EBS volume to multiple instances (SAN != NAS).
Designing with a cloud mindset, all of your fixed resources would actually be on the snapshot (disk image) you deploy when you release a new version and then use to spawn each fresh auto-scaled instance... and nothing persistent would be stored there because -- just as important as scaling up, is scaling down. Autoscaled instances go away when not needed.
AWS has Simple Storage Service (S3) which is commonly used for storing things like documents, avatars, images, videos, and other resources that need to be accessible in a distributed environment. It is not a filesystem, and can't properly be compared to a filesystem, because it's an object store... but is a highly scalable and highly available storage service that is well-suited to distributed applications. s3fs allows an S3 "bucket" to be mounted into your machine's filesystem, but this is no panacea. That mechanism should be reserved for back-end process use, if you use it at all, because it's not appropriate for resources like code or templates, and will not perform as well for serving up content as S3 will perform if used as designed, with clients directly accessing it over https. You can secure the content through more than one mechanism, as documented.
AWS also now has Elastic File System (EFS), which sets up an array of storage that you can mount from all of your machines using NFS. AWS provides the NFS server and the back-end storage. Unlike EBS, you do not need to know how much storage to provision up front, because it scales up and down based on what you've stored, billing you for what you use. This service is still in "preview" as of this writing, so should not be used for production data.
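As a rough sketch, the mount commands you would drop into user data (or an .ebextensions command block) look something like this; the file system ID and region are placeholders, and nfs-utils is assumed to be available from your package repository:
yum install -y nfs-utils
mkdir -p /media/volume1
# Mount the shared EFS/NFS file system at the path the application expects
mount -t nfs4 -o nfsvers=4.1 fs-0123abcd.efs.us-east-1.amazonaws.com:/ /media/volume1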
Or, you can manually configure your own NFS server and mount it from the autoscaling machines. Making such a setup fail-safe is a bit tricky, though.