EBS storage for EMR - amazon-web-services

Can someone please clarify what the use case is for having an EBS volume in an EMR cluster (a transient / on-demand cluster)?
What are the benefits of using an EBS volume with EMR, given that the EBS volume will be deleted along with the cluster when it terminates?
I am planning to set up an EMR cluster to run Spark-based ETL jobs and am looking for some input. I can go with EMRFS/S3, but I'm wondering why EBS exists in EMR at all.
Thanks.

Some EC2 instance types supported by EMR have no local (instance-store) storage and support only EBS (e.g., the C4 and M4 families). Those instances require EBS in order to be used with EMR, and a default 10 GB volume is attached to each instance unless you specify a larger one.
Of course, EBS can also be used with instance types that do include local storage, if you need more storage than the instance provides.
For more information about EMR and EBS, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html
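If it helps, here's a minimal boto3 sketch of attaching extra EBS volumes to an EBS-only instance type when launching a transient cluster. The instance types, volume sizes, release label and IAM role names are placeholders (the roles assume the default EMR roles already exist in your account), so adjust them for your setup:

    # Sketch: transient EMR cluster whose core nodes carry extra EBS volumes
    # for HDFS/shuffle space. All names and sizes below are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-etl-transient",
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m4.xlarge",  # EBS-only type: needs EBS for local storage
                    "InstanceCount": 2,
                    "EbsConfiguration": {
                        "EbsBlockDeviceConfigs": [
                            {
                                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                                "VolumesPerInstance": 1,
                            }
                        ],
                    },
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate when steps finish
        },
        JobFlowRole="EMR_EC2_DefaultRole",   # assumes default EMR roles exist
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

The EBS volumes launched this way live and die with the cluster, which is why durable data still belongs in S3/EMRFS.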

Related

Persist heap dump in case of OOM in kubernetes pod?

I need to persist the heap dump when the Java process hits an OutOfMemoryError and the pod is restarted.
I have the following added to the JVM args:
-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/dumps
...and an emptyDir volume is mounted on the same path.
The issue is that if the pod gets restarted and is scheduled on a different node, we lose the heap dump. How do I persist the heap dump even if the pod is scheduled to a different node?
We are using AWS EKS and run more than one replica of the pod.
Could anyone help with this, please?
You will have to persist the heap dumps in a network location shared between the pods. To achieve this you will need persistent volume claims, and in EKS that can be backed by an Elastic File System, which can be mounted from different availability zones. You can start learning about it by reading this guide about EFS-based PVCs.
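As a rough sketch (using the Python kubernetes client), the claim could look something like this; the StorageClass name, size and namespace are placeholders and assume the EFS CSI driver is already installed in the cluster:

    # Sketch: a ReadWriteMany claim backed by an EFS StorageClass, so every
    # replica can mount /opt/dumps no matter which node it lands on.
    # "efs-sc", "heap-dumps" and the namespace are placeholder names.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running inside the cluster

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="heap-dumps"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],      # EFS allows many nodes at once
            storage_class_name="efs-sc",
            resources=client.V1ResourceRequirements(requests={"storage": "5Gi"}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )

Mount that claim at /opt/dumps in each replica and the dumps survive the pod being rescheduled to another node.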
If writing to EFS is too slow for your case, there is another option on AWS EKS: awsElasticBlockStore.
The contents of an EBS volume are persisted and the volume is unmounted when a pod is removed. This means that an EBS volume can be pre-populated with data, and that data can be shared between pods.
Note: You must create an EBS volume by using aws ec2 create-volume or the AWS API before you can use it.
There are some restrictions when using an awsElasticBlockStore volume:
the nodes on which pods are running must be AWS EC2 instances
those instances need to be in the same region and availability zone as the EBS volume
EBS only supports a single EC2 instance mounting a volume
Check the official k8s documentation page on this topic, please, and also How to use persistent storage in EKS.

How to synchronize data between 2 EBS volumes in AWS?

I have 2 EBS volumes in 2 availability zones in the same region; one is the primary and the other is a backup. Generally, I just read and write data on the primary volume. Is it possible to synchronize data from the primary to the backup EBS volume? If so, how can I do that?
Thanks
A year after this question was posted, but I hope this helps anyone looking into it.
Amazon EFS is a great solution. An alternative for what you require is using snapshots. With AWS Backup you can schedule Amazon EBS snapshots and have them copied across AZs or even shared with different accounts.
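For the snapshot route, a minimal boto3 sketch (volume IDs and AZ names are made up) would look something like this:

    # Sketch: snapshot the primary volume, then build a fresh volume in the
    # backup AZ from that snapshot. IDs and AZ names are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # 1. Snapshot the primary volume (snapshots are regional, not tied to an AZ).
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="periodic copy of the primary data volume",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # 2. Create (or recreate) the backup volume in the other AZ from the snapshot.
    backup = ec2.create_volume(
        SnapshotId=snap["SnapshotId"],
        AvailabilityZone="us-east-1b",
        VolumeType="gp2",
    )
    print(backup["VolumeId"])

AWS Backup essentially schedules and manages this kind of snapshot flow for you.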
As very well proposed in the previous answer, you should first try to understand your performance requirements for the workload and also the RPO and RTO requirements.
Comparing EFS and EBS, I could say that:
A. EFS (Elastic File System) is a managed parallel NFS (based on NFSv4). You mount it as a directory. EFS leverages the same underlying technology as EBS, and the disks are replicated within an AZ and also across AZs. You don't choose or control the disks, just the performance you expect from the managed service.
B. EBS (Elastic Block Store) is also network-attached, but it is block storage, which means your OS sees it as a disk and not a directory. You have to format it with a file system (or group it with other EBS volumes into LVM, RAID, etc.) before you can use it. EBS volumes are replicated within the same AZ but not across AZs. You can take snapshots of your EBS volumes and copy them to the other AZ, for example.
So you have to take into account not only the performance you require but also what type of storage (block or file) your application needs.
Can you use EFS for this? You might be able to avoid replicating the data at all if the primary and backup instances/applications can look at the same data volume.

Architect a Cloudera CDH cluster on AWS: instances and storage

I have some doubts about a deployment of CDH on AWS. I have read the reference architecture doc and other material I found on the Cloudera Engineering Blog, but I need some more suggestions.
1) Is CDH deployment available only for certain kinds of instances, or can I deploy it on any AWS instance type?
2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster I understand it's better to base the cluster on local-storage instances. For a cluster of 2 PB, I think d2.8xlarge should be the best choice for the data nodes.
About the master nodes: if I want to deploy only 3 master nodes, is it better to have them as local-storage instances too, or as EBS-backed instances so I can react quickly to a possible master node failure? Are there best practices for the master node instance type (EBS or local storage)?
About the data nodes: if a data node fails, does CDH have an automated mechanism to spin up a new instance and join it to the cluster so the cluster is restored without downtime, or do we have to build a script from scratch for this?
About the edge nodes: are there best practices for the instance type (EBS or local storage)?
3) If I want to back up the cluster to S3: when I distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3? If the data is compressed (e.g. Snappy, gzip, etc.) and I distcp it to S3, is the space occupied on S3 the same, or does the distcp command decompress the data during the copy?
If I have a cluster based on EBS-backed instances: is it possible to snapshot the disks and re-attach a data node with the EBS disks rebuilt from the snapshot?
4) If the data nodes are deployed as r4.8xlarge and I need more horsepower, is it possible to scale the cluster up from r4.8xlarge to r4.16xlarge on the fly, attaching and detaching the disks in a few minutes?
Thanks a lot for the clarifications; I hope my doubts will also help other users.
1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro size instance would work for much of anything. A type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
2) You should stick with EBS for the root volume of instance types. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time; it doesn't look like writing directly to Glacier is possible through the s3a connector. I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3 for sure. You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance; this isn't necessarily a great way to back up data nodes compared with the distcp route, because each data node is unique and its data keeps changing while the cluster runs.
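As an illustration of the lifecycle-rule option, here's a rough boto3 sketch; the bucket name, prefix and transition period are placeholders:

    # Sketch: lifecycle rule that moves distcp'd backups to Glacier after 30 days.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-cdh-backups",                       # placeholder bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-distcp-output",
                    "Filter": {"Prefix": "hdfs-backup/"},  # placeholder prefix
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )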
4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks. You do have to stop an instance to resize it.
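For point 4, the stop / modify / start cycle looks roughly like this with boto3 (instance ID and target type are placeholders):

    # Sketch: resizing an EBS-backed instance in place. The attached EBS
    # volumes stay attached across the resize; only the instance type changes.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"   # placeholder

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": "r4.16xlarge"}
    )

    ec2.start_instances(InstanceIds=[instance_id])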
Point 3 only:
You need to distcp to S3 and move the data to Glacier via the AWS settings.
It doesn't do anything to the data: compression, etc. are left untouched.
See the Hortonworks doc "Distcp and S3" and read its warnings/caveats. In particular, incremental distcp isn't checksum-based, and atomic distcp isn't actually atomic; it's just "really slow distcp".

Working with ECS container instance without the EBS

I am using the free tier of AWS. I am experimenting with ECS and am following the article http://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html to create an ECS instance. One thing I noticed is that using the community image amzn-ami-2016.03.e-amazon-ecs-optimized adds an EBS volume, which cuts into my free tier usage. My question is: is this EBS volume required, and can I do without it?
Any EC2 instance needs a root volume at the very least to start the OS, and the ECS-optimized AMI is EBS-backed, so its root volume is an EBS volume. If you were wondering whether you can run this kind of EC2 instance without EBS, I don't think that is possible.
However, you can still reduce your EBS cost. An EBS volume costs 10 cents per GB per month. If you look, all Amazon ECS-optimized EC2 instances are configured to use 30 GB of EBS storage, which means you pay an extra $3.00 per EC2 instance per month. 8 GB of that 30 GB is for the root volume, and 22 GB is for Docker use.
Source:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-storage-config.html
By default, the Amazon ECS-optimized Amazon Linux AMI ships with 30 GiB of total storage. You can modify this value at launch time to increase or decrease the available storage on your container instance. This storage is used for the operating system and for Docker images and metadata. The sections below describe the storage configuration of the Amazon ECS-optimized Amazon Linux AMI, based on the AMI version.
Of course, you don't need the full 8 GB for the root volume and the full 22 GB for Docker. So you can lower your cost by reducing the size of those volumes, say to 2 GB for root and 2 GB for Docker use. Then you would be paying 40 cents per month instead of $3.00.
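To give an idea, here's a rough boto3 sketch of launching the instance with a smaller Docker volume. The AMI ID is a placeholder, the device names are the ones the ECS-optimized Amazon Linux AMI used at the time (double-check them for your AMI version), and as far as I know the root volume can't be launched smaller than the AMI's snapshot, so it stays at 8 GB here:

    # Sketch: override the default block device mappings at launch time.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",            # placeholder for the ECS-optimized AMI ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[
            # Root volume: kept at the AMI snapshot size.
            {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 8, "VolumeType": "gp2"}},
            # Docker storage volume: shrunk from the 22 GB default.
            {"DeviceName": "/dev/xvdcz", "Ebs": {"VolumeSize": 5, "VolumeType": "gp2"}},
        ],
    )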
Reducing the size of an existing volume, as far as I know, is not easy.
Since that is out of scope for this question, I will just provide this link for interested parties.
Now that you are aware there are 2 volumes used by ECS-optimized instances, there is a way to NOT use the 22 GB volume at all and simply use the root volume for Docker storage. This too is not easy, but it can be done by creating your own AMI with Docker and the ECS agent installed. Then you will have to configure Docker to use the root volume instead of the other one. Here is a thread which briefly discussed this issue.
There is no additional charge for Amazon EC2 Container Service itself; you pay for the AWS resources (e.g. EC2 instances or EBS volumes) you create to store and run your application. As for the AWS free tier (https://aws.amazon.com/free/), only Amazon EC2 Container Registry is included, and it offers 500 MB of storage.
Also, if you are creating ECS container instances from the amzn-ami-2016.03.e-amazon-ecs-optimized AMI, the volumes will be EBS, so you will have to pay for the EBS volumes.

How to stop EMR Cluster without terminating it?

I know it is possible to stop individual EC2 instances, but what about the EMR cluster?
If I stop all the EC2 instances comprising the EMR cluster, would I still be billed?
At this time there is not a way to STOP an EMR cluster in the same sense you can with EC2 instances. EMR clusters use instance-store volumes, and the EC2 start/stop feature relies on the use of EBS volumes, which are not appropriate for high-performance, low-latency HDFS use.
The best way to simulate this behavior is to store the data in S3, ingest it as a startup step of the cluster, and then save the results back to S3 when done.
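As a rough boto3 sketch of that pattern (cluster ID, bucket and paths are placeholders), you would bracket your ETL steps with s3-dist-cp copies:

    # Sketch: restore HDFS from S3 when the cluster comes up, and save the
    # results back to S3 before it terminates. Your ETL steps go in between.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
        Steps=[
            {
                "Name": "restore HDFS from S3",
                "ActionOnFailure": "CANCEL_AND_WAIT",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["s3-dist-cp", "--src", "s3://my-bucket/warehouse/", "--dest", "hdfs:///data/"],
                },
            },
            # ... your Spark/ETL steps would be added here ...
            {
                "Name": "save HDFS back to S3",
                "ActionOnFailure": "CANCEL_AND_WAIT",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["s3-dist-cp", "--src", "hdfs:///data/", "--dest", "s3://my-bucket/warehouse/"],
                },
            },
        ],
    )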