AWS EMR: Does the master node store HDFS data in an EMR cluster?

Master node - does this node store HDFS data in an AWS EMR cluster?
Task node - if this node does not store HDFS data, is it a purely computational node? In that case, does Hadoop transfer data to the task node? Does this not defeat the advantage of data-local computation?

(Other than the edge case of a master-only cluster with no core or task instances...)
The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.
The only nodes that store data are those that run HDFS DataNode, which are only the core instances.
The core and task instances both run YARN NodeManager and thus are the "computational nodes".
Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether or not Hadoop transfers (HDFS) data to the task instances so that they may perform computations on HDFS data. In a sense, yes, task instances may read HDFS blocks remotely from core instances where the blocks are stored.
It's true that this means that task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as tasks that read shuffle data from other nodes, or tasks that read data from remote storage anyway (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be stored on remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, the difference might not be noticeable to the point that you need to worry about data locality.
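If you want to verify which role each group of instances plays in a given cluster, a minimal sketch along these lines (assuming boto3 is configured with credentials and a region; the cluster ID is a placeholder) lists the instance groups, and only the CORE group's instances run HDFS DataNodes:

    import boto3

    # Placeholder cluster ID, shown for illustration only.
    emr = boto3.client("emr")
    cluster_id = "j-XXXXXXXXXXXXX"

    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    for g in groups:
        # InstanceGroupType is MASTER, CORE, or TASK; only CORE instances run
        # the HDFS DataNode daemon, while CORE and TASK both run the YARN NodeManager.
        print(g["InstanceGroupType"], g["InstanceType"], g["RequestedInstanceCount"])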

Related

Best practice to read data from EMR to physical server

I am using pyspark to read data from EMR. If the EMR cluster is fully occupied (I can see on the cluster manager that all the memory is taken by some ETL job), can I still run a script on my physical server that brings data from the EMR cluster to that server?
What is the suggested best practice?
Will it take the same amount of time to read the data from EMR to the physical server? How is the request handled if the data is read while the EMR cluster is fully occupied?
What kind of processes are executed on EMR (S3 bucket) while accessing/reading data from the physical server through an S3 utility?
Can I pull data to the physical server when the EMR cluster is fully occupied? If not, why?
Thanks and Regards
The best practice is to expand your cluster and do things consistently. I strongly suggest that you do not develop out-of-band/one-off processes that move things outside of the normal flow of work.

AWS EMR: do I need core nodes if I am planning to use EMRFS?

I am going through the online documentation and I found the following difference between core node and task node:
A core node has HDFS while a task node does not.
Because of this, AWS suggests it's not a good idea to scale core nodes based on load, as HDFS re-balancing could take time; you should scale task nodes only.
However, if I am planning to use EMRFS, do I need core nodes? What is the use of HDFS in this case if I am planning to access data from S3?
You need at least 1 Core Node.
If you want to use s3-dist-cp after a job finishes writing to local HDFS, then you need more such core nodes.
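For illustration, a rough sketch of that pattern with boto3 (the cluster ID, HDFS path, and S3 bucket are hypothetical) adds an s3-dist-cp step that copies job output from local HDFS to S3 (EMRFS) after the job finishes:

    import boto3

    # Hypothetical cluster ID, HDFS path, and bucket; shown only to illustrate
    # following an HDFS-writing job with an s3-dist-cp copy to S3.
    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "Copy HDFS output to S3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "hdfs:///output/",
                    "--dest", "s3://my-example-bucket/output/",
                ],
            },
        }],
    )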

Architect a Cloudera CDH cluster on AWS: instances and storage

I have some doubts about a deployment of CDH on AWS. I read the reference architecture doc and other material I found on the Cloudera Engineering Blog, but I need some more suggestions.
1) Is CDH deployment available only for certain kinds of instances, or can I deploy it on all AWS instance types?
2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster I understood it's better to base the cluster on local-storage instances. If we consider a cluster of 2 PB, I think that d2.8xlarge should be the best choice for the data nodes.
About the master nodes:
- if I want to deploy only 3 master nodes, is it better to have them as local-storage instances too, or as EBS-attached instances so that I can react quickly to a possible master node failure?
- are there any best practices about the master node instance type (EBS or local-storage)?
About the data nodes:
- if a data node fails, does CDH have an automated mechanism to spin up a new instance and connect it to the cluster in order to restore the cluster without downtime, or do we have to build a script from scratch for this?
About the edge nodes:
- are there any best practices about the instance type (EBS or local-storage)?
3) If I want to back up the cluster to S3:
- when I do a distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?
- if compression is applied to the data (e.g. snappy, gzip, etc.) and I do a distcp to S3, is the space occupied on S3 the same, or does the distcp command decompress the data for the copy?
If I have a cluster based on EBS-attached instances:
- is it possible to snapshot the disks and re-attach a data node with the EBS disks rebuilt from the snapshot?
4) If I have the data nodes deployed as r4.8xlarge and I need more horsepower, is it possible to scale the cluster up from r4.8xlarge to r4.16xlarge on the fly, attaching and detaching the disks in a few minutes?
Thanks a lot for the clarifications; I hope my questions will help other users too.
1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro size instance would work for much of anything. A type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
2) You should stick with EBS for the root volume of instance types. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time, but I don't think you can just write directly to Glacier; it doesn't look like direct Glacier access can be done through the s3a connector. I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3 for sure. You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance; this isn't necessarily a great way to back up data nodes versus the distcp route, because each data node is unique and has changing data as the cluster runs.
4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks. You do have to stop an instance to resize it.
Point 3 only:
You need to distcp to S3 and move the data to Glacier via the AWS lifecycle settings.
Distcp doesn't do anything to the data: compression, etc. is preserved as-is.
See the Hortonworks doc "Distcp and S3" and read its warnings/caveats. In particular, incremental distcp isn't checksum-based, and "atomic" distcp isn't actually atomic; it's just really slow distcp.
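As a rough sketch of the "AWS settings" side of that, a lifecycle rule like the following (bucket name and prefix are hypothetical) transitions objects under a backup prefix to Glacier 30 days after they are written:

    import boto3

    # Hypothetical bucket and prefix; objects written by distcp under this
    # prefix are transitioned to the GLACIER storage class after 30 days.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-backup-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-distcp-backups",
                "Filter": {"Prefix": "cdh-backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }]
        },
    )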

EMR 4.2 Spot Prices on Core Nodes

Now that EMR supports downsizing of core nodes, suppose I create an EMR cluster with one of the core nodes as a spot instance. What happens when the spot price exceeds the bid price for my core node? Will it gracefully decommission that core node?
Here is Amazon's description of the process of shrinking the number of core nodes:
On core nodes, both YARN NodeManager and HDFS DataNode daemons must be decommissioned in order for the instance group to shrink. For YARN, graceful shrink ensures that a node marked for decommissioning is only transitioned to the DECOMMISSIONED state if there are no pending or incomplete containers or applications. The decommissioning finishes immediately if there are no running containers on the node at the beginning of decommissioning.
For HDFS, graceful shrink ensures that the target capacity of HDFS is large enough to fit all existing blocks. If the target capacity is not large enough, only a partial amount of core instances are decommissioned such that the remaining nodes can handle the current data residing in HDFS. You should ensure additional HDFS capacity to allow further decommissioning. You should also try to minimize write I/O before attempting to shrink instance groups, as that may delay the completion of the resize operation.
Another limit is the default replication factor, dfs.replication, inside /etc/hadoop/conf/hdfs-site. Amazon EMR configures the value based on the number of instances in the cluster: 1 with 1-3 instances, 2 for clusters with 4-9 instances, and 3 for clusters with 10+ instances. Graceful shrink does not allow you to shrink core nodes below the HDFS replication factor; this is to prevent HDFS from being unable to close files due to insufficient replicas. To circumvent this limit, you must lower the replication factor and restart the NameNode daemon.
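For what it's worth, the resize that triggers this graceful shrink can be requested programmatically; a minimal boto3 sketch (the cluster ID is a placeholder) that shrinks the core group by one instance might look like this:

    import boto3

    emr = boto3.client("emr")
    cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID

    # Find the CORE instance group and request one fewer instance; EMR then
    # performs the graceful YARN/HDFS decommissioning described above
    # before releasing an instance.
    core = next(
        g for g in emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
        if g["InstanceGroupType"] == "CORE"
    )
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": core["Id"],
            "InstanceCount": core["RunningInstanceCount"] - 1,
        }],
    )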
I think it might not be possible to gracefully decommission the node in the case of a spot price spike (the general case with N core nodes). There is a 2-minute notification available before the spot instance is removed due to a price spike. Even if captured, this time period might not be sufficient to guarantee decommissioning of the HDFS data.
Also, with only 1 core node in the cluster, decommissioning does not make much sense. The data held in the cluster needs to be moved to other nodes, which are not available in this case. Once the only available core node is lost, there needs to be a way to bring one back, else the cluster cannot run any tasks.
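If you still want to react to that two-minute warning yourself, one rough sketch (assuming it runs on the spot instance itself and that IMDSv1 is enabled; the path is the standard EC2 spot instance-action metadata endpoint) polls for the termination notice and then triggers whatever best-effort draining you can:

    import time
    import urllib.request

    # Polls the EC2 instance metadata service from the spot instance itself.
    # The spot/instance-action path returns 404 until roughly two minutes
    # before the instance is reclaimed, at which point it returns a JSON body.
    # Assumes IMDSv1; IMDSv2 would additionally require a session token.
    NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def termination_imminent():
        try:
            with urllib.request.urlopen(NOTICE_URL, timeout=1) as resp:
                return resp.status == 200
        except OSError:  # 404 (no notice yet) or network errors
            return False

    while not termination_imminent():
        time.sleep(5)
    # Only about two minutes remain; kick off whatever best-effort
    # draining or checkpointing is possible.
    print("Spot termination notice received")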
Shameless plug :) : I work for Qubole!
The following 2 blog posts might be useful around integration of Spot instances with Hadoop clusters including dealing with Spot price spikes.
https://www.qubole.com/blog/product/riding-the-spotted-elephant
https://www.qubole.com/blog/product/rebalancing-hadoop-higher-spot-utilization

Is it possible to use Auto Scaling with Elastic MapReduce?

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if this instance reaches 50% utilization, I want the Auto Scaling group I created to start a new instance. Is this possible?
Do you know if this is possible? Or, because Elastic MapReduce is "elastic", does it automatically start more instances when it needs them, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful. Nodes hold HDFS data and intermediate outputs. Deleting nodes based on cpu/memory just doesn't work. Adding nodes needs sophistication - this isn't a web site. One needs to look at the sizes of jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
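As an illustration of that kind of metric, a minimal sketch (assuming the YARN ResourceManager REST API on the master node is reachable; the hostname below is a placeholder) reads the number of pending applications, which could feed a manual resize decision:

    import json
    import urllib.request

    # Placeholder master-node hostname; the ResourceManager REST API
    # normally listens on port 8088.
    RM_METRICS_URL = "http://ec2-master-node.example.com:8088/ws/v1/cluster/metrics"

    with urllib.request.urlopen(RM_METRICS_URL, timeout=5) as resp:
        metrics = json.load(resp)["clusterMetrics"]

    # appsPending is the number of applications waiting for resources;
    # a sustained backlog here is a better scale-out signal than CPU.
    print("Pending applications:", metrics["appsPending"])
    print("Running applications:", metrics["appsRunning"])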
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?