We have been running a production-grade system and want to start a secondary namenode in AWS EMR automatically.
Below is the output of jps, which shows that the secondary namenode daemon is not running:
[root@ip-10-2-23-23 ~]# jps
6241 Bootstrap
7041 ResourceManager
10754 RunJar
6818 WebAppProxyServer
10787 SparkSubmit
7619 JobHistoryServer
6922 ApplicationHistoryServer
3661 Main
4877 Main
6318 NameNode
8943 LivyServer
4499 Jps
5908 Bootstrap
4791 Main
10619 StatePusher
9918 HistoryServer
The secondary namenode is required for namenode checkpointing and for regularly creating the fsImage. I have not configured any HA for the namenode.
The command we ran manually to create the FsImage is:
hdfs secondarynamenode -checkpoint
How can a secondary namenode be started in AWS EMR, or is there a configuration for this?
Hadoop version: Hadoop 2.8.3-amzn-0
AWS EMR doesn't run a secondary Namenode process, so the FsImage won't be created automatically. Running a cron job every hour to create a FsImage solves the problem of excessive disk usage, because checkpointing merges the accumulated Namenode metadata (edit logs) into a new FsImage of smaller overall size. FsImage creation is a costly operation for the Namenode and uses instance resources; if too many edits are pending for merging, the Namenode may never recover from this tedious process, so it is better to create the FsImage frequently via cron. In a standard Hadoop system this job is done by running a secondary Namenode on a separate instance, but EMR has no concept of two masters, so the master node is always a single point of failure.
hdfs secondarynamenode -checkpoint
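A minimal crontab sketch of that hourly checkpoint job; the log path, and the assumption that the entry is installed on the master node in the crontab of a user with HDFS superuser rights, are illustrative only:
# Sketch only: run a manual checkpoint at the top of every hour on the master node.
# /var/log/hdfs-checkpoint.log is an assumed, writable log location.
0 * * * * hdfs secondarynamenode -checkpoint >> /var/log/hdfs-checkpoint.log 2>&1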
Another solution to this problem is running EMR with a custom Hadoop distribution such as MapR.
Related
Master node - does this node store HDFS data in an AWS EMR cluster?
Task node - if this node does not store HDFS data, is it a purely computational node? In that case, does Hadoop transfer data to the task node? Does this not defeat the data-locality advantage of the computation?
(Other than the edge case of a master-only cluster with no core or task instances...)
The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.
The only nodes that store data are those that run HDFS DataNode, which are only the core instances.
The core and task instances both run YARN NodeManager and thus are the "computational nodes".
Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether or not Hadoop transfers (HDFS) data to the task instances so that they may perform computations on HDFS data. In a sense, yes, task instances may read HDFS blocks remotely from core instances where the blocks are stored.
It's true that this means task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as tasks that are reading shuffle data from other nodes, or tasks that are reading data from remote storage anyway (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be stored on remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, the overhead might not even be noticeable, to the point that you don't need to worry about data locality.
According to AWS ElastiCache documentation, a cluster loses all its data upon reboot: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/Clusters.Rebooting.html
When you reboot a cluster, the cluster flushes all its data and
restarts its engine. During this process you cannot access the
cluster. Because the cluster flushed all its data, when the cluster is
available again, you are starting with an empty cluster.
The automatic daily backups can only be used to seed a new cluster, i.e., to get a warm-started cluster:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/backups-automatic.html
The AOF file option has been disabled for Redis 2.8+ in AWS ElastiCache: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/RedisAOF.html
How do we make AWS ElastiCache use the most recent daily backup data after a hardware failure or restart?
I was looking for the same thing, but as of now it is not possible. A backup can only be used to build a new cluster.
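For reference, a hedged sketch of building that new cluster from an existing backup with the AWS CLI; the snapshot name, cluster id, node type and engine version below are placeholder assumptions, not values from the question:
# Sketch only: create a new single-node Redis cluster seeded from an existing snapshot.
aws elasticache create-cache-cluster \
  --cache-cluster-id my-restored-cluster \
  --snapshot-name my-daily-backup \
  --engine redis \
  --cache-node-type cache.t2.micro \
  --num-cache-nodes 1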
We're using Amazon EMR for our Oozie workflows, which contain Spark jobs. In our cluster we have 1 master node and 2 core nodes, and we use a third-party tool to provision task nodes as spot instances.
Autoscaling is set up for the task nodes based on YARN memory usage. We have configured the Application Master (AM) to launch only on core nodes, since task nodes are spot instances that could go down at any time.
The problem is that running jobs sometimes occupy the core nodes' memory completely (AM + task executors), which leaves other jobs stuck in the ACCEPTED state, waiting for a core node to free up before their AM can launch.
I'd like to know whether it is possible to restrict only the AM to core nodes and the executors to task nodes. That way we would be able to run multiple jobs in parallel.
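For context, a hedged sketch of the kind of AM placement restriction described above, using Spark's YARN node-label property and assuming the CORE node label that recent EMR releases apply to core instances by default; the submission flags and job name are illustrative assumptions:
# Sketch only: restrict the Spark ApplicationMaster to nodes carrying the CORE YARN node label.
# (An analogous spark.yarn.executor.nodeLabelExpression property exists for executor placement,
# which would require a separate label to be assigned to the task nodes.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.nodeLabelExpression=CORE \
  my_spark_job.py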
Does Athena have a gigantic cluster of machines ready to take queries from users and run them against their data? Are they using specific open-source cluster-management software for this?
I believe AWS will never disclose how they operate the Athena service. However, since Athena is managed PrestoDB, the overall design can be deduced from that.
PrestoDB does not require a cluster manager like YARN or Mesos. It has its own planner and scheduler that can run the SQL physical plan on worker nodes.
I assume that within each availability zone, AWS maintains a PrestoDB coordinator connected to the data catalog (AWS Glue) and a set of Presto workers. The workers are elastic and autoscaled: during inactivity they are scaled down, and when a burst of activity occurs, new workers are added to the cluster.
I am reading S3 buckets with Drill and writing the data back to S3 as Parquet in order to read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least 2 core machines.
Will using micro instances for the master and core nodes affect performance?
I don't really make use of HDFS as such, so I am thinking of making them micro instances to save money.
All computation will be done in memory by R3.xlarge spot instances acting as task nodes anyway.
And finally, does Spark utilise multiple cores in each machine? Or is it better to launch a fleet of R3.xlarge task nodes on EMR 4.1 so they can be auto-resized?
I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage:
You can set the number of cores to use for the driver process, only in cluster mode. It's 1 by default.
You can also set the number of cores to use on each executor. For YARN and standalone mode only. It's 1 in YARN mode, and all the available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
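As a concrete illustration, a hedged sketch of setting both values when submitting to YARN; the core counts, executor count and the my_job.py script name are placeholder assumptions:
# Sketch only: 2 cores for the driver (cluster mode), 4 cores per executor, 6 executors.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 2 \
  --executor-cores 4 \
  --num-executors 6 \
  my_job.py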
Now, to answer both of your questions:
Will using micro instances for the master and core nodes affect performance?
Yes. The driver needs a minimum amount of resources to schedule jobs, sometimes collect data, etc. Performance-wise, you'll need to benchmark your use case to see what suits your workload best, which you can do using Ganglia on AWS, for example.
Does Spark utilise multiple cores in each machine?
Yes, Spark uses multiple cores on each machine.
You can also read this question concerning which instance type is preferred for an AWS EMR cluster running Spark.
Spark support on AWS is fairly new, but it's usually close to any other Spark cluster setup.
I advise you to read the "Plan EMR Instances" chapter of the AWS EMR developer guide along with the official Spark documentation.