How can I patch my Amazon EMR cluster with security updates?

I have an Amazon EMR cluster with 3 nodes (1 master and 2 core) running the Amazon EMR release 5.31.0 AMI. I want to apply critical and important security patches to these nodes, as we would for normal EC2 instances. Can we do this?
Since EMR runs on EC2 instances in the background and the base OS of EMR releases is Amazon Linux, I think we could patch the nodes either by connecting to the instances over SSH and running yum commands, or by using Patch Manager. Is it OK to do it this way? Is it recommended?
But when I searched for this, I found the following article:
https://aws.amazon.com/blogs/big-data/create-custom-amis-and-push-updates-to-a-running-amazon-emr-cluster-using-amazon-ec2-systems-manager/
which recommends using custom AMIs. This seems like a comparatively long and difficult process just to patch an EMR cluster. Is this the only correct way, or are there other options?
Some people suggest cloning the cluster and using EMR release 6.x for the new cluster. Is that a reasonable approach?
Can someone please help me with this?

Related

Hardening AWS EC2 Instances

I launched an AWS ECS cluster with 4 EC2 instances using the ECS-optimized AMI 2 years ago. The system has been working fine, but for system hardening compliance I need to update the EC2 instances in my ECS cluster to the latest ECS-optimized AMI.
I can take the latest AMI and update the instances manually, but how can I automate this process so that it runs continuously? For example, every 3 months my Auto Scaling group should update the instances to the latest ECS-optimized AMI released by Amazon.
My EC2 instances are in an Auto Scaling group. What automation ideas can I implement here?
Any AWS doc or GitHub repo link showing how to achieve this would also be very helpful.
Thanks in advance.
Step 1: You can read the latest AMI IDs from AWS Systems Manager Parameter Store and use EventBridge to set up notifications for when they change.
Step 2: Write a Lambda function to update your launch configuration, which holds the AMI ID.
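The two steps above could be sketched roughly as follows. This is a minimal sketch, not a tested deployment: it assumes the Auto Scaling group uses a launch template (launch configurations cannot be edited in place, so a launch template is swapped in here), the template name `my-ecs-launch-template` is hypothetical, and the parameter path is the public one AWS publishes for the recommended ECS-optimized Amazon Linux 2 AMI. How the function is triggered (a schedule or an EventBridge rule) is left open.

```python
# Public SSM parameter holding the recommended ECS-optimized Amazon Linux 2 AMI ID.
ECS_AMI_PARAM = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
# Hypothetical launch template name; replace with your own.
LAUNCH_TEMPLATE_NAME = "my-ecs-launch-template"


def latest_ami_id(ssm):
    """Read the recommended AMI ID from Parameter Store."""
    response = ssm.get_parameter(Name=ECS_AMI_PARAM)
    return response["Parameter"]["Value"]


def current_ami_id(ec2):
    """Read the AMI ID used by the latest version of the launch template."""
    response = ec2.describe_launch_template_versions(
        LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
        Versions=["$Latest"],
    )
    data = response["LaunchTemplateVersions"][0]["LaunchTemplateData"]
    return data.get("ImageId")


def refresh_launch_template(ssm, ec2):
    """Create a new template version when the AMI changed; return the new AMI ID or None."""
    new_ami = latest_ami_id(ssm)
    if new_ami == current_ami_id(ec2):
        return None  # already up to date, nothing to do
    ec2.create_launch_template_version(
        LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
        SourceVersion="$Latest",  # copy all other settings from the latest version
        LaunchTemplateData={"ImageId": new_ami},
    )
    return new_ami


def lambda_handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the helpers stay testable."""
    import boto3

    updated = refresh_launch_template(boto3.client("ssm"), boto3.client("ec2"))
    return {"updated_ami": updated}
```

If the Auto Scaling group references the `$Latest` template version, instances launched after this runs should pick up the new AMI; existing instances would still need to be replaced, for example with an instance refresh.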

Do we need to create two AMIs for master and core in EMR?

I need to create an AWS EMR cluster for a Spark job with one master and 4 core nodes with auto scaling. I need different instance types for master and core, with Ubuntu 16.0 installed on them. Do I need to create two AMIs, one for the master and one for the core nodes?
Amazon EMR has its own library of AMIs. You can select the AMI version when launching the cluster.
You can create a custom AMI, but it must be based on Amazon Linux.
See: Using a Custom AMI - Amazon EMR
If you wish to launch a Hadoop cluster with your own Ubuntu AMI, you cannot use the Amazon EMR service. You will need to launch and configure it yourself on Amazon EC2 instances.
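To illustrate the point about instance types, here is a minimal sketch of a cluster request in which the master and core nodes differ only in their instance group settings, so separate AMIs are not needed for that purpose. The cluster name, instance types, release label, and role names are illustrative assumptions; the optional `custom_ami_id` argument maps to the `CustomAmiId` field of the EMR `RunJobFlow` API and would have to point at an Amazon Linux based image.

```python
def build_cluster_request(custom_ami_id=None):
    """Build keyword arguments for the EMR RunJobFlow call (values are illustrative)."""
    request = {
        "Name": "spark-cluster",        # hypothetical cluster name
        "ReleaseLabel": "emr-5.31.0",   # assumed release; pick the one you need
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",   # assumed type for the master
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": "r5.2xlarge",  # a different assumed type for core
                    "InstanceCount": 4,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if custom_ami_id:
        # One custom AMI for the cluster; it must be based on Amazon Linux.
        request["CustomAmiId"] = custom_ami_id
    return request
```

With boto3 installed and credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**build_cluster_request())`.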

What would be a suitable configuration for a 2-3 node Hadoop cluster on AWS?

What would be a suitable configuration for setting up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR or with EC2 and set up the cluster manually.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (eg Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster under Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR

Installing Impala 2.3 on Amazon EMR

I see that Impala 2.3 is only supported on Cloudera CDH 5.5 & above. Impala 2.2 can be installed on Amazon EMR as there is Bootstrap script available on GitHub & you don't require Cloudera installation.
However, I don't see any way to install Cloudera CDH 5.5 or 5.6 on Amazon EMR. I want to install Impala 2.3 so is there any way through which Impala 2.3 can be installed on Amazon EMR?
Well, my previous answer was deleted because it "does not provide an answer to the question". I'm not going to argue about whether a partially incorrect answer to this question is better, or whether making categorical claims without foundation is a good answer :/.
In any case, I'm not giving up :)
Yes, it's possible to install "anything", at least on paper.
Once you launch the EMR cluster, all instances will appear in your EC2 console. The only thing is that you have to be careful to assign the right permissions for SSH access to your instances. My suggestion is to create a specific security group with that access and assign this extra security group to the instances using the Advanced configuration of the cluster.
With the proper configuration, you can SSH into any instance and install anything (you should be able to scp any file or download from the internet if your VPC is configured properly). Note that the user will be "hadoop" instead of "ec2-user", but this is documented in the EMR user guide.
Keep in mind that once the cluster is terminated its instances are gone: EMR instances are volatile, and the installation will not survive cluster termination.
On the other hand, with recent EMR AMI versions and current AWS capabilities (I think this was always the case, but it doesn't matter now), you should be able to define bootstrap actions that install anything you want.
Using the "Advanced configuration" of your cluster, you can access the "Bootstrap" actions to be executed on your cluster. You can even have different actions depending on the node type (master, core, task). You should store your scripts (and/or jar files) in an S3 bucket and make this bucket available to your cluster. On paper, you could install Impala on the EC2 instances comprising the EMR cluster, but I'm not sure whether this will work.
For more information, you can read http://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-plan-bootstrap.html
And for an older EMR AMI version and a less recent version of Impala, you can read https://github.com/awslabs/emr-bootstrap-actions/tree/master/impala
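As a rough illustration of running different bootstrap steps per node type, the sketch below shows how a bootstrap script might decide what to do on each node. It assumes the node exposes its role in `/mnt/var/lib/info/instance.json` under an `isMaster` key, as described in the EMR bootstrap documentation; the step names are purely hypothetical placeholders, not real Impala install commands.

```python
import json

# Metadata file that EMR nodes are documented to provide (treated as an assumption here).
INSTANCE_INFO_PATH = "/mnt/var/lib/info/instance.json"


def is_master_node(instance_info_json):
    """Return True when the instance metadata marks this node as the master."""
    return bool(json.loads(instance_info_json).get("isMaster", False))


def install_plan(instance_info_json):
    """Choose the (hypothetical) install steps for this node based on its role."""
    steps = ["download-packages-from-s3"]  # common to every node
    if is_master_node(instance_info_json):
        steps.append("install-coordinator-services")
    else:
        steps.append("install-worker-services")
    return steps


def main():
    """Entry point when run as a bootstrap action on a cluster node."""
    with open(INSTANCE_INFO_PATH) as handle:
        for step in install_plan(handle.read()):
            print(step)
```

A real bootstrap action would replace the printed step names with actual install commands and call `main()` at the end of the script.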
Thanks Mark, you forced me to elaborate on my comment.
No, it is not possible to "install" anything on EMR because it's a PaaS provided by AWS. But if your goal is to run a newer version of Impala on AWS, there is an AWS Quick Start path for installing CDH 5.x (including Impala) that makes the process relatively easy.
http://aws.amazon.com/quickstart/

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster, or at least of the namenode, and then reuse it whenever I want to create other clusters. But after some Google searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot? Or is there any alternative that can help with my use case?
I appreciate any kind of help.
Thanks
It is not possible to create a snapshot of an EMR cluster node, and you cannot use a custom AMI when running a cluster. However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your own bootstrap scripts and reuse them every time you launch a new cluster. This way you can achieve functionality similar to what you are looking for.
For more information on using bootstrap actions on EMR, please visit: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
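The reusable bootstrap approach described above might look something like this when clusters are created programmatically. It is a sketch under assumptions: the bucket and script names are hypothetical, and the dictionaries follow the `BootstrapActions` shape of the EMR `RunJobFlow` API.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Describe one bootstrap action in the shape the RunJobFlow API expects."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }


def with_bootstrap_actions(cluster_request, actions):
    """Return a copy of a cluster request with the given bootstrap actions appended."""
    merged = dict(cluster_request)
    existing = list(cluster_request.get("BootstrapActions", []))
    merged["BootstrapActions"] = existing + list(actions)
    return merged


# Hypothetical scripts kept in S3 and reused for every new cluster.
STANDARD_ACTIONS = [
    bootstrap_action("install-tools", "s3://my-emr-bootstrap/install-tools.sh"),
    bootstrap_action("configure-app", "s3://my-emr-bootstrap/configure.sh", ["--env", "poc"]),
]
```

Each new cluster request can then be passed through `with_bootstrap_actions(request, STANDARD_ACTIONS)` before it is submitted, so every cluster starts with the same software installed.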
Let us know if you need any further assistance.