Bootstrap actions run before Amazon EMR installs the applications that
you specify when you create the cluster and before cluster nodes begin
processing data. If you add nodes to a running cluster, bootstrap
actions also run on those nodes in the same way. You can create custom
bootstrap actions and specify them when you create your cluster.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
i need to patch the application (presto) after it is installed on all nodes. a few possible solutions are
passwordless ssh, but for some security concern we disabled it.
in the bootstrap schedule a cron job and check if the application is installed then act upon it.
use ssm. but never really tried yet.
any idea?
[Update]
what actually has been done in our case is scheduling a background scripts (the &) in the bootstrap scripts which won't block bootstrap. inside the job, it will periodically check if the package is installed or not, if it is installed (e.g. rpm -q presto), then patch it.
I believe you can use EMR steps to do this. Here is a somewhat relevant What is the correct syntax for running a bash script as a step in EMR? description on how to use it.
Update:
You cannot use EMR steps since steps only run on the master.
Related
I have a question regarding AWS, have an AMI with windows server installed, IIS installed, and a site up and running.
My AutoScale always maintains two instances created based on this AMI.
However, whenever I need to change something on the site I need to upload a new instance, make the changes, update the AMI and update the auto-scale, which is quite time consuming.
Is there any way to automate this by linking to a Git repository?
This is more like a CI CD work rather than achieved in AWS.
You can schedule a CI CD pipeline to detect any update happens in SCM(GIT) and trigger a build job(Jenkins or similar tool) which will provide an artifact to you. You can deploy the artifact to respective application server using CD tools (ansible/even with jenkins or similar tools) whichever suits your infra. In the deploy script itself you can connect to ec2 service to create a new AMI once deployment is completed.
You need to use set of tools to achieve it SCM webhook/poll, Jenkins, Ansible.
Currently working on an environment requirement where we are to push the same file out to multiple EC2 instances running Windows on a scheduled interval. As it stands now, I see a few options and have tried each:
Windows Task Manager: run a basic task on a set schedule invoking the S3 Sync CLI tool
Cons I can see here include: setting up the task on each EC2 instance (there are many).
Lambda: scheduled lambda job that utilizes SSM to run commands on each server in a resource group
Cons: introducing another layer required to execute this task.
Run Command: using an AWS-RunRemoteScript document, run the script (stored in S3) bucket on target instances.
Cons: I'm not positive you can automate these commands on a schedule without adding another layer.
What is the most scalable path forward? Thanks in advance for your help.
Using the Run Command feature of AWS Systems Manager together with either the Maintenance Window feature of AWS Systems Manager or using CloudWatch Events to schedule the execution of Run Command should be useful here.
If you also tag instances appropriately, you can use the tag targeting feature of Run Command to ensure that all instances run the command (including new instances launched in the future as long as they are tagged).
/Mats
I need to install some python packages on the EMR cluster, and AFAIK, I could write down some pip install blabla... commands in EMR's bootstrap actions when CREATING the cluster, and those install-commands will be run when allocating machines for the cluster.
OK, what if the cluster now is created, and later I need to install some other new packages which are not written in the bootstrap actions? I didn't find out any methods for this kind of case, do I HAVE TO re-create a new cluster with the new bootstrap actions?
After the cluster is created, unfortunately, EMR does not provide an API to run a command on ALL NODES.
EMR does have STEP API, where you can run a script on just Master node.
You can either use that STEP API to run a script which can in-turn run a script on all nodes or run a script manually to do so.
There are several options out there like Ansible , pdsh or simply SSH etc, . You can find the list of EMR nodes and its hostnames using YARN -list
I want to orchestrate my EMR jobs. so I thought oozie will be good fit. I have done some POCs on oozie workflow but in local mode, its fairly simple and great.
But I dont understand how to use oozie on EMR cluster.
Based on some search I got to know that aws doesnt come with oozie so we have install it explicitly as a bootstrap action.
Most people point to this link
https://github.com/lila/emr-oozie-sample
But since I am new to aws(EMR) I am still confused how to use it.
It will be great, If anyone can simplify it for me providing some steps or something.
Thanks
I have had some question, which i posted to AWS technical support and i got below reply. I tried it and Oozie is all installed and running with no extra efforts required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue -> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html installed you should be able to use Oozie immediately as it will be already configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
That github project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is in the next link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell:
create:
ssh:
sshproxy:
socksproxy:
So, if you follow his instructions you only need to run some of this tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline - Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start and terminate EMR clusters, managing cluster lifecycle etc. Evaluate that to see if that fits your needs better..
I am running a couple of ubuntu ec2 instances, I want to run an automation script which will pull the code from Github whenever a new instance is booted from the AMI. The thing is presently I am sshing to the server and run the command git pull origin master and it will ask for password key.
How do I automate this process? So after booting the new instance from a AMI this script should:
Run
Pull the code and also the submodule
Create couple of files and configure it
Please help me to achieve it.
Thanks
This will probably take some time and configuring, but this might set you on the right path.
First, setup your ssh keys, so that you can automatically pull from a repo, without a password. Outlined here: https://help.github.com/articles/generating-ssh-keys
Next, create a startup script to issue the 'pull' command from Github. Here: https://help.ubuntu.com/community/UbuntuBootupHowto
Then save your AMI, When you start a new EC2 AMI, the script should run, pulling in your Github changes.
Also to note, make sure gits remote path is using SSH, if it is HTTPS, it will ALWAYS ask for a password.
Your best best would be to utilize the fact the Ubuntu utilizes CloudInit within its canonical image.
Using CloudInit, you can pass scripts (i.e. shell scripts) to execute at various start up stages as EC2 user-data.
It would be very easy for your to make your GIT command line sequence execute from such a script. He is link to documentation, which includes examples.
https://help.ubuntu.com/community/CloudInit
Create a user-password access to your ubuntu instance. Replicate this particular instance if you need multiple. Now you are free of the key access. If you need to automate a process in that instance cron it or send the script via ssh to that instance and let the cron to find and run it.