Continuous Integration on AWS EMR - amazon-web-services

We have a long running EMR cluster that has multiple libraries installed on it using bootstrap actions. Some of these libraries are under continuous development and their codebase is on GitHub.
I've been looking to hook Travis CI up to AWS EMR in a similar way to how Travis integrates with CodeDeploy. The idea is to get the code on GitHub tested and deployed automatically to EMR, while using bootstrap actions to install the updated libraries on all of EMR's nodes.
A solution I came up with is to use an EC2 instance in the middle, where Travis and CodeDeploy can first be used to deploy the code onto the instance. After that, a launch script on the instance is triggered to create a new EMR cluster with the updated libraries.
However, the above solution means we need to create a new EMR cluster every time we deploy a new version of the system.
Any other suggestions?

You definitely don't want to maintain an EC2 instance to orchestrate a CI/CD process like that. First of all, it introduces a number of challenges: you have to deal with an entire server instance, keep it maintained, deal with networking, and apply monitoring and alerting for availability issues, and even then you won't have availability guarantees, which may cause other problems. Above all, maintaining an EC2 instance for a purpose like that is simply unnecessary.
I recommend that you investigate using AWS CodePipeline together with AWS Step Functions and Lambda.
A Step Functions state machine can be used to orchestrate the provisioning of your EMR cluster in a fully serverless environment. With CodePipeline, you can set up a webhook into your GitHub repo to pull your code and spin up a new deployment automatically whenever changes are committed to your master branch (or whatever branch you specify). You can use EMRFS to sync an S3 bucket or folder to your EMR file system for your cluster, and then get the security benefits of IAM as well as the additional consistency guarantees that come with EMRFS. With Lambda, you also get seamless integration with other services, such as Kinesis, DynamoDB, and CloudWatch, among many others, which will simplify many administrative and development tasks and enable more sophisticated automation with minimal effort.
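As a rough illustration, a Lambda task inside the state machine can provision the cluster through the EMR API. This is only a minimal sketch, assuming your bootstrap script already sits in an S3 bucket you control; the bucket, key, cluster name, and instance settings below are hypothetical:

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        """Provision a new EMR cluster with the updated bootstrap actions."""
        response = emr.run_job_flow(
            Name="ci-cluster",                         # hypothetical name
            ReleaseLabel="emr-6.3.0",                  # use the release you actually run
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
            },
            BootstrapActions=[{
                "Name": "install-libs",
                "ScriptBootstrapAction": {
                    "Path": "s3://my-ci-bucket/bootstrap/install_libs.sh"  # hypothetical S3 path
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        return {"ClusterId": response["JobFlowId"]}

The state machine can then poll the cluster state (or use a wait/choice loop) before moving to the next stage of the pipeline.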
There are some great resources and tutorials for using CodePipeline with EMR, as well as in general. Here are some examples:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-ecs-ecr-codedeploy.html
https://chalice-workshop.readthedocs.io/en/latest/index.html
There are also great tutorials for orchestrating applications with Lambda Step Functions, including the use of EMR. Here are some examples:
https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://github.com/DavidWells/serverless-workshop/tree/master/lessons-code-complete/events/step-functions
https://github.com/aws-samples/lambda-refarch-imagerecognition
https://github.com/aws-samples/aws-serverless-workshops
In the very worst case, if all of those options fail (for example, if you need very strict control over the startup process after the EMR cluster completes its bootstrapping), you can always create a Java JAR that is loaded as a final step and then use it either to execute a shell script or to call the various Amazon Java libraries to run your provisioning commands. Even in this case, you still have no need to maintain your own EC2 instance for orchestration purposes (which, in my opinion, would still be hard to justify even if it were running in a Docker container in Kubernetes), because you can easily maintain that deployment process with a fully serverless approach as well.
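If you go that route, note that a custom JAR is not strictly required: EMR can also run a shell script as a step via command-runner.jar. A hedged sketch of adding such a final step with the SDK, assuming the provisioning script already lives in an S3 bucket of yours (the bucket, key, and cluster ID are hypothetical):

    import boto3

    emr = boto3.client("emr")

    # Add a final step that pulls a provisioning script from S3 and runs it on the master node.
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "post-bootstrap-provisioning",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "bash", "-c",
                    "aws s3 cp s3://my-ci-bucket/provision.sh /tmp/provision.sh && bash /tmp/provision.sh",
                ],
            },
        }],
    )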
There are many great videos from the Amazon re:Invent conferences that you may want to watch to get a jump start before you dive into the workshops. For example:
https://www.youtube.com/watch?v=dCDZ7HR7dms
https://www.youtube.com/watch?v=Xi_WrinvTnM&t=1470s
Many more such videos are available on YouTube.
Travis CI also supports Lambda deployment, as mentioned here: https://docs.travis-ci.com/user/deployment/lambda/

Related

Best practices to Initialize and populate a serverless PostgreSQL RDS instance by a CloudFormation stack deployment

We are successfully spinning up an AWS CloudFormation stack that includes a serverless RDS PostgreSQL instance. Once the PostgreSQL instance is in place, we're automatically restoring a PostgreSQL database dump (in binary format), created using pg_dump on a local development machine, onto the PostgreSQL instance just created by CloudFormation.
We're using a Lambda function (instantiated by the CloudFormation process) that includes a build of the pg_restore executable in a Lambda layer, and we've also packaged our database dump file inside the Lambda.
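A minimal sketch of what that Lambda looks like for us; the layer path, dump file name, and connection details here are simplified placeholders rather than our exact setup:

    import os
    import subprocess

    def handler(event, context):
        """Restore the packaged binary dump into the freshly created instance."""
        env = dict(os.environ, PGPASSWORD=os.environ["DB_PASSWORD"])  # injected via CloudFormation
        subprocess.run(
            [
                "/opt/bin/pg_restore",              # pg_restore shipped in the Lambda layer (placeholder path)
                "--host", os.environ["DB_HOST"],
                "--username", os.environ["DB_USER"],
                "--dbname", os.environ["DB_NAME"],
                "--no-owner",
                "/var/task/dump/app_db.dump",       # dump file packaged with the function (placeholder path)
            ],
            check=True,
            env=env,
        )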
The above approach seems complicated for something that has presumably been solved many times... but Google searches have revealed almost nothing that corresponds to our scenario. We may be thinking about our situation in the wrong way, so please feel free to offer a different approach (e.g., is there a CodePipeline/CodeBuild approach that would automate everything?). Our preference would be to stick with the AWS toolset as much as possible.
This process will be run every time we deploy to a new environment (e.g., development, test, stage, pre-production, demonstration, alpha, beta, production, troubleshooting) potentially per release as part of our CI/CD practices.
Does anyone have advice or a blog post that illustrates another way to achieve our goal?
If you have provisioned everything via IaC (Infrastructure as Code) anyway, most of the time-consuming work is already done, and you should already be able to replicate your infrastructure to other accounts/stacks via different roles in your AWS credentials and config files by passing the --profile some-profile flag. I would recommend AWS SAM (Serverless Application Model) over CloudFormation, though, as I find I only need to write about half the code (roles and policies are mostly created for you) and get much better and faster feedback via the console. I would also recommend sam sync (it is in beta currently but worth using) so you don't need to create a change set on code updates; only the code is updated, so deploys take 3-4 seconds. Some AWS SAM examples here, check both videos and patterns: https://serverlessland.com (official AWS site)
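The same named-profile mechanism carries over when you script against those environments with the SDK, which can be handy for the restore step itself. A small, hedged sketch; the profile name, stack name, and output key are hypothetical:

    import boto3

    # Each target environment is addressed purely by a profile in ~/.aws/config.
    session = boto3.Session(profile_name="some-profile")   # hypothetical profile per environment
    cfn = session.client("cloudformation")

    # Look up the endpoint the stack just created, e.g. to point pg_restore at it.
    outputs = cfn.describe_stacks(StackName="my-db-stack")["Stacks"][0]["Outputs"]
    db_host = next(o["OutputValue"] for o in outputs if o["OutputKey"] == "DbEndpoint")
    print(db_host)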
In terms of the restore of RDS, I'd probably create a base RDS instance, take a snapshot of it, and restore all other RDS instances from snapshots rather than manually creating them every time. You are able to copy snapshots, and in fact automate backups to snapshots cross-account (and cross-region if required), which should be a little cleaner.
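A rough sketch of that flow with the SDK; the identifiers are hypothetical, and for an Aurora/serverless cluster you would use the equivalent DB-cluster calls instead of the DB-instance ones:

    import boto3

    rds = boto3.client("rds")

    # One-off: snapshot the seeded "base" instance.
    rds.create_db_snapshot(
        DBInstanceIdentifier="base-postgres",        # hypothetical base instance
        DBSnapshotIdentifier="base-postgres-seeded",
    )

    # Per environment: restore a new instance from that snapshot.
    # (In practice, wait for the snapshot to become 'available' first; waiter omitted.)
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="test-postgres",        # hypothetical per-environment name
        DBSnapshotIdentifier="base-postgres-seeded",
        DBInstanceClass="db.t3.medium",
    )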
There is also a clean solution for replicating data continuously across AWS accounts using AWS DMS (Database Migration Service) and CDC (Change Data Capture). The process is: you have a source DB, a target DB, and a replication instance (e.g. an EC2 micro) that monitors your source DB and replicates it across to different instances, so you are always in sync on data. For example, if you have several devs working on separate stacks so as not to pollute each other's logs, you can replicate data and changes out seamlessly from one DB if required. This works with the source DB and destination DB both already in AWS, but you can of course also use AWS DMS for migrating a local DB over to AWS. Some more info here: https://docs.aws.amazon.com/dms/latest/sbs/chap-manageddatabases.html
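Under stated assumptions (the DMS endpoints and the replication instance already exist; the ARNs below are placeholders), a hedged sketch of kicking off such a CDC task with the SDK:

    import json
    import boto3

    dms = boto3.client("dms")

    task = dms.create_replication_task(
        ReplicationTaskIdentifier="dev-sync",
        SourceEndpointArn="arn:aws:dms:region:account:endpoint:source-id",   # placeholder ARNs
        TargetEndpointArn="arn:aws:dms:region:account:endpoint:target-id",
        ReplicationInstanceArn="arn:aws:dms:region:account:rep:instance-id",
        MigrationType="full-load-and-cdc",            # initial copy, then ongoing change capture
        TableMappings=json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "all-tables",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }]
        }),
    )

    dms.start_replication_task(
        ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
        StartReplicationTaskType="start-replication",
    )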

What is the difference between AWS Lambda & AWS Elastic Beanstalk

I am studying for my AWS Cloud Practitioner Certification and I am confused with the difference between AWS Lambda & AWS Elastic Beanstalk. From my understanding, for both services you upload your code to AWS and AWS essentially manages the underlying infrastructure for you.
I know with Lambda you upload your code to a 'Lambda Function' and set triggers for when the code executes.
With AWS EB you upload your application code and EB automatically handles the deployment, capacity, provisioning, etc...
They both sound very similar as you upload your code to both and both handle underlying instances/environments.
Thanks!
Elastic Beanstalk and Lambda are very different, though some of the features may look similar. At a high level, Elastic Beanstalk deploys a long-running application, whereas Lambda runs short-lived functions.
Lambda can run for at most 15 minutes per invocation, whereas EB can run continuously. Generally, we deploy websites/apps on EB, whereas Lambda is generally used for triggered functionality, like processing an image when it gets uploaded to S3.
A Lambda execution environment handles only one request at a time, whereas the number of concurrent requests EB can handle depends on your underlying infrastructure. So, if you have, say, 100 concurrent requests, up to 100 Lambda environments will be created, whereas those 100 requests might be handled by a single underlying EC2 instance in EB.
Lambda is serverless (the underlying infra is entirely abstracted from the developer), whereas EB is automation over infra provisioning. You can still see your EC2 instances, load balancer, auto scaling group, etc. in your AWS console. You can even SSH/RDP to your instances and change running services. EB also allows you to use custom AMIs.
Lambda has the issue of cold starts, because infra needs to be provisioned on demand by AWS, whereas in EB you generally have EC2 instances already provisioned to handle your requests.
All great (and exam-specific) points by SmartCoder. If I may add a general ancillary comment:
Wittgenstein said, "In most cases, the meaning of a word is its use." I think this maxim is remarkably apt for software engineering too. In the context of your question, those two AWS services are used for significantly different purposes.
Lambda - Say you developed a photo uploading application with Node.js that uploads some processed images to an S3 bucket. The core logic for this is probably quite straightforward, and it's got a singular, distinct task. Simply take in an image, do some processing and, barring any exception, store it in a bucket. In this case, it's inefficient to waste time spinning up servers, configuring them with a runtime environment, downloading dependencies, handling maintenance, etc. A literal copy and paste of your code into the Lambda console, while setting up a few configurations, should get your job done. Plus, you save a lot of money as infrastructure is "provisioned" only when your Node.js function is invoked. Again, keep in mind the principle of this code performing a singular task.
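To make that concrete, a function of this kind is often little more than the following (a minimal sketch, shown in Python rather than Node.js for brevity, assuming the function is wired to an S3 "ObjectCreated" trigger; the destination bucket name is hypothetical):

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by an S3 upload: fetch the image, process it, store the result."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            image = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            processed = image  # placeholder for the actual resizing/watermarking logic

            s3.put_object(
                Bucket="processed-images-bucket",   # hypothetical destination bucket
                Key=key,
                Body=processed,
            )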
Elastic Beanstalk - This same photo uploading system mentioned above might now mature into a more complex full-fledged software application that requires user management, authentication, and further processing of the images, which certainly requires more provisioning of resources. This application will probably do a lot of things with multiple code repositories for you to manage and deploy. And yet, you don't want to spend money on a DevOps engineer or learn to use an IaC (Infrastructure as Code) platform like CloudFormation or Terraform. In this case, Elastic Beanstalk is useful for a developer without too much in-depth DevOps knowledge as it's a PaaS (Platform as a Service) tool; it pretty much gives you a clear interface to spin up whole new production-ready systems.
Here are two good whitepapers I read a while back on the above topics.
https://docs.aws.amazon.com/whitepapers/latest/serverless-architectures-lambda/serverless-architectures-lambda.pdf
https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/introduction-devops-aws.pdf
Lambda runs based on specific trigger events, and it exits as soon as its work is over.

from gitlab ci/cd to AWS EC2

It's been some time since I've been trying to figure out a really easy way.
I am using GitLab CI/CD and want to move the built files from there to AWS EC2. The problem is that the two ways I found are both really bad ideas.
Building the project on GitLab CI/CD, then SSHing into the EC2 instance, pulling the project there again, and running npm scripts. This is really wrong and I won't go into details why.
I saw the following: How to deploy with Gitlab-Ci to EC2 using AWS CodeDeploy/CodePipeline/S3 , but it's so big and complex.
Isn't there any easier way to copy built files from gitlab ci/cd to AWS EC2 ?
I use Gitlab as well, and what has worked for me is configuring my runners on EC2 instances. A few options come to mind:
I'd suggest managing your own runners (vs. shared runners), giving them permissions to drop built files in S3, and having your instances pick them up from there. You could trigger SSM commands from the runner targeting your instances (preferably by tags) and they'll download the built files.
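A hedged sketch of that SSM push from the runner, assuming the artifacts were already uploaded to S3; the bucket, paths, tag, and service name are hypothetical:

    import boto3

    ssm = boto3.client("ssm")

    # Tell every instance tagged Role=web to pull the freshly built files from S3.
    ssm.send_command(
        Targets=[{"Key": "tag:Role", "Values": ["web"]}],          # hypothetical tag
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": [
                "aws s3 sync s3://my-build-artifacts/latest/ /var/www/app/",  # hypothetical bucket/path
                "systemctl restart my-app",                                    # hypothetical service name
            ]
        },
    )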
You could also look into S3 notifications. I've used them to trigger Lambda functions on object uploads: it's pretty fast and offers retry mechanisms. The Lambda could then push SSM commands to the instances. https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
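In that variant, the Lambda behind the notification is tiny; a sketch using the same hypothetical tag and bucket layout as above:

    import boto3

    ssm = boto3.client("ssm")

    def handler(event, context):
        """Fires on each S3 upload and fans the deployment command out via SSM."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            prefix = key.rsplit("/", 1)[0] if "/" in key else ""   # deploy the uploaded folder

            ssm.send_command(
                Targets=[{"Key": "tag:Role", "Values": ["web"]}],   # hypothetical tag
                DocumentName="AWS-RunShellScript",
                Parameters={"commands": [f"aws s3 sync s3://{bucket}/{prefix} /var/www/app/"]},
            )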

AWS CLI vs Console and CloudFormation stacks

Is there any known downside to creating resources on AWS through the CLI? Is one method more reliable/easier/less error-prone/more widely accepted/recommended than the other? When setting up recurring scripts, is there a reason why I would want to use CloudFormation or the AWS Console over the AWS CLI to run commands directly?
For example, if I were to create an ECS Fargate task definition, is there any reason why I might want to use AWS CloudFormation or the Console over the AWS CLI? The CLI syntax is straightforward and easy to use, and there are a few things (like setting up event rules/targets for a Fargate task specifically) that are not supported via CloudFormation yet.
The AWS CLI and AWS CloudFormation are two different tools that can be used to create infrastructure on AWS. The CLI is more powerful and offers finer-grained control than CloudFormation. CloudFormation makes it very easy to use YAML or JSON text files that can describe an entire enterprise in the cloud.
One of the strong benefits of CloudFormation is the automatic support for rolling back changes if anything fails while deploying a stack. The CLI, in comparison, would require you to figure out what went wrong and how to get back to the state you were in. Updating infrastructure is another benefit of CloudFormation: make the change in the template and update the stack.
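That update workflow is also scriptable. A hedged sketch with the SDK (the stack and template file names are hypothetical), which gives you the same rollback-on-failure behaviour as the console:

    import boto3

    cfn = boto3.client("cloudformation")

    # Push the edited template; CloudFormation computes the diff and rolls back on failure.
    with open("fargate-task.yaml") as f:                 # hypothetical template file
        cfn.update_stack(
            StackName="my-fargate-stack",                # hypothetical stack name
            TemplateBody=f.read(),
            Capabilities=["CAPABILITY_NAMED_IAM"],       # only needed if the template creates IAM resources
        )

    # Block until the update finishes (or fails and rolls back).
    cfn.get_waiter("stack_update_complete").wait(StackName="my-fargate-stack")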
For small setups, using the CLI is fine. However, once you get past launching an EC2 instance and start building VPCs, instances, key pairs, security groups, RDS, etc., you will find that the CLI has some real limitations: it is mostly too manual a process, not easily repeatable, and difficult to put into version control.
If you are constantly building, testing, and deleting complex setups, CloudFormation is absolutely one of the best tools from AWS. Note that there are a number of third-party solutions with huge followings, such as Bamboo, Octopus, Jenkins, Chef, etc.
If your job is SysOps or DevOps, then you absolutely want to master the CLI and CloudFormation. These are amazing tools for working with AWS. Also master Elastic Beanstalk, maybe OpsWorks, and one of the third-party tools like Jenkins.

Exploring tools to trigger build script to rollout specific git branch to a subset of the amazon ec2 instances

We have multiple amazon ec2 instances behind a load balancer. Our build script is written in phing and is integrated with git.
We are looking for a tool (like Jenkins or AWS CodeDeploy) which could display all the active instances currently behind the load balancer and then allow us to select some of them (or select a previously defined group) and trigger either of the following (whichever is better) -
a build script hosted on the same dedicated server where the tool is hosted.
or the respective build scripts hosted on the selected ec2 instances.
We should be able to do the following -
specify a git branch name, optionally, when we trigger the build script for any group of instances.
be able to roll out in batches of boxes, so as to get some time to monitor load, and then move to the next batch if all is good. The best way, I guess, would be to specify a batch size (e.g. 10), so that the process waits for a user prompt after the rollout of each batch completes.
So, if we have to rollout two different git branches to two groups of instances, we should be able to run them in two steps (if we do not specify batch size).
Would like to know about experiences of people who dealt with something similar.
CodeDeploy supports Git (more precisely, GitHub). It also allows you to deploy only to tagged EC2 instances. If combined with a custom deployment configuration (http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-deployment-configuration.html), you can also control how fast (i.e., the batch size) to deploy.
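A hedged sketch of such a custom configuration via the SDK, expressing the batch-of-10 requirement as "keep all but 10 hosts healthy at any time"; the name is hypothetical, and the CLI takes the same fields:

    import boto3

    codedeploy = boto3.client("codedeploy")

    # With 50 instances in the group, requiring 40 healthy hosts means
    # CodeDeploy touches at most 10 instances at a time.
    codedeploy.create_deployment_config(
        deploymentConfigName="BatchOfTen",            # hypothetical name
        minimumHealthyHosts={"type": "HOST_COUNT", "value": 40},
    )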
I would restructure the question into two parts:
the choices you have for application deployment,
and whether the tool has an option to perform rolling deployments.
Jenkins is software for CI/CD, which will have to use plugins, custom scripting, or an existing orchestration software setup for doing the deployments.
For software orchestration, you have many choices; some of the more famous tools are Chef, Puppet, Ansible, etc. All of these would need you to manage some kind of centralized setup. All such software supports application deployment.
You need to make a decision on whether you would want to invest in maintaining such a setup.
If you decide against such a setup, you have the option of using managed services such as AWS OpsWorks, AWS CodeDeploy, hosted Chef, etc.
In choosing any of these services, you delegate the management of the orchestration software to a vendor, which will ensure the service is up all the time.
AWS CodeDeploy and AWS OpsWorks are managed services on AWS and work pretty well in AWS setups.
AWS OpsWorks uses chef under the hood.
AWS CodeDeploy only provides a subset of what OpsWorks provides and is responsible only for deployments. With AWS CodeDeploy you get a convenient visualization of your software deployments through the AWS console.
With AWS CodeDeploy, you can achieve the goal of a partial rollout to EC2 instances.
You can do the same with other tools as well, but CodeDeploy in an AWS environment will take the least amount of work.
CodeDeploy also allows you to deploy from GitHub. Please refer to the following AWS documentation:
http://docs.aws.amazon.com/codedeploy/latest/userguide/github-integ-tutorial.html
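For context, a hedged sketch of what that GitHub-backed, tag-targeted deployment looks like through the SDK; the application, group, role ARN, repository, and commit values are hypothetical, and the deployment config refers to the custom batch-size example above:

    import boto3

    codedeploy = boto3.client("codedeploy")

    # Deployment group limited to instances carrying a specific tag,
    # rolled out with the hypothetical batch-size config from earlier.
    codedeploy.create_deployment_group(
        applicationName="my-app",                                         # hypothetical
        deploymentGroupName="web-group-a",                                # hypothetical
        serviceRoleArn="arn:aws:iam::123456789012:role/CodeDeployRole",   # placeholder role ARN
        ec2TagFilters=[{"Key": "Group", "Value": "a", "Type": "KEY_AND_VALUE"}],
        deploymentConfigName="BatchOfTen",
    )

    # Deploy a specific commit straight from GitHub.
    codedeploy.create_deployment(
        applicationName="my-app",
        deploymentGroupName="web-group-a",
        revision={
            "revisionType": "GitHub",
            "gitHubLocation": {"repository": "my-org/my-repo", "commitId": "abc123"},  # hypothetical
        },
    )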
The pitfall with CodeDeploy is that the agent that runs on the instances has been tested and is supported for only a limited number of OS combinations (http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-run-agent.html#how-to-run-agent-supported-oses).
Also, if in the future you decide to move away from AWS, you will have to redo the deployment-related work.
The CodeDeploy service itself only charges you for the underlying AWS resources.
Please find the link to pricing documentation below:
https://aws.amazon.com/codedeploy/pricing/