GCP components to orchestrate crons running in GCE (Google Workflows?) - google-cloud-platform

I need to run a data transformation pipeline composed of several scripts that live in distinct projects (separate Python repos).
I am thinking of using Compute Engine to run these scripts in VMs when needed, as that lets me manage the resources required.
I need to orchestrate these scripts in the sense that I want to run steps sequentially and sometimes asynchronously.
I see that GCP provides a Workflows component which seems to suit this case.
I am thinking of creating a specific project to orchestrate the execution of the scripts.
However, I cannot see how to trigger the execution of my scripts, which will not be in the same repo as the orchestrator project. From what I understand of GCE, VMs are only created when scripts are executed and provide no persistent HTTP endpoints that could be called to trigger the execution from elsewhere.
To illustrate, let's say I have two projects, step_1 and step_2, which contain separate steps of my data transformation pipeline.
I would also have a project, orchestrator, whose only use is to trigger step_1 and step_2 sequentially in VMs with GCE. This project would not have access to the code repos of the two former projects.
What would be the best practice in this case? Should I use components other than GCE and Workflows for this, or is there a way to trigger scripts in GCE from an independent orchestration project?

One possible solution would be to not use GCE (Google Compute Engine) but instead build Docker containers that contain your task steps and deploy them to Cloud Run. Cloud Run spins up containers on demand and charges you only for the time spent processing a request. When the request ends, you are no longer charged, so you consume resources only while work is being done. Various events can trigger a request in Cloud Run, but the most common is a REST call. With this in mind, assume that your Python code is packaged in a container and exposed through a REST server (e.g. Flask). Effectively you have created "microservices". These services can then be orchestrated by Cloud Workflows. The microservices are invoked through REST endpoints, which can be Internet addresses protected by authorization. This allows the microservices (tasks/steps) to be located in separate GCP projects, and the orchestrator sees them as distinct callable endpoints.
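A minimal sketch of one such step wrapped as an HTTP service, under a few assumptions: it uses the standard-library wsgiref server in place of Flask to stay dependency-free, and run_step and the payload shape are hypothetical stand-ins for the real script in the step_1 repo. Cloud Run tells the container which port to listen on through the PORT environment variable.

```python
import json
import os
from wsgiref.simple_server import make_server


def run_step(payload):
    """Hypothetical placeholder for the real transformation script."""
    return {"step": "step_1", "rows_processed": len(payload.get("rows", []))}


def app(environ, start_response):
    """Minimal WSGI app: a POST runs the step and returns its result as JSON."""
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"use POST"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length) if length else b"{}"
    result = run_step(json.loads(body))
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(result).encode("utf-8")]


def main():
    # Cloud Run passes the port to listen on in the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    make_server("0.0.0.0", port, app).serve_forever()
```

The container entrypoint would call main(). Each step repo would ship its own container like this, so the orchestrator only ever needs the service URL, never the code.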
Other potential solutions to look at would be GKE (Kubernetes) and Cloud Composer (Apache Airflow).
If you DO wish to stay with Compute Engine, you can still do that using Shared VPC. Shared VPC would give distinct projects network connectivity to each other, and you could use Private Catalog to have the GCE instances advertise themselves to each other. You could then have a GCE instance do the choreography or, again, choreograph through Cloud Workflows. We would have to check that Cloud Workflows supports parallel items ... I do not believe that as of the time of this post it does.
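If a GCE instance does the choreography itself, the "sequential, sometimes parallel" requirement can be sketched in plain Python. This is only an illustration under assumptions: the endpoint URLs are hypothetical, every step is assumed to accept a JSON POST, and authentication is left out.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_json(url, payload=None, timeout=3600):
    """POST a JSON payload to one step endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload or {}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def run_pipeline(stages, call=post_json):
    """Run stages one after another; URLs within a single stage run in parallel."""
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for stage in stages:
            # pool.map keeps input order; consuming it re-raises a step's
            # exception, which prevents later stages from starting.
            results.append(list(pool.map(call, stage)))
    return results


# Hypothetical endpoints: step 1 first, then two independent steps together.
PIPELINE = [
    ["https://step-1.example.com/run"],
    ["https://step-2a.example.com/run", "https://step-2b.example.com/run"],
]
```

Calling run_pipeline(PIPELINE) would block until the last stage finishes. The call parameter exists so the HTTP call can be swapped out, for instance for one that adds credentials.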

This is a common request: to organize automation into its own project. You can set up a service account that spans multiple projects.
See a tutorial here: https://gtseres.medium.com/using-service-accounts-across-projects-in-gcp-cf9473fef8f0
On top of that, you can also consider having Workflows in both the orchestrator project and the sub-level projects. This way the orchestrator Workflow can call another Workflow. The job can then be run easily and stays encapsulated in the project that holds the code and the workflow body, and only the triggering comes from the other project.
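As a rough sketch of what a cross-project call with a service account can look like from code running on GCP: the metadata server can mint an identity token for the attached service account, and that token is sent as a bearer token to the service in the other project. The service URL is hypothetical, and the sketch assumes the target is a Cloud Run service on which the service account has been granted the invoker role.

```python
import urllib.parse
import urllib.request

# Metadata server endpoint that returns an ID token for the default service account.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def identity_token_request(audience):
    """Build the metadata-server request for an ID token minted for `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )


def authorized_request(url, token, body=b"{}"):
    """Build a POST to `url` that carries the ID token as a bearer token."""
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_step(service_url):
    """Fetch a token and call the service; only works from inside GCP."""
    with urllib.request.urlopen(identity_token_request(service_url), timeout=10) as resp:
        token = resp.read().decode("utf-8")
    with urllib.request.urlopen(authorized_request(service_url, token), timeout=3600) as resp:
        return resp.read()
```

The orchestrator project then needs only the URL and the IAM grant, not access to the code of the step it triggers.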

Related

How to disable firebase function versioning

Is there any way to disable google cloud functions versioning?
I have tried for a long time to limit the number of versions kept in the Cloud Functions history or, if that is impossible, to disable versioning completely...
This is something that any infrastructure manager will let you do at a low level, but Google intentionally doesn't.
When using Firebase Cloud Functions, there is a lifecycle for a background function. As stated in the documentation:
When you update the function by deploying updated code, instances for older versions are cleaned up along with build artifacts in Cloud Storage and Container Registry, and replaced by new instances.
When you delete the function, all instances and zip archives are cleaned up, along with related build artifacts in Cloud Storage and Container Registry. The connection between the function and the event provider is removed.
There is no need to manually clean or remove the previous versions as Firebase deploy scripts are doing it automatically.
Based on the Cloud Functions Execution Environment:
Cloud Functions run in a fully-managed, serverless environment where Google handles infrastructure, operating systems, and runtime environments completely on your behalf. Each Cloud Function runs in its own isolated secure execution context, scales automatically, and has a lifecycle independent from other functions.
This means that you should not remove build artifacts, since Cloud Functions scale automatically and new instances are built from these artifacts.

What is the difference between GCP cloud composer and workflow?

Cloud Workflows doesn't come with a scheduling feature. Apart from that, what are the differences between these two services in terms of features? In which use cases should we prefer Workflows over Composer, or vice versa?
There are some key differences to consider when choosing between the two solutions:
A Composer instance needs to be in a running state to trigger DAGs, and you also need to size your Cloud Composer instance based on your usage. You do not need to do this in Cloud Workflows, as it is a serverless service and you pay each time a workflow is triggered.
Another key difference is that Cloud Composer is really convenient for writing and orchestrating data pipelines because of its internal scheduler and because of the provided Operators, which let you interact with any data service inside GCP.
However, Cloud Workflows interacts with Cloud Functions, which is a task that Composer cannot do really well.
Both Composer and Workflows support orchestrating multiple services and can handle long running workflows. Despite there being some overlap in the capabilities of these products, each has differentiators that make them well suited to particular use cases.
Composer is most commonly used for orchestrating the transformation of data as part of ELT or data engineering. Workflows, in contrast, is focused on the orchestration of HTTP-based services built with Cloud Functions, Cloud Run, or external APIs.
Composer is designed for orchestrating batch workloads that can handle a delay of a few seconds between task executions. It wouldn’t be suitable if low latency was required in between tasks, whereas Workflows is designed for latency sensitive use cases.
While you don’t have to worry about maintaining Airflow deployments in Composer, you do need to specify how many workers you need for a given Composer environment. Workflows is completely serverless; there is no infrastructure to manage or scale.
For further information, refer to this Google blog article and this one.

How to share resources (Compute Engine) among projects in Google Cloud Platform

I am trying to create a prototype in which I can share resources among projects to run a job within Google Cloud Platform.
Motivation: Let's say there are two projects: Project A and Project B.
I want to use the Dataproc cluster created in Project A to run a job in Project B.
The projects are within the same organisation in GCP.
How do I do that?
There are a few ways to manage resources across projects. Probably the most straightforward way to do this is to:
Create a service account with appropriate permissions across your project(s).
Set up an Airflow connection with the service account you have created.
Create workflows that use that connection, and then specify the project when you create a Cloud Dataproc cluster.
Alternate ways you could do this that come to mind:
Use something like the BashOperator or PythonOperator to execute Cloud SDK commands.
Use an HTTP operator to call the REST endpoints of the services you want to use.
Having said that, the first approach using the operators is likely the easiest by far and would be the recommended way to do what you want.
With respect to Dataproc, when you create a job, it will only bind to clusters within a specific project. It's not possible to create jobs in one project against clusters in another. This is because things like logging, auditing, and other job-related semantics are messy when clusters live in another project.
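For the Cloud SDK alternative mentioned above, a small sketch of a helper that builds and runs a gcloud command; a function like this could serve as the callable of a PythonOperator. The project, region, cluster and file names are hypothetical. Because a job binds to clusters in its own project, the project passed here has to be the one that owns the cluster.

```python
import subprocess


def dataproc_submit_command(project, region, cluster, main_py):
    """Build the argv for submitting a PySpark job to an existing cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_py,
        f"--project={project}",   # must be the project that owns the cluster
        f"--region={region}",
        f"--cluster={cluster}",
    ]


def submit(project, region, cluster, main_py):
    """Run the command; needs the Cloud SDK and credentials valid for `project`."""
    command = dataproc_submit_command(project, region, cluster, main_py)
    return subprocess.run(command, check=True)
```

For example, submit("project-a", "europe-west1", "shared-cluster", "gs://my-bucket/job.py") would run the job in Project A, using whichever service account the Cloud SDK is currently authenticated as.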

Continuous Integration on AWS EMR

We have a long running EMR cluster that has multiple libraries installed on it using bootstrap actions. Some of these libraries are under continuous development and their codebase is on GitHub.
I've been looking to plug Travis CI with AWS EMR in a similar way to Travis and CodeDeploy. The idea is to get the code on GitHub tested and deployed automatically to EMR while using bootstrap actions to install the updated libraries on all EMR's nodes.
A solution I came up with is to use an EC2 instance in the middle, where Travis and CodeDeploy are first used to deploy the code onto the instance. After that, a launch script on the instance is triggered to create a new EMR cluster with the updated libraries.
However, the above solution means we need to create a new EMR cluster every time we deploy a new version of the system.
Any other suggestions?
You definitely don't want to maintain an EC2 instance to orchestrate a CI/CD process like that. First of all, it introduces a number of challenges because then you need to deal with an entire server instance, keep it maintained, deal with networking, apply monitoring and alerts to deal with availability issues, and even then, you won't have availability guarantees, which may cause other issues. Most of all, maintaining an EC2 instance for a purpose like that simply is unnecessary.
I recommend that you investigate using AWS CodePipeline together with AWS Lambda and Step Functions.
The Step Function can be used to orchestrate the provisioning of your EMR cluster in a fully serverless environment. With CodePipeline, you can set up a webhook into your GitHub repo to pull your code and spin up a new deployment automatically whenever changes are committed to your master GitHub branch (or whatever branch you specify). You can use EMRFS to sync an S3 bucket or folder to your EMR file system for your cluster and then obtain the security benefits of IAM, as well as the additional consistency guarantees that come with EMRFS. With Lambda, you also get seamless integration with other services, such as Kinesis, DynamoDB, and CloudWatch, among many others, which will simplify many administrative and development tasks and enable more sophisticated automation with minimal effort.
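To make the idea concrete, here is a rough sketch of a state machine definition, built as a Python dictionary, that creates a cluster with a bootstrap action, runs one step and terminates the cluster. It relies on the Step Functions service integrations for EMR; the cluster name, release label, roles, instance types and S3 paths are all placeholder assumptions to be replaced with your own values.

```python
import json


def emr_state_machine(bootstrap_script, step_args):
    """Return a state machine definition: create cluster, run a step, terminate."""
    return {
        "Comment": "Sketch: provision EMR with updated libraries, run one step",
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "ci-cluster",            # placeholder
                    "ReleaseLabel": "emr-6.15.0",    # placeholder
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "Instances": {
                        "KeepJobFlowAliveWhenNoSteps": True,
                        "InstanceGroups": [{
                            "Name": "primary",
                            "InstanceRole": "MASTER",
                            "InstanceType": "m5.xlarge",
                            "InstanceCount": 1,
                        }],
                    },
                    # The bootstrap action installs the freshly built libraries.
                    "BootstrapActions": [{
                        "Name": "install-libraries",
                        "ScriptBootstrapAction": {"Path": bootstrap_script},
                    }],
                },
                "ResultPath": "$.cluster",
                "Next": "RunStep",
            },
            "RunStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "smoke-test",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": step_args},
                    },
                },
                "ResultPath": "$.step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "End": True,
            },
        },
    }


definition_json = json.dumps(
    emr_state_machine("s3://my-bucket/bootstrap.sh", ["spark-submit", "--version"]),
    indent=2,
)
```

The resulting JSON is what CodePipeline (or a Lambda it invokes) would start on each commit. Check the parameter names against the current EMR and Step Functions documentation before relying on them.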
There are some great resources and tutorials for using CodePipeline with EMR, as well as in general. Here are some examples:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-ecs-ecr-codedeploy.html
https://chalice-workshop.readthedocs.io/en/latest/index.html
There are also great tutorials for orchestrating applications with Lambda and Step Functions, including the use of EMR. Here are some examples:
https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://github.com/DavidWells/serverless-workshop/tree/master/lessons-code-complete/events/step-functions
https://github.com/aws-samples/lambda-refarch-imagerecognition
https://github.com/aws-samples/aws-serverless-workshops
In the very worst case, if all of those options fail, such as if you need very strict control over the startup process on the EMR cluster after the EMR cluster completes its bootstrapping, you can always create a Java JAR that is loaded as a final step and then use that to either execute a shell script or use the various Amazon Java libraries to run your provisioning commands. In even this case, you still have no need to maintain your own EC2 instance for orchestration purposes (which, in my opinion, still would be hard to justify even if it was running in a Docker container in Kubernetes) because you can easily maintain that deployment process as well with a fully serverless approach.
There are many great videos from the Amazon re:Invent conferences that you may want to watch to get a jump start before you dive into the workshops. For example:
https://www.youtube.com/watch?v=dCDZ7HR7dms
https://www.youtube.com/watch?v=Xi_WrinvTnM&t=1470s
Many more such videos are available on YouTube.
Travis CI also supports Lambda deployment, as mentioned here: https://docs.travis-ci.com/user/deployment/lambda/

Exploring tools to trigger build script to rollout specific git branch to a subset of the amazon ec2 instances

We have multiple Amazon EC2 instances behind a load balancer. Our build script is written in Phing and is integrated with Git.
We are looking for a tool (like Jenkins or AWS CodeDeploy) which could display all the active instances currently behind the load balancer, allow us to select some of them (or select a previously defined group), and then trigger either of the following (whichever is better) -
a build script hosted on the same dedicated server where the tool is hosted.
or the respective build scripts hosted on the selected ec2 instances.
We should be able to do the following -
specify a git branch name, optionally, when we trigger the build script for any group of instances.
be able to roll out in batches of boxes, so as to get some time to monitor load, and then move to next batch if all is good. Best way, I guess, would be to specify a size of the batch (e.g. 10), so that the process waits for a user prompt after rollout on every batch completes.
So, if we have to rollout two different git branches to two groups of instances, we should be able to run them in two steps (if we do not specify batch size).
Would like to know about experiences of people who dealt with something similar.
CodeDeploy supports Git (more precisely, GitHub). It also allows you to deploy only to tagged EC2 instances. Combined with a custom DeploymentConfig (http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-deployment-configuration.html), you can also control how fast the deployment proceeds (the size of the batch).
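The batch behaviour asked for in the question can be pictured with a small, tool-independent sketch. This is not how CodeDeploy is implemented; deploy and confirm are hypothetical callbacks standing in for "roll out the branch to one instance" and "ask the operator whether to continue".

```python
def batches(instances, size):
    """Yield consecutive slices of `instances` with at most `size` items each."""
    for start in range(0, len(instances), size):
        yield instances[start:start + size]


def rolling_deploy(instances, size, deploy, confirm):
    """Deploy batch by batch; stop early if `confirm` rejects a finished batch."""
    done = []
    for batch in batches(instances, size):
        for instance in batch:
            deploy(instance)
        done.extend(batch)
        remaining = len(instances) - len(done)
        # Only ask for confirmation when another batch would follow.
        if remaining and not confirm(batch):
            break
    return done
```

With ten instances and a batch size of 3, confirm would be consulted after the first three batches, and the return value lists the instances that actually received the new build.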
I would re-structure the question:
The choices you have for application deployment
and whether the tool has option to perform rolling deployments.
Jenkins is software for CI/CD, which will have to use plugins, custom scripting or an existing orchestration software setup to do the deployments.
For software orchestration, you have many choices; some of the better-known tools are Chef, Puppet, and Ansible. All of these need you to manage some kind of centralized setup, and all of them support application deployment.
You need to decide whether you want to invest in maintaining such a setup.
If you decide against such a setup, you have the option of using managed services such as AWS OpsWorks, AWS CodeDeploy, hosted Chef, etc.
In choosing any of these services, you delegate the management of the orchestration software to a vendor, which will ensure the service is up all the time.
AWS CodeDeploy and AWS OpsWorks are managed services on AWS and work pretty well on AWS setups.
AWS OpsWorks uses Chef under the hood.
AWS CodeDeploy provides only a subset of what OpsWorks provides and is responsible only for deployments. With AWS CodeDeploy you get a convenient visualization of your software deployments through the AWS console.
With AWS CodeDeploy, you can achieve the goal of a partial rollout to EC2 instances.
You can do the same with other tools as well, but CodeDeploy in an AWS environment will take the least amount of work.
CodeDeploy also allows you to deploy from Git. Please refer to the following AWS documentation:
http://docs.aws.amazon.com/codedeploy/latest/userguide/github-integ-tutorial.html
The pitfall with CodeDeploy is that the agent that runs on the instances has been tested on, and is supported for, only a limited number of operating systems (http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-run-agent.html#how-to-run-agent-supported-oses).
Also, if you decide to move away from AWS in the future, you will have to redo the deployment-related work.
The CodeDeploy service only charges you for the underlying AWS resources.
Please find the link to pricing documentation below:
https://aws.amazon.com/codedeploy/pricing/