They both seem to be recommended CI/CD tools within Google Cloud.. but with similar functionality. Would I use one over the other? Maybe together?
Cloud Build seems to be the de facto tool. While Cloud Deploy says that it can do "pipeline and promotion management."
Both of them are designed as serverless, meaning you don't have to manage the underlying infrastructure of your builds and defining delivery pipelines in a YAML configuration file. However, Cloud Deploy needs a configuration for Skaffold, which Google Cloud Deploy needs in order to perform render and deploy operations.
And according to this documentation,
Google Cloud Deploy is a service that automates delivery of your applications to a series of target environments in a defined sequence.
Cloud Deploy is an opinionated, continuous delivery system currently supporting Kubernetes clusters and Anthos. It picks up after the CI process has completed (i.e. the artifact/images are built) and is responsible for delivering the software to production via a progression sequence defined in a delivery pipeline.
While Google Cloud Build is a service that executes your builds on Google Cloud.
Cloud Build (GCB) is Google's cloud Continuous Integration/Continuous Development (CICD) solution. And takes users code stored in Cloud Source Repositories, GitHub, Bitbucket, or other solutions; builds it; runs tests; and saves the results to an artifact repository like Google Container Registry, Artifactory, or a Google Cloud Storage bucket. Also, supports complex builds with multiple steps, for example, testing and deployments. If you want to add your CI pipeline, it's as easy as adding an additional step to it. Take your Artifacts, either built or stored locally or at your destination and easily deploy it to many services with a deployment strategy of you choice.
Provide more details in order to choose between the two services and it will still depend on your use case. However, their objectives might help to make it easier for you to choose between the two services.
Cloud Build's mission is to help GCP users build better software
faster, more securely by providing a CI/CD workflow automation product for
developer teams and other GCP services.
Cloud Deploy's mission is to make it easier to set up and run continuous
software delivery to a Google Kubernetes Engine environment.
In addtion, refer to this documentation for price information, Cloud Build pricing and Cloud Deploy pricing.
The cloud workflow doesn't come with a scheduling feature. Apart from that, what are all the differences between these two services in terms of features? In which use case should we prefer the workflow over composer or vice versa?
There are some key differences to consider when choosing between the two solutions :
A Composer instance needs to be in a running state to trigger DAGs and you'll also need to size your Cloud Composer instance based on your usage, You do not need to do this in Cloud Workflows as it is a Serverless service and you pay for anytime a workflow is triggered
Another key difference is that Cloud Composer is really convenient for writing and orchestrating data pipelines because of it's internal scheduler and also because of the provided Operators, You can interact with any Data services inside of GCP.
However, Cloud Workflows interacts with Cloud Functions, wich is a task that Composer cannot do really well.
Both Composer and Workflows support orchestrating multiple services and can handle long running workflows. Despite there being some overlap in the capabilities of these products, each has differentiators that make them well suited to particular use cases.
Composer is most commonly used for orchestrating the transformation of data as part of ELT or data engineering. Workflows, in contrast, is focused on the orchestration of HTTP-based services built with Cloud Functions, Cloud Run, or external APIs.
Composer is designed for orchestrating batch workloads that can handle a delay of a few seconds between task executions. It wouldn’t be suitable if low latency was required in between tasks, whereas Workflows is designed for latency sensitive use cases.
While you don’t have to worry about maintaining Airflow deployments in Composer, you do need to specify how many workers you need for a given Composer environment. Workflows is completely serverless; there is no infrastructure to manage or scale.
For further information refer to this google blog article and this one.
I have been reading for few weeks for different approaches for ML in production. I decided to test Kubeflow and I decided to test it on GCP. I started to deploy Kubeflow on GCP using the guiidline on official kubeflow website(here https://www.kubeflow.org/docs/gke/). I run into a lot of issues and it was quit hard to fix them. I started to look into a better approach and I noticed that GCP AI platform now offers deploying Kubeflow pipelines with just few simple steps. (https://cloud.google.com/ai-platform/pipelines/docs/connecting-with-sdk.)
After easily setting up this, I had few question and doubts. If it is this much easy to set up and deploy Kubeflow why we have to go through such a cumbersome way as suggested in the kubeflow official website. Since creating Kubeflow pipeline on GCP means basically I am deploying Kubeflow on GCP, does that mean I can access other Kubeflow services like Katib?
Elnaz
The kubeflow official website provides the required information in detailed way and where as in google cloud it directly provides you the services with possible ready solution.
Referring to will fuks document it says YES, you can able to access katlib on GCP
The GCP managed service of Kubeflow Pipelines is just that. You won't have a lot of access to the cluster to make changes. I've deployed a Kubeflow cluster that can still reach the AI Hub as well.
I believe they have plans to expand what can be deployed in the AI Platform but if you don't want to wait, the self-deployment is possible (but not easy) IMO.
I am trying to build an app where the user is able to upload a file to cloud storage. This would then trigger a model training process (and predicting later on). Initially I though I could do this with cloud functions/pubsub and cloudml, but it seems that cloud functions are not able to trigger gsutil commands which is needed for cloudml.
Is my only option to enable cloud-composer and attach GPUs to a kubernetes node and create a cloud function that triggers a dag to boot up a pod on the node with GPUs and mounting the bucket with the data? Seems a bit excessive but I can't think of another way currently.
You're correct. As for now, there's no possibility to execute gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG.
Another idea that comes to my mind is to interact with GCP Virtual Machines within Cloud Composer through the Python operator by using the Compute Engine Pyhton API. You can find more information in automating infrastructure and taking a deep technical dive into the core features of Cloud Composer here.
Another solution that you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. Please, have a look on Codelabs tutorial.
I hope you find the above pieces of information useful.
I am using Google Data Flow to implement an ETL data ware house solution.
Looking into google cloud offering, it seems DataProc can also do the same thing.
It also seems DataProc is little bit cheaper than DataFlow.
Does anybody know the pros / cons of DataFlow over DataProc
Why does google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exist can be found in the Google Cloud Platform Big Data Solutions Articles
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. Balancing work, or Scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works -- In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
This flowchart from the google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available in the below link
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So use cases are ETL (extract, transfer, load) job between various data sources / data bases. For example load big files from Cloud Storage into BigQuery.
Streaming works based on subscription to PubSub topic, so you can listen to real time events (for example from some IoT devices) and then further process.
Interesting concrete use case of Dataflow is Dataprep. Dataprep is cloud tool on GCP used for exploring, cleaning, wrangling (large) datasets. When you define actions you want to do with your data (like formatting, joining etc), job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30
minutes to create Spark and Hadoop clusters on-premises or through
IaaS providers. By comparison, Dataproc clusters are quick to start,
scale, and shutdown, with each of these operations taking 90 seconds
or less, on average. This means you can spend less time waiting for
clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google
Cloud Platform services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than
just a Spark or Hadoop cluster—you have a complete data platform.
For example, you can use Dataproc to effortlessly ETL terabytes of
raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an
administrator or special software. You can easily interact with
clusters and Spark or Hadoop jobs through the Google Cloud Console,
the Cloud SDK, or the Dataproc REST API. When you're done with a
cluster, you can simply turn it off, so you don’t spend money on an
idle cluster. You won’t need to worry about losing data, because
Dataproc is integrated with Cloud Storage, BigQuery, and Cloud
Bigtable.
Simple and familiar — You don’t need to learn new tools or APIs to
use Dataproc, making it easy to move existing projects into Dataproc
without redevelopment. Spark, Hadoop, Pig, and Hive are frequently
updated, so you can be productive faster.
If you want to migrate from your existing Hadoop/Spark cluster to the cloud, or take advantage of so many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large scale data processing and take their latest improvements for free, choose DataFlow.
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparision:
DataProc
Dataflow
If you want to calculate and compare cost of more GCP resources, please refer this url https://cloud.google.com/products/calculator/
One of the other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Manage datasets of unpredictable size
see
Cloud Dataflow
Is a serverless data processing service that runs jobs written using
the Apache Beam libraries.
When you run a job on Cloud Dataflow it gets operated like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scale the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming Jobs. So use cases are ETL (extract, transfer, load) jobs between various data sources/databases.
For example, load big files from Cloud Storage into Big Query.
Streaming works based on subscription to Pub-Sub topic, so you can listen to real-time events (for example from some IoT devices) and then further process the data.
An interesting concrete use case of Dataflow is Data prep.
Data prep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining etc.), the job run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Data proc
Is a managed Spark and Hadoop service that lets you take advantage of
open-source data tools for batch processing, querying, streaming, and
machine learning.
Data proc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.