How to migrate local data workflow to Google Cloud? - google-cloud-platform

We have a Python data pipeline that run in our server. It grab data from various sources, aggregate and write data to sqlite databases. The daily runtime is just 1 hours and network maybe 100mb at most. What are our options to migrate this to Google Cloud? We would like to have more reliable scheduling, cloud database and better data analytics options from the data (powerful dashboard and visualization) and easy development. Should we go with serverless or server? Is the pricing free for such low usage?

for a lift and shift option, you can run your python workload on the Google Compute Engine, which is a virtual machine, but for best use of Google Cloud, i suggest you to:
Spin up a Google Compute Engine
Run your Python Workload
Save your data on Google Big Query
Shutdown your VM
Schedule it using the Cloud Scheduler
Here is a tutorial from Google on how to do it:
https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule

GCP on a shoestring budget:
Google Gives you $300 to spend for first 12 months and there are some services which gives you free usage per month: https://cloud.google.com/free/docs/gcp-free-tier
For example:
You can use BigQuery free-of-charge 1 TB of querying per month and 10 GB of storage each month.
Here's an excellent video on making the most of out of GCP Free tiers: https://www.youtube.com/watch?v=N2OG1w6bGFo&t=818s
Approach to migration:
When moving to cloud you typically choose from one of the following approaches:
1) Rehost (lift-and-shift) no modification to code or architecture
2) Replatform - with minor modifications to code
3) Refactor - with modifications to code and architecture
Obviously you'll get the most cloud benefits (i.e. performance and cost efficiency) with option 3 but it will take longer whereas option 1 is quicker with least amount of benefits.
You can use Cloud Composer for scheduling which is effectively managed apache airflow service. It will allow you to manage batch, streaming and schedule tasks.
Visualisation can be done through Google Data Studio, which can use BigQuery as datasource. Data Studio is free but querying on BigQuery will be chargeable.
BigQuery for data-analytics.
For database you can migrate to managed CloudSQL which supports Postgres and MySQL database types.
Typically serverless stuff is likely to be cost effective if you can accommodate it which obviously will fall into option 3) refactor.

There is several requirement to take care before migrating like: Is all your datasources are reachable by a cloud platform?
About the storage and analytics, BigQuery is an amazing product, and work very well with denormalized data. Is your data can be denormalized? Is your job required transactional capabilities?
Is your data need to be requested on website? BigQuery is powerful for analytics but there is about 1s of query warming, not acceptable on website. It's not like CLoud SQL (MySQL or PostgreSQL) response time which is in millis, but limited to some TB (and having good response time with TB in Cloud SQL is a challenge!)
If it's only for dashboarding, you can use Datastudio, it's free and you can cache your BigQuery data with BI-Engine for more responsive dashboards.
If all of this requirements works for you, and if your datasources are publicly accessible on internet (I mean no VPN requirement for accessing them), I can propose you a full serverless solution. This solution is a side use of Google Cloud Service, and that works well!
I wrote an article on similar use and you can take inspiration on it. Cloud Build allows you to run CI pipeline, and you can use Custom Builder: it's a container that you build yourself and that you can run on Cloud Build.
By the way,
Package your current workflow in a container compliant with Cloud Build, and write your Cloud Build jobs (don't forget to set the right timeout value)
Create a Cloud Function or Cloud Run (if you prefer container) that run Cloud Build; with optionally some substitutions variable for customizing your run.
Set up a Cloud Scheduler to trigger every day your Cloud Run or Cloud Function
Out of BigQuery cost, this pattern cost 0! you have 120 free minutes per day (per billing account) with Cloud Build, Cloud Scheduler is free (up to 3 scheduler per billing account) and Cloud Function/Cloud Run have a huge free tier (here only run some milliseconds).
Streaming to BigQuery is not free but affordable. Half of cent for 100Mb!!
Note: Cloud Run will propose, a day, long running jobs. By the way you could reuse your Cloud Builder container into Cloud Run when the feature will be released. Today, I propose a workaround of this

In Google Cloud Platform BigQuery is great serverless choice - you can start small and grow over time.
With partitioning, cluster and other improvements, we've been successfully using it with UI (4-8k queries per day) with most queries completing under second.
You can also get all data seamlessly ingested from the various sources with millions of files per day to one or many tables with BqTail

Related

Automated BigTable backups

A BigTable table can be backed up through GCP for up to 30 days.
(https://cloud.google.com/bigtable/docs/backups)
Is it possible to have a custom automatic backup policy?
i.e. trigger automatic backups every X days & keep up to 3 copies at a time.
As mentioned in the comment, the link provides a solution which involves the use of the following GCP Products:
Cloud Scheduler: trigger tasks with a cron-based schedule
Cloud Pub/Sub: pass the message request from Cloud Scheduler to Cloud Functions
Cloud Functions: initiate an operation for creating a Cloud Bigtable backup
Cloud Logging and Monitoring (optional).
Full guide can also be seen on GitHub.
This is a good solution since you have a certain requirement that should be done with client libraries, because Big Table doesn't have an API that sets 3 copies at a time.
For normal use cases however, such as triggering automatic backups every X days, there's another solution such as calling the backups.create directly by creating a Cloud Scheduler with HTTP similar to what's done in this answer.
Here is another thought on a solution:
Instead of using three GCP Products, if you are already using k8s or GKE you can replace all this functionality with a k8s CronJob. Put the BigTable API calls in a container and deploy it on a schedule using the CronJob.
In my opinion, it is a simpler solution if you are already using kubernetes.

What are options to run multiple cloud sql and bigquery queries in a predefined sequence

I need to perform a periodic data purge and data load operations between GCP BigQuery and GCP CloudSql.
This involves running multiple queries in GCP BigQuery and GCP cloud SQL in a predetermined sequence and using the query results from earlier queries in subsequent queries
I am considering a few options as described below
Option1 : Use BigQuery "Scheduled Queries" that uses "Federated query" (https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries). This is good as far job involves trigger "read only" queries in gcp cloud SQL database and run multiple queries in gcp BiqQuery.
However since my operation involves purging data from GCP cloud SQL so federated queries are ruled out.
Options 2: Another option I am considering is to use a gcp compute linux engine VM as my controller for performing operations that span across a gcp cloud SQL mysql database and GCP bigQuery.
I can run a cron job to schedule the operation
As far as running gcp cloud SQL queries from a gcp compute engine VM goes, that is well documented by google in a tutorial "Connect from a VM instance" ( Learn how to connect your Cloud SQL instance from a Compute Engine VM instance)
And , for trigering the gcp Big Query queries, bq command line tool (https://cloud.google.com/bigquery/docs/bq-command-line-tool) provides a good option.
This should allow me to run a sequence of interlaced BigQuery and gcp cloud SQL
Do you see any gotchas in "option 2" described above that I am contemplating.
Is there any other option that you can suggest ? I wonder if cloud dataflow is an appropriate solution for a task that involves running queries across multiple databases ( cloud SQL and BigQuery in this case) and using the intermediate query results in subsequent queries
Thinking about your option 2, I probably would consider DataFlow, Cloud Functions, Cloud Run, as the main backbone for those operations instead of VMs. You might find serverless solutions much cheaper and more reliable depending on your wider context, and "how frequently" the "periodic" process is to run.
On the other hand, if you (or your company) has already relevant experience in "some code" on VM, but no skills, knowledge and experience in the serverless solutions, the "education overhead" can increase the overall cost of this path.
To orchestrate your queries, you need an orchestrator. There are 2 on Google Cloud:
The big one Cloud Composer, entreprise grade (feature and cost!)
The new one: Cloud Workflows. Serverless and easy to use
I recommend you to use Cloud Workflows. You can create your workflow of call that you perform to BigQuery (federated queries or not)
If you need to update/delete data in Cloud SQL, I recommend you to create a proxy Cloud Functions (or Cloud Run) that take in parameter a SQL query and execute it on Cloud SQL.
You can call your Cloud FUnctions (or Cloud Run) with Workflow, it's only a HTTP call to perform.
Workflow has also the feature to handle and to propose some processing capacity on the answer gather from your API Calls. So you can parse the response, iterate on it and call the subsequent step, even by injecting some data coming from previous steps.

DataBricks + Kedro Vs GCP + Kubeflow Vs Server + Kedro + Airflow

We are deploying a data consortium between more than 10 companies. Wi will deploy several machine learning models (in general advanced analytics models) for all the companies and we will administrate all the models. We are looking for a solution that administrates several servers, clusters and data science pipelines. I love kedro, but not sure what is the best option to administrate all while using kedro.
In summary, we are looking for the best solution to administrate several models, tasks and pipelines in different servers and possibly Spark clusters. Our current options are:
AWS as our data warehouse and Databricks for administrating servers, clusters and tasks. I don't feel that the notebooks of databricks are a good solution for building pipelines and to work collaboratively, so I would like to connect kedro to databricks (is it good? is it easy to schedule the run of the kedro pipelines using databricks?)
Using GCP for data warehouse and use kubeflow (iin GCP) for deploying models and the administration and the schedule of the pipelines and the needed resources
Setting up servers from ASW or GCP, install kedro and schedule the pipelines with airflow (I see a big problem administrating 20 servers and 40 pipelines)
I would like to know if someone knows what is the best option between these alternatives, their downsides and advantages, or if there are more possibilities.
I'll try and summarise what I know, but be aware that I've not been part of a KubeFlow project.
Kedro on Databricks
Our approach was to build our project with CI and then execute the pipeline from a notebook. We did not use the kedro recommended approach of using databricks-connect due to the large price difference between Jobs and Interactive Clusters (which are needed for DB-connect). If you're working on several TB's of data, this quickly becomes relevant.
As a DS, this approach may feel natural, as a SWE though it does not. Running pipelines in notebooks feels hacky. It works but it feels non-industrialised. Databricks performs well in automatically spinning up and down clusters & taking care of the runtime for you. So their value add is abstracting IaaS away from you (more on that later).
GCP & "Cloud Native"
Pro: GCP's main selling point is BigQuery. It is an incredibly powerful platform, simply because you can be productive from day 0. I've seen people build entire web API's on top of it. KubeFlow isn't tied to GCP so you could port this somewhere else later on. Kubernetes will also allow you to run anything else you wish on the cluster, API's, streaming, web services, websites, you name it.
Con: Kubernetes is complex. If you have 10+ engineers to run this project long-term, you should be OK. But don't underestimate the complexity of Kubernetes. It is to the cloud what Linux is to the OS world. Think log management, noisy neighbours (one cluster for web APIs + batch spark jobs), multi-cluster management (one cluster per department/project), security, resource access etc.
IaaS server approach
Your last alternative, the manual installation of servers is one I would recommend only if you have a large team, extremely large data and are building a long-term product who's revenue can sustain the large maintenance costs.
The people behind it
How does the talent market look like in your region? If you can hire experienced engineers with GCP knowledge, I'd go for the 2nd solution. GCP is a mature, "native" platform in the sense that it abstracts a lot away for customers. If your market has mainly AWS engineers, that may be a better road to take. If you have a number of kedro engineers, that also has relevance. Note that kedro is agnostic enough to run anywhere. It's really just python code.
Subjective advise:
Having worked mostly on AWS projects and a few GCP projects, I'd go for GCP. I'd use the platform's components (BigQuery, Cloud Run, PubSub, Functions, K8S) as a toolbox to choose from and build an organisation around that. Kedro can run in any of these contexts, as a triggered job by the Scheduler, as a container on Kubernetes or as a ETL pipeline bringing data into (or out of) BigQuery.
While Databricks is "less management" than raw AWS, it's still servers to think about and VPC networking charges to worry over. BigQuery is simply GB queried. Functions are simply invocation count. These high level components will allow you to quickly show value to customers and you only need to go deeper (RaaS -> PaaS -> IaaS) as you scale.
AWS also has these higher level abstractions over IaaS but in general, it appears (to me) that Google's offering is the most mature. Mainly because they have published tools they've been using internally for almost a decade whereas AWS has built new tools for the market. AWS is the king of IaaS though.
Finally, a bit of content, two former colleagues have discussed ML industrialisation frameworks earlier this fall

Can I somehow make google bq work faster if I deploy my app to GCP

While playing with bq API. I noticed that getting big table is rather time consuming from my local machine. That is when the question arose if it possible to make it work quicker if I deploy my project to GCP?
I looked through all the provided documentation from GCP and hadn't found any special treatment of google bq in case my app is deployed on GCP:
https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries
https://cloud.google.com/bigquery/docs/query-overview
Is it correct that google bq doesn't work faster and it can't be made to work faster if it is deployed on GCP?
My concern is that I use http requests to remote host to deliver data from google bq, when GCP can use some other protocols and techniques for data delivery if google bq host is local to it.
Check out the storage API for a faster way to read from tables.
BigQuery supports an API called tabledata.list, which is free, but not very fast at transferring a significant number of rows. BigQuery recently released the new storage API linked above in beta, however, which has a different backend implementation and supports parallel reads for faster transfer of data.

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Data Flow to implement an ETL data ware house solution.
Looking into google cloud offering, it seems DataProc can also do the same thing.
It also seems DataProc is little bit cheaper than DataFlow.
Does anybody know the pros / cons of DataFlow over DataProc
Why does google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exist can be found in the Google Cloud Platform Big Data Solutions Articles
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. Balancing work, or Scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works -- In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
This flowchart from the google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available in the below link
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So use cases are ETL (extract, transfer, load) job between various data sources / data bases. For example load big files from Cloud Storage into BigQuery.
Streaming works based on subscription to PubSub topic, so you can listen to real time events (for example from some IoT devices) and then further process.
Interesting concrete use case of Dataflow is Dataprep. Dataprep is cloud tool on GCP used for exploring, cleaning, wrangling (large) datasets. When you define actions you want to do with your data (like formatting, joining etc), job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30
minutes to create Spark and Hadoop clusters on-premises or through
IaaS providers. By comparison, Dataproc clusters are quick to start,
scale, and shutdown, with each of these operations taking 90 seconds
or less, on average. This means you can spend less time waiting for
clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google
Cloud Platform services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than
just a Spark or Hadoop cluster—you have a complete data platform.
For example, you can use Dataproc to effortlessly ETL terabytes of
raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an
administrator or special software. You can easily interact with
clusters and Spark or Hadoop jobs through the Google Cloud Console,
the Cloud SDK, or the Dataproc REST API. When you're done with a
cluster, you can simply turn it off, so you don’t spend money on an
idle cluster. You won’t need to worry about losing data, because
Dataproc is integrated with Cloud Storage, BigQuery, and Cloud
Bigtable.
Simple and familiar — You don’t need to learn new tools or APIs to
use Dataproc, making it easy to move existing projects into Dataproc
without redevelopment. Spark, Hadoop, Pig, and Hive are frequently
updated, so you can be productive faster.
If you want to migrate from your existing Hadoop/Spark cluster to the cloud, or take advantage of so many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large scale data processing and take their latest improvements for free, choose DataFlow.
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparision:
DataProc
Dataflow
If you want to calculate and compare cost of more GCP resources, please refer this url https://cloud.google.com/products/calculator/
One of the other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Manage datasets of unpredictable size
see
Cloud Dataflow
Is a serverless data processing service that runs jobs written using
the Apache Beam libraries.
When you run a job on Cloud Dataflow it gets operated like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scale the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming Jobs. So use cases are ETL (extract, transfer, load) jobs between various data sources/databases.
For example, load big files from Cloud Storage into Big Query.
Streaming works based on subscription to Pub-Sub topic, so you can listen to real-time events (for example from some IoT devices) and then further process the data.
An interesting concrete use case of Dataflow is Data prep.
Data prep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining etc.), the job run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Data proc
Is a managed Spark and Hadoop service that lets you take advantage of
open-source data tools for batch processing, querying, streaming, and
machine learning.
Data proc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.