GCP Dataflow, Dataproc, Bigtable - google-cloud-platform

I am selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. I want to minimize service costs. I also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should I do?
A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.

C!
Use Dataflow on pubsub to transform your data and let it write rows into BQ. You can monitor the ETL pipeline straight from data flow and use stackdriver on top. Stackdriver can also be used to start events etc.
Use autoscaling to minimize the number of manual actions. Basically when this solution is setup correctly, it doesn't need work at all.

Related

Triggering an alert when multiple dataflow jobs run in parallel in GCP

I am using google cloud dataflow to execute some resource intensive dataflow jobs. And at a given time, my system must execute no more than 2 jobs in parallel.
Since each job is quite resource intensive, I am looking for a way to trigger an alert when more than 2 dataflow jobs are running.
I tried implementing a custom_count which increments after the start of each job. But custom_couter only display after the job has executed. And it might be too late to trigger an alert by then.
You could modify the quota dataflow.googleapis.com/job_count of the project to be limited to 1, and no two jobs could run parallel in that project. The quota is at the project level, it would not affect other projects.
Another option is to use an GCP monitoring system that is observing the running Dataflow jobs. You can e.g. use Elastic Cloud (available via Marketplace) to load all relevant Metrics and Logs. Elastic can visualize and alert on every state you are interested in.
I found this terraform project very helpful in order to get started with that approach.

AWS Batch vs Spring Batch

I have been planning to migrate my Batch processing from Spring Batch to AWS Batch. Can someone give me the reasons to Choose AWS Batch over Spring Batch?
Whilst both these things will play a role in orchestrating your batch workloads a key difference is that AWS Batch will also manage the infrastructure you need to run the jobs/pipeline. AWS Batch lets you to tailor the underlying cloud instances, or specifcy a broad array of instance types that will work for you. And it'll let you make trade-offs: you can task it with managing a bag of EC2 Spot instances for you (for example), and then ask it to optimize time-to-execution over price (or prefer price to speed).
(For full disclosure, I work for the engineering team that builds AWS Batch).
I believe both work at different levels .Spring batch provide framework that reduce the boiler plate code that you need in order to write a batch job.For eg. saving the state of job in Job repository that provide restartability.
On the contrary, AWS batch is an infrastructure framework that helps in managing infra and set some environment variable that help differentiate master node from slave node.
In my opinion both can work together to write a full fledged cost effective batch job at scale on AWS cloud.
Aws Batch is full blown SaaS solution for batch processing,
It has inbuilt
Queue with priority options
Runtime, which can be self managed and auto managed
Job repo, docker images for the job definitions
Monitoring , dashboards, integration with other AWS services like SNS and from SNS to where ever you want
On the other hand, batch is a framework which would still need some of your efforts to manage it all. like employing a queue, scaling is your headache, monitoring etc
My take is , if your company or app is on AWS , go for AWS batch , you will save months of time and get to scalability to a million jobs per day in no time . If you are on-perm or private go for spring batch with some research

Is there anyway to auto scale-out based on schedule not policy for dataproc in GCP (google cloud platform)

It's straight forward.
The system i am developing will be provisioned in dataproc (google cloud platform)
And due to some business feature about our system, we can figure expected data processed in the future.
So, i just want to do scale-out dataproc based on it.
Is there any idea to scale-out dataproc by scheduler or API implemented by our side ??
Thanks in advance.
You can update the size of a Dataproc cluster using a gcloud dataproc clusters update or an equivalent API call in one of the Dataproc client libraries, see the documentation here. It also supports setting an auto scaling policy, where the cluster can grow and shrink based on the load on the cluster.
You should use combination of Cloud scheduler and manual scale out method for this situation.
Create Cloud scheduler job runs manual scale out API call like below.
You must know about cron syntax to use cron scheduling for your scailing job.
In this scenario, You can only specify number of nodes your cluster uses, not number of incremental. This is a limitation of this solution. Below is example of API call.
PATCH /v1/projects/project-id/regions/us-central1/clusters/example-cluster?updateMask=config.worker_config.num_instances,config.secondary_worker_config.num_instances
{
"config": {
"workerConfig": {
"numInstances": 4
},
"secondaryWorkerConfig": {
"numInstances": 2
}
},
"labels": null
}
For a detailed description of Dataproc API, See here.

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
Data preparation/transformation/cleaning tasks can all be seen as ETL processes, implementable with any of the products you mention. This older answer covers the basics of the Dataflow vs Dataproc question and includes this link which summarises what you should keep in mind when choosing between these three.
In brief, you should consider familiarity (have you already worked with Hadoop-ecosystem tools? the beam programming model? would you rather work via a UI?) and desired level of control (dataproc allows more control over the cluster, dataflow and dataprep are fully managed services).
More good reads:
Comparing Cloud Dataflow autoscaling to Spark and Hadoop
Cleaning data in a data processing pipeline with Dataflow
Both Dataproc and Dataflow are data processing services on google cloud. What is common about both systems is they can both process batch or streaming data. Both also have workflow templates that are easier to use.
But below are the distinguishing features about the two
Dataproc is designed to run on clusters. Which makes it compatible with Apache Hadoop, hive and spark. It is significantly faster at creating clusters and can auto scale clusters without interruption of running job.
Dataflow is better if your data has no implementation with spark or Hadoop. It does not run on clusters, instead it is based on parallel data processing. As such data is split processed on multiple microprocessors to reduce processing time.
an Important note about Dataproc is, Dataprep provides data cleaning and automatically identifies anomalies in the data. It is integrated with Cloud Storage, BigTable and and BigQuery

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Data Flow to implement an ETL data ware house solution.
Looking into google cloud offering, it seems DataProc can also do the same thing.
It also seems DataProc is little bit cheaper than DataFlow.
Does anybody know the pros / cons of DataFlow over DataProc
Why does google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exist can be found in the Google Cloud Platform Big Data Solutions Articles
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. Balancing work, or Scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works -- In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
This flowchart from the google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available in the below link
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So use cases are ETL (extract, transfer, load) job between various data sources / data bases. For example load big files from Cloud Storage into BigQuery.
Streaming works based on subscription to PubSub topic, so you can listen to real time events (for example from some IoT devices) and then further process.
Interesting concrete use case of Dataflow is Dataprep. Dataprep is cloud tool on GCP used for exploring, cleaning, wrangling (large) datasets. When you define actions you want to do with your data (like formatting, joining etc), job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30
minutes to create Spark and Hadoop clusters on-premises or through
IaaS providers. By comparison, Dataproc clusters are quick to start,
scale, and shutdown, with each of these operations taking 90 seconds
or less, on average. This means you can spend less time waiting for
clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google
Cloud Platform services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than
just a Spark or Hadoop cluster—you have a complete data platform.
For example, you can use Dataproc to effortlessly ETL terabytes of
raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an
administrator or special software. You can easily interact with
clusters and Spark or Hadoop jobs through the Google Cloud Console,
the Cloud SDK, or the Dataproc REST API. When you're done with a
cluster, you can simply turn it off, so you don’t spend money on an
idle cluster. You won’t need to worry about losing data, because
Dataproc is integrated with Cloud Storage, BigQuery, and Cloud
Bigtable.
Simple and familiar — You don’t need to learn new tools or APIs to
use Dataproc, making it easy to move existing projects into Dataproc
without redevelopment. Spark, Hadoop, Pig, and Hive are frequently
updated, so you can be productive faster.
If you want to migrate from your existing Hadoop/Spark cluster to the cloud, or take advantage of so many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large scale data processing and take their latest improvements for free, choose DataFlow.
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparision:
DataProc
Dataflow
If you want to calculate and compare cost of more GCP resources, please refer this url https://cloud.google.com/products/calculator/
One of the other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Manage datasets of unpredictable size
see
Cloud Dataflow
Is a serverless data processing service that runs jobs written using
the Apache Beam libraries.
When you run a job on Cloud Dataflow it gets operated like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scale the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming Jobs. So use cases are ETL (extract, transfer, load) jobs between various data sources/databases.
For example, load big files from Cloud Storage into Big Query.
Streaming works based on subscription to Pub-Sub topic, so you can listen to real-time events (for example from some IoT devices) and then further process the data.
An interesting concrete use case of Dataflow is Data prep.
Data prep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining etc.), the job run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Data proc
Is a managed Spark and Hadoop service that lets you take advantage of
open-source data tools for batch processing, querying, streaming, and
machine learning.
Data proc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.