I am using Google Cloud Dataflow to implement an ETL data warehouse solution.
Looking into the Google Cloud offering, it seems Dataproc can also do the same thing.
It also seems Dataproc is a little cheaper than Dataflow.
Does anybody know the pros/cons of Dataflow over Dataproc?
Why does Google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions articles
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. balancing work, or scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works (see the sketch after this list) -- in comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
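To illustrate the Beam point, here is a minimal sketch of a Beam pipeline (the project and bucket names are placeholders, not from the question): the pipeline logic is expressed against the Beam API only, and switching the runner option is all it takes to move between local execution and Cloud Dataflow.

```python
# Minimal sketch, assuming placeholder project/bucket names.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        runner="DataflowRunner",          # use "DirectRunner" to test locally
        project="my-project",             # placeholder project id
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | "SplitWords" >> beam.FlatMap(lambda line: line.split(","))
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerWord" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
        )


if __name__ == "__main__":
    run()
```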
Here are three main points to consider while trying to choose between Dataproc and Dataflow:
Provisioning
Dataproc - manual provisioning of clusters (see the provisioning sketch after these points)
Dataflow - serverless; automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e., the same pipeline code can run seamlessly on Dataflow, Spark, or Flink.
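To make the provisioning contrast concrete, here is a rough sketch of creating a Dataproc cluster with the google-cloud-dataproc Python client library; the project, region, cluster name, and machine types are placeholder values. Dataflow has no equivalent step, since workers are provisioned for you when you submit a job.

```python
# Rough sketch of Dataproc's "manual provisioning" step (placeholder values).
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder
region = "us-central1"         # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is up and running
```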
This flowchart from the Google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available at the link below:
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So use cases are ETL (extract, transform, load) jobs between various data sources/databases, for example loading big files from Cloud Storage into BigQuery.
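As an illustration, a minimal Beam sketch of that Cloud Storage to BigQuery load might look like this; the bucket, dataset, and two-column schema are assumptions, not from the answer.

```python
# Minimal batch ETL sketch: read CSV files from Cloud Storage into BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Assumes a simple two-column CSV; adapt to your real schema.
    name, value = line.split(",")
    return {"name": name, "value": int(value)}


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/exports/*.csv")
        | "Parse" >> beam.Map(parse_line)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="name:STRING,value:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```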
Streaming works based on a subscription to a Pub/Sub topic, so you can listen to real-time events (for example, from IoT devices) and then process them further.
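A streaming variant reading from a Pub/Sub subscription could look roughly like this; the subscription name and the per-device CSV message format are assumptions.

```python
# Streaming sketch: consume Pub/Sub messages and count events per device
# in one-minute windows (subscription name and message format assumed).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/iot-events-sub")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByDevice" >> beam.Map(lambda line: (line.split(",")[0], 1))
        | "CountPerDevice" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)  # replace with a real sink, e.g. BigQuery
    )
```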
An interesting concrete use case of Dataflow is Dataprep. Dataprep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining, etc.), a job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Dataproc clusters are quick to start, scale, and shutdown, with each of these operations taking 90 seconds or less, on average. This means you can spend less time waiting for clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google Cloud Platform services, such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than just a Spark or Hadoop cluster—you have a complete data platform. For example, you can use Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an administrator or special software. You can easily interact with clusters and Spark or Hadoop jobs through the Google Cloud Console, the Cloud SDK, or the Dataproc REST API. When you're done with a cluster, you can simply turn it off, so you don't spend money on an idle cluster. You won't need to worry about losing data, because Dataproc is integrated with Cloud Storage, BigQuery, and Cloud Bigtable.
Simple and familiar — You don't need to learn new tools or APIs to use Dataproc, making it easy to move existing projects into Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so you can be productive faster.
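As an illustration of the "Managed" point in the quote above, submitting a PySpark job to an existing Dataproc cluster through the Python client library can look roughly like this; the project, region, cluster, and file names are placeholders.

```python
# Rough sketch of submitting a PySpark job to an existing Dataproc cluster.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "etl-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl_job.py"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # waits for the Spark job to finish
print(f"Job finished, driver output at: {response.driver_output_resource_uri}")
```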
If you want to migrate your existing Hadoop/Spark cluster to the cloud, or take advantage of the many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large-scale data processing and want to benefit from their latest improvements for free, choose Dataflow.
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparison: Dataproc vs. Dataflow (see each product's pricing page for current rates).
If you want to calculate and compare the cost of more GCP resources, please refer to this URL: https://cloud.google.com/products/calculator/
One other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Managing datasets of unpredictable size
Cloud Dataflow
Is a serverless data processing service that runs jobs written using the Apache Beam libraries.
When you run a job on Cloud Dataflow it gets operated like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming jobs, so use cases are ETL (extract, transform, load) jobs between various data sources/databases.
For example, load big files from Cloud Storage into BigQuery.
Streaming works based on a subscription to a Pub/Sub topic, so you can listen to real-time events (for example, from IoT devices) and then process the data further.
An interesting concrete use case of Dataflow is Dataprep.
Dataprep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining, etc.), the job runs under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Dataproc
Is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Related
Cloud Workflows doesn't come with a scheduling feature. Apart from that, what are the differences between these two services in terms of features? In which use cases should we prefer Workflows over Composer, or vice versa?
There are some key differences to consider when choosing between the two solutions:
A Composer instance needs to be in a running state to trigger DAGs, and you also need to size your Cloud Composer instance based on your usage. You do not need to do this with Cloud Workflows, as it is a serverless service and you pay only when a workflow is triggered.
Another key difference is that Cloud Composer is really convenient for writing and orchestrating data pipelines because of its internal scheduler and the provided operators, with which you can interact with any data service inside GCP (see the DAG sketch at the end of this answer).
Cloud Workflows, on the other hand, interacts well with Cloud Functions, which is something Composer does not do as well.
Both Composer and Workflows support orchestrating multiple services and can handle long running workflows. Despite there being some overlap in the capabilities of these products, each has differentiators that make them well suited to particular use cases.
Composer is most commonly used for orchestrating the transformation of data as part of ELT or data engineering. Workflows, in contrast, is focused on the orchestration of HTTP-based services built with Cloud Functions, Cloud Run, or external APIs.
Composer is designed for orchestrating batch workloads that can handle a delay of a few seconds between task executions. It wouldn’t be suitable if low latency was required in between tasks, whereas Workflows is designed for latency sensitive use cases.
While you don’t have to worry about maintaining Airflow deployments in Composer, you do need to specify how many workers you need for a given Composer environment. Workflows is completely serverless; there is no infrastructure to manage or scale.
For further information refer to this google blog article and this one.
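As a small illustration of the Composer side mentioned above, a minimal Airflow DAG of the kind you would deploy to a Composer environment might look like this; the DAG id, project, dataset, and table names are illustrative placeholders, not from the answer.

```python
# Minimal Airflow DAG sketch for a Cloud Composer environment (placeholder names).
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # Composer's built-in scheduler handles this
    catchup=False,
) as dag:
    transform_events = BigQueryInsertJobOperator(
        task_id="transform_events",
        configuration={
            "query": {
                "query": (
                    "SELECT user_id, COUNT(*) AS events "
                    "FROM `my-project.my_dataset.raw_events` "
                    "GROUP BY user_id"
                ),
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "daily_user_events",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
```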
I am trying to build an app where the user is able to upload a file to Cloud Storage. This would then trigger a model training process (and prediction later on). Initially I thought I could do this with Cloud Functions/Pub/Sub and Cloud ML, but it seems that Cloud Functions are not able to trigger gsutil commands, which are needed for Cloud ML.
Is my only option to enable Cloud Composer, attach GPUs to a Kubernetes node, and create a Cloud Function that triggers a DAG to boot up a pod on the node with GPUs and mount the bucket with the data? Seems a bit excessive, but I can't think of another way currently.
You're correct. As for now, there's no possibility to execute gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG.
Another idea that comes to my mind is to interact with GCP virtual machines within Cloud Composer through the PythonOperator by using the Compute Engine Python API. You can find more information on automating infrastructure, and a deep technical dive into the core features of Cloud Composer, here.
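For illustration, here is a rough sketch of driving Compute Engine from Python. The same call could run inside a Composer PythonOperator or, as shown here, in a Cloud Function triggered by a Cloud Storage upload. The project, zone, and instance names are placeholders, and the VM is assumed to already exist with GPUs attached and a startup script that pulls the uploaded data and trains.

```python
# Hypothetical sketch: a background Cloud Function for Cloud Storage
# "finalize" events that starts a pre-built training VM (placeholder names).
import googleapiclient.discovery

PROJECT = "my-project"
ZONE = "us-central1-a"
INSTANCE = "gpu-training-vm"


def on_upload(event, context):
    """Triggered by google.storage.object.finalize on the data bucket."""
    print(f"New training file: gs://{event['bucket']}/{event['name']}")
    compute = googleapiclient.discovery.build("compute", "v1")
    compute.instances().start(project=PROJECT, zone=ZONE, instance=INSTANCE).execute()
```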
Another solution that you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes simpler. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. Please have a look at the Codelabs tutorial.
I hope you find the above pieces of information useful.
We have a Python data pipeline that runs on our server. It grabs data from various sources, aggregates it, and writes the results to SQLite databases. The daily runtime is just 1 hour and network usage is maybe 100 MB at most. What are our options to migrate this to Google Cloud? We would like more reliable scheduling, a cloud database, better data analytics options for the data (powerful dashboards and visualization), and easy development. Should we go with serverless or a server? Is the pricing free for such low usage?
For a lift-and-shift option, you can run your Python workload on Google Compute Engine, which is a virtual machine. But to make the best use of Google Cloud, I suggest you:
Spin up a Google Compute Engine
Run your Python Workload
Save your data in Google BigQuery (see the sketch below)
Shutdown your VM
Schedule it using the Cloud Scheduler
Here is a tutorial from Google on how to do it:
https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule
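As a sketch of the BigQuery step, loading a daily extract from Cloud Storage into BigQuery (instead of writing to SQLite) can be done with the google-cloud-bigquery client; the project, bucket, and table names are placeholders.

```python
# Sketch: load a daily extract from Cloud Storage into BigQuery (placeholder names).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/daily_extract.csv",
    "my_dataset.aggregated_results",
    job_config=job_config,
)
load_job.result()  # waits for the load job to complete
```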
GCP on a shoestring budget:
Google gives you $300 to spend in the first 12 months, and there are some services which give you free usage per month: https://cloud.google.com/free/docs/gcp-free-tier
For example:
You can use BigQuery free of charge for 1 TB of querying and 10 GB of storage each month.
Here's an excellent video on making the most of out of GCP Free tiers: https://www.youtube.com/watch?v=N2OG1w6bGFo&t=818s
Approach to migration:
When moving to cloud you typically choose from one of the following approaches:
1) Rehost (lift-and-shift) - no modification to code or architecture
2) Replatform - with minor modifications to code
3) Refactor - with modifications to code and architecture
Obviously you'll get the most cloud benefits (i.e. performance and cost efficiency) with option 3, but it will take longer, whereas option 1 is quicker with the least amount of benefits.
You can use Cloud Composer for scheduling, which is effectively a managed Apache Airflow service. It will allow you to manage batch, streaming, and scheduled tasks.
Visualisation can be done through Google Data Studio, which can use BigQuery as a data source. Data Studio is free, but querying BigQuery will be chargeable.
BigQuery for data-analytics.
For the database you can migrate to managed Cloud SQL, which supports PostgreSQL and MySQL database types.
Typically, serverless options are likely to be cost-effective if you can accommodate them, which will obviously fall under option 3) refactor.
There are several requirements to take care of before migrating, for example: are all your data sources reachable from a cloud platform?
Regarding storage and analytics, BigQuery is an amazing product and works very well with denormalized data. Can your data be denormalized? Does your job require transactional capabilities?
Does your data need to be queried from a website? BigQuery is powerful for analytics, but there is about 1 s of query warm-up, which is not acceptable for a website. It's not like Cloud SQL (MySQL or PostgreSQL), whose response times are in milliseconds, but which is limited to a few TB (and getting good response times with TBs in Cloud SQL is a challenge!).
If it's only for dashboarding, you can use Data Studio; it's free, and you can cache your BigQuery data with BI Engine for more responsive dashboards.
If all of these requirements work for you, and if your data sources are publicly accessible on the internet (I mean no VPN required to access them), I can propose a fully serverless solution. This solution is a creative use of a Google Cloud service, and it works well!
I wrote an article on a similar use case that you can take inspiration from. Cloud Build allows you to run CI pipelines, and you can use a custom builder: a container that you build yourself and that you can run on Cloud Build.
In short:
Package your current workflow in a container compliant with Cloud Build, and write your Cloud Build job (don't forget to set the right timeout value)
Create a Cloud Function or Cloud Run service (if you prefer containers) that runs Cloud Build, optionally with some substitution variables to customize your run (see the sketch after these steps)
Set up Cloud Scheduler to trigger your Cloud Run service or Cloud Function every day
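As a rough sketch of the second step, an HTTP Cloud Function that starts a Cloud Build running a custom builder container could look like this; the builder image name, project id, and arguments are placeholders, not from the article.

```python
# Sketch: HTTP Cloud Function that kicks off a Cloud Build running a
# custom builder container (image, project, and args are placeholders).
from google.cloud.devtools import cloudbuild_v1
from google.protobuf import duration_pb2


def trigger_pipeline(request):
    client = cloudbuild_v1.CloudBuildClient()
    build = cloudbuild_v1.Build(
        steps=[{
            "name": "gcr.io/my-project/my-etl-builder",   # the custom builder
            "args": ["--date", request.args.get("date", "today")],
        }],
        timeout=duration_pb2.Duration(seconds=3600),      # set a timeout that fits the workload
    )
    client.create_build(project_id="my-project", build=build)
    return "Build submitted"
```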
Outside of BigQuery costs, this pattern costs 0! You have 120 free build minutes per day (per billing account) with Cloud Build, Cloud Scheduler is free (up to 3 schedulers per billing account), and Cloud Functions/Cloud Run have a huge free tier (here they only run for some milliseconds).
Streaming to BigQuery is not free but affordable: half a cent for 100 MB!
Note: Cloud Run will, one day, offer long-running jobs. Then you could reuse your Cloud Build container in Cloud Run when that feature is released. Today, I propose a workaround for this.
In Google Cloud Platform, BigQuery is a great serverless choice: you can start small and grow over time.
With partitioning, clustering, and other improvements (see the sketch below), we've been successfully using it behind a UI (4-8k queries per day) with most queries completing in under a second.
You can also get all your data seamlessly ingested from various sources, with millions of files per day, into one or many tables with BqTail.
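For reference, the partitioning and clustering mentioned above can be set up with the BigQuery Python client roughly like this; the project, dataset, and field names are placeholders.

```python
# Sketch: create a partitioned and clustered BigQuery table (placeholder names).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",                      # partition by event time
)
table.clustering_fields = ["user_id"]      # cluster within each partition

client.create_table(table)
```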
I want to build some neural network models for NLP and recommendation applications. The framework I want to use is TensorFlow. I plan to train these models and make predictions on Amazon Web Services. The application will most likely involve distributed computing.
I am wondering what are the pros and cons of SageMaker and EMR for TensorFlow applications?
They both have TensorFlow integrated.
In general terms, they serve different purposes.
EMR is when you need to process massive amounts of data and heavily rely on Spark, Hadoop, and MapReduce (EMR = Elastic MapReduce). Essentially, if your data is in large enough volume to make use of the efficiencies of Spark, Hadoop, Hive, HDFS, HBase and Pig stack then go with EMR.
EMR Pros:
Generally, low cost compared to EC2 instances
As the name suggests, it's elastic, meaning you can provision what you need when you need it
Hive, Pig, and HBase out of the box
EMR Cons:
You need a very specific use case to truly benefit from all the offerings in EMR. Most don't take advantage of its entire offering
SageMaker is an attempt to make machine learning easier and distributed. The marketplace provides out-of-the-box algorithms and models for quick use. It's a great service if you conform to the workflows it enforces: creating training jobs and deploying inference endpoints (a minimal sketch follows the lists below).
SageMaker Pros:
Easy to get up and running with Notebooks
Rich marketplace to quickly try existing models
Many different example notebooks for popular algorithms
Predefined kernels that minimize configuration
Easy to deploy models
Allows you to distribute inference compute by deploying endpoints
SageMaker Cons:
Expensive!
Enforces a certain workflow making it hard to be fully custom
Expensive!
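To make that workflow concrete, a minimal training-and-deploy sketch with the SageMaker Python SDK might look like this; the training script, S3 paths, framework version, and instance types are placeholders, not a definitive setup.

```python
# Minimal SageMaker sketch: training job + hosted endpoint (placeholder values).
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # IAM role used by the training job

estimator = TensorFlow(
    entry_point="train.py",            # your TensorFlow training script
    role=role,
    instance_count=2,                  # >1 for distributed training
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)
estimator.fit({"training": "s3://my-bucket/nlp-dataset/"})

# Deploy the trained model behind a managed real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```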
From AWS documentation:
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
(...) Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.
Conclusion:
If you want to deploy AI models, just use AWS SageMaker.
To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
Data preparation/transformation/cleaning tasks can all be seen as ETL processes, implementable with any of the products you mention. This older answer covers the basics of the Dataflow vs Dataproc question and includes this link which summarises what you should keep in mind when choosing between these three.
In brief, you should consider familiarity (have you already worked with Hadoop-ecosystem tools? with the Beam programming model? would you rather work via a UI?) and the desired level of control (Dataproc allows more control over the cluster, while Dataflow and Dataprep are fully managed services).
More good reads:
Comparing Cloud Dataflow autoscaling to Spark and Hadoop
Cleaning data in a data processing pipeline with Dataflow
Both Dataproc and Dataflow are data processing services on Google Cloud. What is common to both systems is that they can each process batch or streaming data. Both also have workflow templates that make them easier to use.
But below are the distinguishing features of the two:
Dataproc is designed to run on clusters, which makes it compatible with Apache Hadoop, Hive, and Spark. It is significantly faster at creating clusters and can autoscale clusters without interrupting a running job.
Dataflow is better if your data has no ties to Spark or Hadoop. It does not require you to manage clusters; instead, it is based on parallel data processing, so data is split up and processed on multiple workers to reduce processing time.
An important related note: Dataprep provides data cleaning and automatically identifies anomalies in the data. It is integrated with Cloud Storage, Bigtable, and BigQuery.