HDFS equivalent of snapshotDiff and createSnapshot in Google Cloud - google-cloud-platform

We are migrating our existing jobs from Hadoop to GCP.
I encountered two HDFS functions, createSnapshot and snapshotDiff, in our existing Hadoop code. Do we have their equivalent in GCP?

The table format projects Delta Lake and Apache Iceberg are now available in the latest version of Cloud Dataproc (version 1.5 Preview).
With these table formats, you can now use Dataproc for workloads that need:
ACID transactions
Data versioning (a.k.a. time travel)
Schema enforcement
Schema evolution and more
Data versioning provides snapshots of your data over time. You can look up the data's history and roll back to the data at a certain time or version.
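As a rough illustration (not from the original answer), here is a minimal PySpark sketch of Delta Lake time travel, which is the closest analogue to createSnapshot/snapshotDiff; the table path is a placeholder, and it assumes the Delta Lake libraries are already configured on the Dataproc cluster.

# Minimal sketch: Delta Lake "time travel" on a Dataproc Spark cluster.
# gs://my-bucket/events is a placeholder path to an existing Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

# Current state of the table.
current = spark.read.format("delta").load("gs://my-bucket/events")

# The table as it existed at an earlier version (roughly what an HDFS
# snapshot gives you).
version_5 = spark.read.format("delta").option("versionAsOf", 5).load("gs://my-bucket/events")

# A rough stand-in for snapshotDiff: compare the two versions with ordinary
# DataFrame operations.
added_since_v5 = current.exceptAll(version_5)
removed_since_v5 = version_5.exceptAll(current)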
Please refer to the documentation. I hope it helps.

Related

In an AWS Redshift cluster I have created a database; now I want to see the database manually instead of querying it

I have created a database in a Redshift cluster and now I want to see the database and its tables manually instead of querying it.
Where can I see those databases?
create database example1;
With Redshift, there is no way to look at the data in any way except by issuing queries and commands against it. This is fairly common for most DBMS products.
AWS "recommends" the free tool SQL Workbench/J:
https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
In addition, you can issue commands against Redshift using the AWS Management Console:
https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor.html
My personal favorite (as a professional developer) is to use the JetBrains DataGrip product.
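If you would rather script it than use a GUI, a minimal sketch in Python with psycopg2 (the endpoint and credentials below are placeholders) can list databases and tables by querying Redshift's system catalogs:

# Sketch: listing databases and tables by querying Redshift's system catalogs.
# The endpoint, user, and password are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="example1",
    user="awsuser",
    password="...",
)

with conn.cursor() as cur:
    # Databases on the cluster.
    cur.execute("SELECT datname FROM pg_database;")
    print([row[0] for row in cur.fetchall()])

    # User tables visible in the current database.
    cur.execute(
        "SELECT DISTINCT schemaname, tablename FROM pg_table_def "
        "WHERE schemaname = 'public';"
    )
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")

conn.close()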

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
Data preparation/transformation/cleaning tasks can all be seen as ETL processes, implementable with any of the products you mention. This older answer covers the basics of the Dataflow vs Dataproc question and includes this link which summarises what you should keep in mind when choosing between these three.
In brief, you should consider familiarity (have you already worked with Hadoop-ecosystem tools? The Beam programming model? Would you rather work via a UI?) and the desired level of control (Dataproc allows more control over the cluster; Dataflow and Dataprep are fully managed services).
More good reads:
Comparing Cloud Dataflow autoscaling to Spark and Hadoop
Cleaning data in a data processing pipeline with Dataflow
Both Dataproc and Dataflow are data processing services on Google Cloud. What they have in common is that both can process batch or streaming data. Both also have workflow templates that make them easier to use.
But below are the distinguishing features of the two:
Dataproc is designed to run on clusters, which makes it compatible with Apache Hadoop, Hive, and Spark. It is significantly faster at creating clusters and can autoscale clusters without interrupting running jobs.
Dataflow is better if your data has no dependency on Spark or Hadoop. It does not run on user-managed clusters; instead, it is based on parallel data processing, where the data is split up and processed by multiple workers to reduce processing time.
An important note about Dataprep: it provides data cleaning and automatically identifies anomalies in the data. It is integrated with Cloud Storage, Bigtable, and BigQuery.

Does BigQuery support triggers?

We are currently using AWS RDS as our database. We defined some insert or update triggers on tables. I would like to know if BigQuery also supports triggers?
Thanks
BigQuery is a data warehouse product, similar to AWS Redshift and AWS Athena, and there is no trigger support.
If you have used AWS RDS so far, you should check out Google Cloud SQL.
Google Cloud SQL is an easy-to-use service that delivers fully managed SQL databases in the cloud. Google Cloud SQL provides either MySQL or PostgreSQL databases.
If you have a heavy load, then check out Google Cloud Spanner; it is even better for a fully scalable relational database.
Cloud Spanner is the only enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.
BigQuery doesn't have this feature, as stated in the answer above.
However, it has an event API based on its audit logs. You can inspect them and trigger actions with Cloud Functions, as per:
https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
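A rough sketch of that pattern, assuming a Cloud Logging sink that routes BigQuery audit log entries to a Pub/Sub topic (the method name checked below is illustrative; inspect your own audit logs for the events you care about):

# Sketch: a Pub/Sub-triggered (1st gen) Cloud Function that reacts to BigQuery
# audit log entries delivered by a Cloud Logging sink.
import base64
import json

def on_bigquery_event(event, context):
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    proto = entry.get("protoPayload", {})

    # React only to the audit events you care about, e.g. job insertions.
    if "jobservice" in proto.get("methodName", "").lower():
        user = proto.get("authenticationInfo", {}).get("principalEmail")
        print(f"BigQuery job event on {proto.get('resourceName')} by {user}")
        # ...kick off downstream work here, e.g. call another service.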
Regards

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Cloud Dataflow to implement an ETL data warehouse solution.
Looking into the Google Cloud offering, it seems Dataproc can also do the same thing.
It also seems Dataproc is a little bit cheaper than Dataflow.
Does anybody know the pros/cons of Dataflow over Dataproc?
Why does Google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions articles.
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. Balancing work, or Scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation rather than how a "runner" works (see the sketch after this list) -- in comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
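To make the "focus on your logical computation" point concrete, here is a minimal Beam word-count sketch in Python; the Cloud Storage paths are placeholders, and nothing in the pipeline code names a runner:

# Minimal Apache Beam sketch: the code describes only the logical computation;
# which runner executes it (DirectRunner locally, DataflowRunner on GCP,
# Spark/Flink runners elsewhere) is supplied via pipeline options at run time.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions()  # runner chosen here or via command-line flags
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcounts")
        )

if __name__ == "__main__":
    run()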
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
This flowchart from the Google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available at the link below:
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So typical use cases are ETL (extract, transform, load) jobs between various data sources/databases. For example, loading big files from Cloud Storage into BigQuery.
Streaming works based on a subscription to a Pub/Sub topic, so you can listen to real-time events (for example, from IoT devices) and then process them further.
An interesting concrete use case of Dataflow is Dataprep. Dataprep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining, etc.), a job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
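As a rough sketch of what such an ETL job looks like when pointed at the Dataflow service (the project, bucket, and table names below are placeholders), only the pipeline options change compared to a local run; the service takes care of the VMs and scaling:

# Sketch: loading a CSV from Cloud Storage into BigQuery on Cloud Dataflow.
# Project, region, bucket, and table names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    job_name="csv-to-bigquery-example",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadCsv" >> beam.io.ReadFromText("gs://my-bucket/input.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: dict(zip(["name", "score"], line.split(","))))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="name:STRING,score:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )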
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Dataproc clusters are quick to start, scale, and shutdown, with each of these operations taking 90 seconds or less, on average. This means you can spend less time waiting for clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google Cloud Platform services, such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than just a Spark or Hadoop cluster—you have a complete data platform. For example, you can use Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an administrator or special software. You can easily interact with clusters and Spark or Hadoop jobs through the Google Cloud Console, the Cloud SDK, or the Dataproc REST API. When you're done with a cluster, you can simply turn it off, so you don't spend money on an idle cluster. You won't need to worry about losing data, because Dataproc is integrated with Cloud Storage, BigQuery, and Cloud Bigtable.
Simple and familiar — You don't need to learn new tools or APIs to use Dataproc, making it easy to move existing projects into Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so you can be productive faster.
If you want to migrate from your existing Hadoop/Spark cluster to the cloud, or take advantage of the many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large-scale data processing and take their latest improvements for free, choose Dataflow.
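To illustrate the "quick to create and turn off" point above, here is a minimal sketch using the google-cloud-dataproc Python client; the project, region, cluster name, and machine sizes are placeholders.

# Sketch: creating and tearing down a short-lived Dataproc cluster with the
# google-cloud-dataproc client library. Names and sizes are placeholders.
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "ephemeral-etl-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# Create the cluster and wait for the long-running operation to finish.
client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# ...submit Spark/Hadoop jobs against the cluster here...

# Turn the cluster off when done so you are not paying for idle VMs.
client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()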
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparison:
DataProc
Dataflow
If you want to calculate and compare the cost of more GCP resources, please refer to this URL: https://cloud.google.com/products/calculator/
One other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Manage datasets of unpredictable size
Cloud Dataflow
Is a serverless data processing service that runs jobs written using the Apache Beam libraries.
When you run a job on Cloud Dataflow, it works like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming jobs. So typical use cases are ETL (extract, transform, load) jobs between various data sources/databases.
For example, loading big files from Cloud Storage into BigQuery.
Streaming works based on a subscription to a Pub/Sub topic, so you can listen to real-time events (for example from some IoT devices) and then further process the data.
An interesting concrete use case of Dataflow is Dataprep.
Dataprep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining, etc.), the job runs under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Dataproc
Is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

Does AWS Redshift support PostGIS extensions?

I know AWS RDS for PostgreSQL supports the PostGIS extension.
Does AWS Redshift support the PostGIS extension?
As of when this answer was written (June 2016), AWS Redshift itself still does not support PostGIS extensions.
The official documentation has not changed much. An AWS blog post said implicitly:
You can use the dblink extension to connect to Amazon Redshift and leverage PostgreSQL functionality. .... There are likely many other uses for the dblink extension with Amazon Redshift, such as PostGIS or LDAP support in PostgreSQL (Amazon EC2 only), ....
Based on the above AWS blog post, we can combine Amazon Redshift and an RDS/self-hosted PostgreSQL database to make PostGIS queries by using dblink.
If your data is in latitude/longitude pairs (EPSG:4326/WGS84), you can convert it to single-dimension data using a GeoHash or an S2 Geometry Library hash. Personally, I prefer the S2 Geometry Library because it has more precision levels and functionality. After that, you can query against that hash column, and of course you can combine the results with a PostGIS database.
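A minimal sketch of the hashing idea in Python, using the pygeohash package as one illustrative option (not necessarily what the original setup used):

# Sketch: reducing (lat, lng) pairs to a single sortable hash column that can
# be stored and prefix-filtered in Redshift. pygeohash is one illustrative
# choice; the S2 libraries follow the same idea with more precision levels.
import pygeohash

lat, lng = 52.5200, 13.4050  # example coordinates

# A longer geohash means a smaller cell; nearby points share a common prefix.
cell = pygeohash.encode(lat, lng, precision=7)
print(cell)  # a short base-32 string that fits in a plain VARCHAR column

# In Redshift, rows whose geohash shares a prefix are roughly in the same
# area, so a simple LIKE '<prefix>%' filter gives a coarse spatial pre-filter
# before doing exact PostGIS work on the PostgreSQL side via dblink.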
According to AWS docs,
Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of very important differences that you must be aware of as you design and develop your data warehouse applications.
(http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html)
It does not support the basic types that PostGIS depends upon
(https://forums.aws.amazon.com/message.jspa?messageID=425664#)
Therefore, the answer is no.