With Google Cloud Platform's Cloud SQL you get a lot of options to set up the infrastructure. Does this mean Cloud SQL is Infrastructure as a Service?
No, the infrastructure of Cloud SQL is managed by Google and by its engineers, so Cloud SQL is PaaS (Platform as a Service).
Cloud SQL is a Docker container built on top of a GCE instance; Google monitors everything for you and fixes the Cloud SQL instance automatically if something goes wrong (sometimes Google software engineers have to perform some actions to fix issues if the instance is stuck). So the only thing you have to take care of is storing and querying your data.
Cloud SQL also offers a lot of interesting features, such as failover replicas, read replicas, user and database administration, etc.
So with Cloud SQL, Google doesn't just sell the infrastructure to create databases, but also the application itself and the monitoring tools.
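To illustrate how little is left for you to manage, here is a minimal sketch of connecting and querying with the Cloud SQL Python Connector, assuming a MySQL instance; the connection name, credentials, and table are all hypothetical:

    from google.cloud.sql.connector import Connector

    connector = Connector()

    # Hypothetical instance connection name and credentials.
    conn = connector.connect(
        "my-project:us-central1:my-instance",
        "pymysql",
        user="app",
        password="...",
        db="appdb",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")  # hypothetical table
        print(cur.fetchone())
    conn.close()
    connector.close()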
I need to perform periodic data purge and data load operations between GCP BigQuery and GCP Cloud SQL.
This involves running multiple queries in GCP BigQuery and GCP Cloud SQL in a predetermined sequence, using the results of earlier queries in subsequent queries.
I am considering a few options, described below.
Option 1: Use BigQuery "Scheduled Queries" with "federated queries" (https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries). This is good as long as the job only involves triggering read-only queries in the GCP Cloud SQL database and running multiple queries in GCP BigQuery.
However, since my operation involves purging data from GCP Cloud SQL, federated queries are ruled out.
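For context, read-only federated queries go through BigQuery's EXTERNAL_QUERY function. Here is a minimal sketch with the BigQuery Python client; the connection ID, table, and columns are hypothetical, and the inner SQL must be read-only, which is exactly why this route cannot purge data:

    from google.cloud import bigquery

    client = bigquery.Client()

    # The inner query is pushed down to Cloud SQL and must be read-only.
    sql = """
    SELECT id, updated_at
    FROM EXTERNAL_QUERY(
      'my-project.us.my-cloudsql-connection',
      'SELECT id, updated_at FROM orders WHERE updated_at < NOW() - INTERVAL 90 DAY'
    )
    """
    for row in client.query(sql).result():
        print(row.id, row.updated_at)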
Option 2: Another option I am considering is to use a GCP Compute Engine Linux VM as my controller for performing operations that span a GCP Cloud SQL MySQL database and GCP BigQuery.
I can run a cron job to schedule the operation.
As far as running GCP Cloud SQL queries from a GCP Compute Engine VM goes, that is well documented by Google in the tutorial "Connect from a VM instance".
And for triggering the GCP BigQuery queries, the bq command-line tool (https://cloud.google.com/bigquery/docs/bq-command-line-tool) is a good option.
This should allow me to run a sequence of interlaced BigQuery and GCP Cloud SQL queries, as sketched below.
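What follows is only a rough sketch of such a controller in Python, assuming the Cloud SQL Auth Proxy is running locally on the VM and a MySQL instance; the project, dataset, table, and credential names are all hypothetical:

    # Sketch of the "option 2" controller; schedule it with a crontab
    # entry such as:  0 2 * * * /usr/bin/python3 /opt/jobs/purge_and_load.py
    from google.cloud import bigquery
    import pymysql  # assumes the Cloud SQL instance runs MySQL

    bq = bigquery.Client()

    # Step 1: read the IDs to purge from BigQuery (hypothetical table).
    rows = bq.query("SELECT id FROM `my-project.analytics.stale_records`").result()
    stale_ids = [(row.id,) for row in rows]

    # Step 2: purge them from Cloud SQL, connecting through the Cloud SQL
    # Auth Proxy listening on localhost.
    conn = pymysql.connect(host="127.0.0.1", user="purger",
                           password="...", db="appdb")
    try:
        with conn.cursor() as cur:
            cur.executemany("DELETE FROM orders WHERE id = %s", stale_ids)
        conn.commit()
    finally:
        conn.close()

    # Step 3: use the result of the previous step in a subsequent BigQuery job.
    bq.query(
        "INSERT INTO `my-project.analytics.purge_log` (purged) "
        f"VALUES ({len(stale_ids)})"
    ).result()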
Do you see any gotchas in "option 2" described above that I am contemplating?
Is there any other option that you can suggest? I wonder if Cloud Dataflow is an appropriate solution for a task that involves running queries across multiple databases (Cloud SQL and BigQuery in this case) and using the intermediate query results in subsequent queries.
Thinking about your option 2, I would probably consider Dataflow, Cloud Functions, or Cloud Run as the main backbone for those operations instead of VMs. You might find serverless solutions much cheaper and more reliable, depending on your wider context and on how frequently the "periodic" process is to run.
On the other hand, if you (or your company) already have relevant experience running some code on a VM, but no skills, knowledge, or experience with serverless solutions, the education overhead can increase the overall cost of this path.
To orchestrate your queries, you need an orchestrator. There are two on Google Cloud:
The big one: Cloud Composer, enterprise grade (in features and in cost!)
The new one: Cloud Workflows, serverless and easy to use
I recommend Cloud Workflows. You can create your workflow of the calls that you perform to BigQuery (federated queries or not).
If you need to update/delete data in Cloud SQL, I recommend creating a proxy Cloud Function (or Cloud Run service) that takes a SQL query as a parameter and executes it on Cloud SQL; see the sketch below.
You can call your Cloud Function (or Cloud Run service) from Workflows; it's only an HTTP call to perform.
Workflows also offers some processing capacity on the answers gathered from your API calls: you can parse a response, iterate over it, and call the subsequent step, even injecting data coming from previous steps.
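A minimal sketch of such a proxy as a Python Cloud Function, assuming the Functions Framework, PyMySQL, and the built-in Cloud SQL unix-socket connection; the instance name and credentials are hypothetical, and in practice you would restrict who can invoke it and what SQL it accepts:

    import functions_framework
    import pymysql

    @functions_framework.http
    def run_sql(request):
        # The workflow POSTs JSON such as {"query": "DELETE FROM orders WHERE ..."}.
        payload = request.get_json(silent=True) or {}
        query = payload["query"]
        conn = pymysql.connect(
            unix_socket="/cloudsql/my-project:us-central1:my-instance",
            user="workflow", password="...", db="appdb")
        try:
            with conn.cursor() as cur:
                affected = cur.execute(query)
            conn.commit()
        finally:
            conn.close()
        return {"rows_affected": affected}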
In general, for any cloud service provider (GCP in this context), isn't it relevant, and indeed mandatory, for Google to specifically allow consumers to choose the data residency and data processing region for all services? Otherwise, won't serverless options face serious adoption issues? Please clarify.
Google Cloud has two types of products: those deployed to a specified location and those available globally.
You can deploy resources in a specific location, or multi-regionally, for:
Compute: Compute Engine, App Engine, Google Kubernetes Engine, Cloud Functions
Storage & Databases: Cloud Storage, Bigtable, Spanner, Cloud SQL, Firestore, Memorystore, Persistent Disk...
Big Data & Machine Learning: BigQuery, Composer, Dataflow, Dataproc, AI training
Networking: VPC, Cloud Load Balancing
Developer Tools...
The following products are available only globally: some networking services, Big Data (Pub/Sub), machine learning (such as the Vision API), management tools, developer tools, and IAM.
For a detailed list, please check the Google Cloud locations documentation.
Even if a product is available globally, you may still have some control over data location. For example, with Pub/Sub it is possible to specify where messages are stored.
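A minimal sketch of pinning Pub/Sub message storage to a region at topic creation, using the Python client; the project, topic, and region are hypothetical:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "regional-topic")

    # Restrict where Pub/Sub may persist messages for this topic.
    topic = publisher.create_topic(
        request={
            "name": topic_path,
            "message_storage_policy": {
                "allowed_persistence_regions": ["europe-west1"],
            },
        }
    )
    print(topic.name)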
If the security of your data is the concern, be aware that Google Cloud Platform uses data encryption at rest: it consists of several layers of encryption to protect customer data.
I'm studying how to use GCP, with a focus on the big data and analytics functions, and I'm not quite sure about their functionality. I did some mapping to understand these components; could you help check my understanding?
Cloud Pub/Sub: Apache Kafka
Cloud Dataproc: Apache Hadoop, Spark
GCS: HDFS compatible
Cloud Dataflow: Apache Beam, Flink
Datastore: MongoDB
BigQuery: Teradata
BigTable: HBase
Memorystore: Redis
Cloud SQL: MySQL, PostgreSQL
Cloud Composer: Informatica
Cloud Data Studio: Tableau
Cloud Datalab: Jupyter notebook
I'm not totally sure what you want to know, but your understanding of the GCP products is not far off. If you are studying GCP and want to understand the products better, you can take a look at the Google Cloud Developer's Cheat Sheet; it has a brief explanation of all the products in GCP.
Link to the GitHub of the cheat sheet
Right now I am using Microsoft SQL Server for my database in my dev app. If in the future I want to migrate my database to Google Spanner, what guidelines should I follow from now on so that the migration will be easy? Also, does Google provide any migration tools like Microsoft® Data Migration Assistant?
Does Spanner have any local emulator, so that I can install it on my local machine and test things before creating a real instance? For reference, gcloud spanner instances create --help shows:

SYNOPSIS
    gcloud spanner instances create INSTANCE --config=CONFIG
        --description=DESCRIPTION --nodes=NODES [--async] [GLOBAL-FLAG ...]
Cloud Spanner is Google's horizontally scalable relational database. It is quite expensive (running it in a minimal configuration with 3 nodes would cost you at least $100 daily). Unless you really need the horizontal scalability, you should use Cloud SQL.
Cloud SQL is a managed MySQL or PostgreSQL service by Google. You can migrate your data to Cloud SQL easily, as this is a common use case. How you do it depends on your choice: for example, check this question for exporting to MySQL, or this link for converting to PostgreSQL.
Check Google's decision tree if you are unfamiliar with the details of Google's storage options.
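As for the emulator question: Google provides a local Cloud Spanner emulator that can be started with gcloud emulators spanner start. A minimal sketch of pointing the Python client at it, assuming the default gRPC port 9010 and hypothetical project/instance/database names:

    import os
    from google.cloud import spanner

    # Must be set before the client is created; any project ID works locally.
    os.environ["SPANNER_EMULATOR_HOST"] = "localhost:9010"

    client = spanner.Client(project="test-project")
    instance = client.instance("test-instance")   # hypothetical instance
    database = instance.database("test-db")       # hypothetical database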
Let's say a company has an application with a database hosted on AWS, along with a read replica on AWS. That same company then wants to build out a data analytics infrastructure in Google Cloud, to take advantage of the data analysis and ML services there.
Is it necessary to create an additional read replica within the Google Cloud context? If not, is there an alternative strategy that is frequently used in this context to bridge the two cloud services?
While services like Amazon Relational Database Service (RDS) provide read-replica capabilities, this works only between managed database instances on AWS.
If you are replicating a database between providers, then you are probably running the database yourself on virtual machines rather than using a managed service. The databases then appear just like any other resource on the Internet, so you can connect them exactly the way you would connect two resources across the Internet. However, you would be responsible for managing, monitoring, deploying, etc., which takes away much of the benefit of using cloud services.
Replicating between storage services like Amazon S3 would be easier, since it is just raw data rather than a running database. Also, big data is normally stored in raw format rather than being loaded into a database.
If the existing infrastructure is on a cloud provider, then try to perform the remaining activities on the same cloud provider.