What's the difference between Google Cloud Spanner and Cloud SQL? - google-cloud-platform

I am a novice in the GCP stack, so I am confused by the number of GCP technologies for storing data:
https://cloud.google.com/products/storage
Although Google Cloud Spanner is not mentioned in the page above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current point of view I don't see any significant difference between Cloud SQL (with PostgreSQL under the hood) and Cloud Spanner. I found that Spanner has somewhat different syntax, but that doesn't answer when I should prefer this technology to Cloud SQL.
Could you please explain it?
P.S.
I think of Cloud SQL as a traditional database with automatic replication and horizontal scalability managed by Google.

There is not a big difference between them in terms of what they do (storing data in tables). The difference is in how they handle data at small and large scale.
Cloud Spanner is used when you need to handle massive amounts of data with a high level of consistency and very high throughput (100,000+ reads/writes per second). Spanner gives much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store your customers' data cheaply but still don't want to deal with server configuration, Cloud SQL is the right choice.
If you are planning to build a big product, or you want to be ready for a huge increase in users (viral games/applications), Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper

The main difference between Cloud Spanner and Cloud SQL is the horizontal scalability + global availability of data over 10TB.
Spanner isn't for generic SQL needs; it is best used for massive-scale workloads: thousands of writes per second, globally, and tens to hundreds of thousands of reads per second, globally.
That volume is extremely difficult to achieve with ordinary SQL/MySQL without complex sharding of the database. Spanner deals with all of this AND allows ACID updates (which is practically impossible with sharded databases). It accomplishes this with super-accurate (TrueTime) clocks to manage conflicts.
In short, Spanner is not for CRM databases; it is for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to Cloud SQL), the project should be large enough to justify its additional cost.
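To make the ACID claim concrete, here is a minimal sketch of a cross-row, read-modify-write transaction with the google-cloud-spanner Python client; the instance, database, and Accounts table are hypothetical:

```python
# Sketch: an ACID read-modify-write in Spanner using the official
# google-cloud-spanner client. Instance, database, and the Accounts
# table are hypothetical.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

def transfer(transaction, from_id, to_id, amount):
    # The reads and the subsequent writes commit atomically, even if the
    # two rows live on different splits (shards).
    rows = transaction.read(
        table="Accounts",
        columns=("AccountId", "Balance"),
        keyset=spanner.KeySet(keys=[(from_id,), (to_id,)]),
    )
    balances = {account_id: balance for account_id, balance in rows}
    transaction.update(
        table="Accounts",
        columns=("AccountId", "Balance"),
        values=[
            (from_id, balances[from_id] - amount),
            (to_id, balances[to_id] + amount),
        ],
    )

# run_in_transaction retries the function automatically on transient aborts.
database.run_in_transaction(transfer, "alice", "bob", 10)
```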
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/

Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.

I would add to the previous answers that Cloud SQL provides managed instances of MySQL or PostgreSQL or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.
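For illustration, here is a hedged sketch of the dialect difference; the table and parameter names are invented. Spanner's GoogleSQL dialect binds named @parameters with explicit types, while a Cloud SQL for PostgreSQL driver such as psycopg2 uses %s placeholders:

```python
# Sketch: the same lookup against Spanner (GoogleSQL dialect) versus
# Cloud SQL for PostgreSQL. Table and column names are hypothetical.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# GoogleSQL uses named @parameters with explicit types.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT SingerId, FirstName FROM Singers WHERE LastName = @last",
        params={"last": "Smith"},
        param_types={"last": spanner.param_types.STRING},
    )
    for row in results:
        print(row)

# The equivalent against Cloud SQL for PostgreSQL with psycopg2 would be:
#   cur.execute("SELECT singer_id, first_name FROM singers"
#               " WHERE last_name = %s", ("Smith",))
```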

Related

Google Professional Cloud Architect Exam - BigQuery vs Cloud Spanner

I recently cleared the Google Cloud PCA exam, but I want to clarify one question about which I have a doubt.
" You are tasked with building online analytical processing (OLAP) marketing analytics and reporting tools. This requires a relational database that can operate on hundreds of terabytes of data. What is the Google-recommended tool for such applications?"
What is the answer: BigQuery or Cloud Spanner? There are two parts to the question: if we consider the OLAP part, then it is BigQuery; if we consider the RDBMS part, it should be Cloud Spanner.
I would appreciate some clarification.
Thanks
For Online Analytical Processing (OLAP) databases, consider using BigQuery.
When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs are possible with BigQuery and sometimes recommended on small tables.
You can check this documentation for further information.
BigQuery for OLAP and Google Cloud Spanner for OLTP.
Please check this other page for more information about it.
I agree that the question is confusing.
But according to the official documentation:
Other storage and database options
If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.
However, BigQuery is not considered a relational database.
BigQuery does not enforce relationships between tables, but you can join them freely. If performance falls, cluster and partition on the joining fields.
Is it possible to create relationships between tables?
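As a concrete sketch of the join-and-partition advice above (project, dataset, and table names are invented), a join in BigQuery is plain SQL, and you can declare partitioning and clustering on the filter/join columns in DDL:

```python
# Sketch: a BigQuery join plus partition/cluster DDL on the join and
# filter columns. Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Declare partitioning (on the date filter) and clustering (on the join
# key) up front so later joins scan less data.
client.query("""
    CREATE TABLE IF NOT EXISTS `my_project.shop.orders`
    PARTITION BY order_date
    CLUSTER BY customer_id AS
    SELECT * FROM `my_project.shop.orders_raw`
""").result()

# No relationships are enforced, but joins are ordinary SQL.
join_sql = """
    SELECT o.order_id, c.name, o.total
    FROM `my_project.shop.orders` AS o
    JOIN `my_project.shop.customers` AS c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2023-01-01'
"""
for row in client.query(join_sql).result():
    print(row.order_id, row.name, row.total)
```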
Some more literature, if anyone wants to go into the details.
By using MapReduce, enterprises can cost-effectively apply parallel data processing on their Big Data in a highly scalable manner, without bearing the burden of designing a large distributed computing cluster from scratch or purchasing expensive high-end relational database solutions or appliances.
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Hence, BigQuery.

Migrating Dynamodb to Spanner

We are currently migrating a DynamoDB table to Spanner. Since DynamoDB is a NoSQL database with indexing, it has become a difficult task to migrate NoSQL to a relational database. The only reason we are migrating to Spanner is its secondary indexing. But after migrating a few tables, we are witnessing latency issues in Spanner. Initially we planned to migrate to Cloud Bigtable, but unfortunately it doesn't support secondary indexes. Now, because of the latency issues and high read/write traffic, Spanner performance is going down. Is there any other data store in GCP that would be more suitable for this kind of use case, where we can have NoSQL as well as secondary indexes? We have around 200 TB of data in DynamoDB.
According to the Google Spanner documentation on Quotas & Limits, for good performance you should have a node for every 2 TB of data you store. Considering that, I recommend you take a look at the number of nodes you have active right now and raise it to improve the performance of your database; at roughly 200 TB, that guideline works out to about 100 nodes.
This documentation lists the best practices for configuring Spanner for the best possible performance.
In case this doesn't help, could you please take a look at the documentation Troubleshooting performance regressions? That way, you can dig further into what might be affecting the performance of your Spanner instance.
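As a minimal sketch (the instance ID is hypothetical), resizing an instance with the Python client looks like this:

```python
# Sketch: raising a Spanner instance's node count with the Python client.
# The instance ID is hypothetical; 200 TB at ~2 TB per node ~= 100 nodes.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
instance.reload()          # fetch the current configuration
instance.node_count = 100  # scale out to match the stored data

# update() returns a long-running operation; wait for it to finish.
operation = instance.update()
operation.result(timeout=300)
```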
Let me know if the information helped you!
Go with Firestore in Datastore mode. It has secondary indexes, is basically serverless, offers practically unlimited throughput, and is a NoSQL database as well.
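A minimal sketch of what that looks like with the google-cloud-datastore Python client (the Task kind and its properties are invented); filtering on a plain property works because single-property indexes are maintained automatically:

```python
# Sketch: a property filter in Firestore in Datastore mode, using the
# google-cloud-datastore client. Kind and property names are hypothetical.
from google.cloud import datastore

client = datastore.Client()

# Write an entity; single-property indexes are maintained automatically.
task = datastore.Entity(client.key("Task", "task-1"))
task.update({"owner": "alice", "done": False, "priority": 4})
client.put(task)

# Query on a non-key property -- this is the "secondary index" behaviour
# that Bigtable lacks out of the box.
query = client.query(kind="Task")
query.add_filter("owner", "=", "alice")
for entity in query.fetch():
    print(entity.key.name, entity["priority"])
```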

Price aside, why ever choose Google Cloud Bigtable over Google Cloud Datastore?

If I have a use case for both huge data storage and searchability, why would I ever choose Google Cloud Bigtable over Google Cloud Datastore?
I've seen a few questions on SO and other sites "comparing" Bigtable and Datastore, but they seem to boil down to the same non-specific answers.
Here's my current knowledge and my thoughts:
Datastore is more expensive.
In the context of this question, let's forget entirely about pricing.
Bigtable is good for huge datasets.
It seems like Datastore is, too? I'm not seeing what specifically makes Bigtable objectively superior here.
Bigtable is better than Datastore for analytics.
How? Why? It seems like I can do analytics in Datastore as well, no problem. Why is Bigtable seemingly the unanimous decision industry-wide for analytics? What value do GMail, eBay, etc. get from Bigtable that Datastore can't provide?
Bigtable is integrated with Hadoop, Spark, etc.
Is Datastore not as well, considering it's built on Bigtable?
From this question, this statement was made in an answer:
Bigtable and Datastore are extremely different. Yes, the datastore is built on top of Bigtable, but that does not make it anything like it. That is kind of like saying a car is built on top of [car] wheels, and so a car is not much different from wheels.
However, this analogy seems nonsensical, since the car (including the wheels) intrinsically provides more value than just the wheels by themselves.
It seems at first glance that Bigtable is strictly worse than Datastore, only providing a single index and limiting quick searchability. What am I missing?
Bigtable and Datastore are optimized for slightly different use-cases, and offer different tradeoffs. The main ones are:
Data model:
Bigtable is a wide-column database -- think HBase and Cassandra
Datastore is a document database -- think MongoDB
Note that both of these can be used for key-value use cases
Cost model:
Bigtable charges per provisioned nodes
Datastore is serverless and charges per operation
In general, Bigtable is a good choice if you need:
Fast point-reads and range scans (especially at scale). Bigtable will offer lower latency for key-value lookups, as well as fast scans of contiguous rows - a powerful tool since rows are stored in lexicographic order. If you have simple, predictable query patterns and design your schema well, reading from Bigtable can be incredibly efficient (see the sketch after this list).
High throughput writes (again, especially at scale). This is possible in part because Bigtable is eventually consistent - in exchange you can see big wins in price/performance.
Example use-cases that are great for Bigtable include time series data (for IoT, monitoring, and more - think extremely write heavy workloads and massive amounts of data generated over x units of time), analytics (think fraud detection, personalization, recommendations), and ad-serving (every microsecond counts).
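To make the range-scan point above concrete, here is a minimal sketch with the google-cloud-bigtable Python client; the instance, table, column family, and row-key scheme are all invented:

```python
# Sketch: a key-design-driven range scan in Bigtable with the Python
# client. Instance, table, and row-key scheme are hypothetical.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

# Rows are stored in lexicographic key order, so keys like
# "device123#20240101..." make "all readings for one device over a time
# window" a single contiguous scan.
rows = table.read_rows(
    start_key=b"device123#20240101",
    end_key=b"device123#20240201",
)
for row in rows:
    cell = row.cells["stats"][b"temperature"][0]
    print(row.row_key, cell.value)
```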
Datastore (or Firestore) is a good choice if you need:
Query flexibility: Datastore offers document support and secondary indexes.
Strong consistency and/or transactions: Bigtable has eventually consistent replication and does not support multi-row transactions.
Mobile SDKs: Datastore and Firestore are incredibly well integrated with the Firebase ecosystem.
Example use-cases include mobile and web applications, game state, user profiles, and product catalogs.
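For the transactions point above, here is a minimal sketch with the google-cloud-datastore Python client (kinds and fields are invented); the two entity updates commit atomically, which Bigtable's per-row atomicity cannot give you:

```python
# Sketch: an atomic multi-entity update in Datastore mode -- something
# Bigtable (atomic only per row) does not offer. Names are hypothetical.
from google.cloud import datastore

client = datastore.Client()

with client.transaction():
    player = client.get(client.key("Player", "alice"))
    guild = client.get(client.key("Guild", "red-team"))
    player["gold"] = player["gold"] - 100
    guild["treasury"] = guild["treasury"] + 100
    client.put_multi([player, guild])  # both commit or neither does
```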
To answer a few of your questions explicitly:
Why is Bigtable used for analytics? It's mostly about performance: analytics use-cases are more likely to have large datasets and require high write throughput. It's a lot easier to run into the limits of a database if you're storing clickstream data, as opposed to something like user account information. Fast scans are also important for analytics use-cases: Bigtable allows you to retrieve all of the information you need about a user or a device extremely quickly, which you can process in a batch job or use to create recommendations and analysis on the fly.
Is Bigtable strictly worse than Datastore? Datastore definitely provides more built-in functionality like secondary indexes and document support, and if you need those features, Datastore is a fantastic choice. But that functionality comes with tradeoffs. Bigtable provides perhaps lower-level, but incredibly performant APIs that allow users to make those tradeoffs for themselves: If a user values, say, write performance over secondary indexes, Bigtable is an excellent option. You can think of it as an extremely versatile and powerful infrastructural building block. I actually like the wheel/car analogy: sometimes you don't want the car -- if what you really need is a dirt bike, a set of solid wheels is much more useful :)

AWS Redshift vs Snowflake use cases

I was wondering if anyone has used both AWS Redshift and Snowflake, and knows of use cases where one is better. I have used Redshift, but recently someone suggested Snowflake as a good alternative. My use case is basically retail marketing data that will be used by a handful of analysts who are not terribly SQL savvy and will most likely have a reporting tool on top.
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
Snowflake's admin console is brilliant; Redshift has none.
Scale-up/down happens in seconds to minutes; Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake's is better laid out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift. Redshift assumes that your data is in S3 already. Snowflake supports S3, but has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process.
Snowflake has great support for in-database JSON, and is rapidly enhancing its XML. Redshift has a more complex approach to JSON, and recommends against it for all but smaller use cases, and does not support XML.
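To illustrate the JSON point (the connection parameters and table are placeholders), Snowflake stores raw JSON in a VARIANT column and lets you query paths directly:

```python
# Sketch: querying JSON stored in a Snowflake VARIANT column via the
# snowflake-connector-python driver. Credentials/table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# payload is a VARIANT column holding raw JSON documents; the colon/dot
# path syntax drills into it, and :: casts the result.
cur.execute("""
    SELECT payload:customer.id::string   AS customer_id,
           payload:order.total::number   AS order_total
    FROM raw_events
    WHERE payload:order.total::number > 100
""")
for customer_id, order_total in cur:
    print(customer_id, order_total)
```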
I can only think of two cases where Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent from your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift (Redshift Spectrum with S3) and Snowflake.
In my POC, Snowflake was way better than Redshift. Snowflake integrates well with relational/NoSQL data. No upfront index or partition key is required, and it works amazingly well without your having to worry about how the data will be accessed.
Redshift is very limited and has no JSON support. Its partitioning is hard to understand, and you have to do a lot of work to get anything done. You can use Redshift Spectrum as a band-aid to access S3, but good luck with partitioning upfront: once you have created a partition layout in an S3 bucket, you are stuck with it, and there is no way to change it unless you reprocess all the data into a new structure. You will end up spending time fixing these issues instead of working on real business problems.
It's like comparing a smartphone with a Morse code machine. Redshift is a Morse-code kind of implementation, and it's not for modern development.
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing of database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. For Redshift, I could actually get a better price-to-performance ratio for a couple of reasons:
allows me to choose a distribution key which is huge for co-located joins
allows for extreme discounts on three year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
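As an example of that "secret sauce" (table and column names are invented), the distribution and sort keys are declared in the DDL; a sketch via psycopg2:

```python
# Sketch: the Redshift physical-design knobs mentioned above, issued via
# psycopg2. Cluster endpoint, table, and column names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    # DISTKEY co-locates rows that join on customer_id on the same slice;
    # SORTKEY lets range filters on order_date skip blocks.
    cur.execute("""
        CREATE TABLE orders (
            order_id    BIGINT,
            customer_id BIGINT,
            order_date  DATE,
            total       DECIMAL(12,2)
        )
        DISTKEY (customer_id)
        SORTKEY (order_date)
    """)
```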
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects (see the sketch after this list)
deep functionality in handling JSON data
Snowpipe for low-maintenance loading, auto-scaling loads, trickle updates
streams & tasks for home-grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
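As an example, the cloning feature from the list above is a single statement; a hedged sketch with snowflake-connector-python (all names are placeholders):

```python
# Sketch: Snowflake zero-copy cloning plus Time Travel, using
# snowflake-connector-python. Connection parameters and names are
# placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                    password="...", database="my_db",
                                    schema="public", warehouse="my_wh")
cur = conn.cursor()

# Zero-copy clone: instant, and no data is duplicated until rows diverge.
cur.execute("CREATE TABLE orders_dev CLONE orders")

# Combine with Time Travel: clone the table as it was two hours ago.
cur.execute("CREATE TABLE orders_2h_ago CLONE orders AT(OFFSET => -7200)")
```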
One thing that I would caution about Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance in a bad schema design can be worked around using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.

Looking for a hosted back-end business data storage for analytics

I want a simple hosted data store licensed for business applications. I want the following features:
REST-like access for CRUD operations (primarily adding records)
private and authenticated
makes for easy integration with a front end charting client like Google Visualization Apis
easy to use and set up
What about:
* Google Fusion Tables
* Google Cloud Services
* Google BigQuery
* Google Cloud SQL
Or other non-Google products; but I am imagining a cleaner integration between Google Charts and one of Google's backend data services.
Pros, Cons, Advice?
First, since this is Stack Overflow, I won't attempt to provide a judgement about "easy to use and set up" - you can assess that by reading the documentation for each product.
That being said, overall, the "right" answer really depends on what you are trying to do, and how much data you have. It also depends on what type of application you are building (this is Stack Overflow, so I am assuming you are a developer).
Relational databases (like Google Cloud SQL) are great for maintaining transactional consistency, but once your data grows massive, it becomes difficult, expensive, or impossible to run analytical queries in a reasonable timeframe.
Google BigQuery is an analysis tool that allows developers to ask questions about really, really big datasets using a SQL-like language. It is 100% cloud based and is accessed via a RESTful API - but it only allows for appending data, not changing individual records.
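To illustrate that append-plus-query access pattern (project, dataset, table, and fields are invented), the google-cloud-bigquery Python client wraps both the streaming-insert and query endpoints:

```python
# Sketch: append-only usage of BigQuery -- stream rows in, then analyze
# with SQL. Project/dataset/table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.analytics.page_views"

# Appending records (no UPDATE of individual rows in this access pattern).
errors = client.insert_rows_json(table_id, [
    {"user_id": "u1", "page": "/home", "ts": "2024-01-01T12:00:00Z"},
])
assert not errors, errors

# Ask questions over the whole dataset with SQL.
query = f"SELECT page, COUNT(*) AS views FROM `{table_id}` GROUP BY page"
for row in client.query(query).result():
    print(row.page, row.views)
```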