Migrating DynamoDB to Spanner - google-cloud-platform

Currently we are migrating DynamoDB tables to Spanner. Since DynamoDB is a NoSQL database with indexing, migrating from NoSQL to a relational database is a difficult task. The only reason we are migrating to Spanner is secondary indexing. But after migrating a few tables, we are seeing latency issues in Spanner. Initially we planned to migrate to Cloud Bigtable, but unfortunately it doesn't support secondary indexes. Now, because of the latency issues and high read/write traffic, Spanner performance is going down. Is there any other data store in GCP that would be more suitable for this kind of use case, where we can have NoSQL as well as secondary indexes? We have around 200 TB of data in DynamoDB.

According to the Google Spanner Quotas & Limits documentation, for improved performance you should have a node for every 2 TB of data stored. Considering that, I would recommend you take a look at the number of nodes you currently have active and raise it to improve the performance of your database.
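For the roughly 200 TB mentioned in the question, that guideline alone implies a three-digit node count. A quick back-of-envelope check (the 2 TB/node figure is from the docs, the 200 TB from the question):

```python
import math

# Figures: ~200 TB of data (from the question) and the documented
# guideline of one node per 2 TB of stored data.
data_tb = 200
tb_per_node = 2

min_nodes = math.ceil(data_tb / tb_per_node)
print(min_nodes)  # -> 100 nodes, before any headroom for CPU load
```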
The documentation here covers best practices for configuring Spanner for the best possible performance.
In case this doesn't help, could you please take a look at the documentation on troubleshooting performance regressions? That way, you can take a closer look at what might be affecting the performance of your Spanner instance.
Let me know if the information helped you!

Go with Firestore in Datastore mode. It has secondary indexes, is basically serverless and practically unlimited in throughput, and is a NoSQL database as well.
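As a hedged illustration of the secondary-index point: with the google-cloud-datastore Python client, any indexed property can be filtered without extra setup. The `Order` kind and `status` property below are invented for the example:

```python
from google.cloud import datastore

client = datastore.Client()

# Datastore maintains a built-in index per property, so this filter
# runs against an index rather than scanning the whole kind.
query = client.query(kind="Order")
query.add_filter("status", "=", "PENDING")

for entity in query.fetch(limit=10):
    print(entity.key, dict(entity))
```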

Related

Google Professional Cloud Architect Exam - BigQuery vs Cloud Spanner

Recently I cleared the Google Cloud PCA exam, but I want to clarify one question that I have doubts about.
" You are tasked with building online analytical processing (OLAP) marketing analytics and reporting tools. This requires a relational database that can operate on hundreds of terabytes of data. What is the Google-recommended tool for such applications?"
What is the answer: BigQuery or Cloud Spanner? There are two parts to the question. If we consider the OLAP part, it is BigQuery; for the second part, the RDBMS requirement, it should be Cloud Spanner.
I'd appreciate some clarification.
Thanks
For Online Analytical Processing (OLAP) databases, consider using BigQuery.
When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs are possible with BigQuery and sometimes recommended on small tables.
You can check this documentation for further information.
BigQuery for OLAP and Google Cloud Spanner for OLTP.
Please check this other page for more information about it.
I agree that the question is confusing.
But according to the official documentation:
Other storage and database options
If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.
However, BigQuery is not considered a relational database.
BigQuery does not provide relationships between tables, but you can join them freely.
If your performance falls, cluster and then partition on the joining fields.
Is it possible to create relationships between tables?
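To make the clustering/partitioning advice above concrete, here is a hedged sketch using the google-cloud-bigquery Python client; the dataset, table, and column names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partitioning prunes the bytes scanned by date filters; clustering
# co-locates rows that are joined or filtered on customer_id.
ddl = """
CREATE TABLE mydataset.orders
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id
AS SELECT * FROM mydataset.orders_staging
"""
client.query(ddl).result()  # block until the DDL job completes
```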
Some more literature if someone wants to go into the details.
By using MapReduce, enterprises can cost-effectively apply parallel data processing on their Big Data in a highly scalable manner, without bearing the burden of designing a large distributed computing cluster from scratch or purchasing expensive high-end relational database solutions or appliances.
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Hence, BigQuery.

What's the difference between Google Cloud Spanner and Cloud SQL?

I am a novice in the GCP stack, so I am confused about the number of GCP technologies for storing data:
https://cloud.google.com/products/storage
Although Google Cloud Spanner is not mentioned in the article above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current view I don't see any significant difference between Cloud SQL (with Postgres under the hood) and Cloud Spanner. I found that it has slightly different syntax, but that doesn't tell me when I should prefer this technology to Cloud SQL.
Could you please explain it?
P.S.
I consider Cloud SQL a traditional database with automatic replication and horizontal scalability managed by Google.
There is not a big difference between them in terms of what they do (storing data in tables). The difference is how they handle data at small and large scale.
Cloud Spanner is used when you need to handle massive amounts of data with a high level of consistency and a large volume of reads and writes (100,000+ reads/writes per second). Spanner gives much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store some data about your customers cheaply but still don't want to deal with server configuration, Cloud SQL is the right choice.
If you are planning to create a big product, or if you want to be ready for a huge increase in users for your application (viral games/applications), Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper
The main difference between Cloud Spanner and Cloud SQL is the horizontal scalability + global availability of data over 10TB.
Spanner isn't for generic SQL needs; Spanner is best used for massive-scale opportunities: 1000s of writes per second, globally, and 10,000s to 100,000s of reads per second, globally.
That volume is extremely difficult to achieve with normal SQL/MySQL without complex sharding of the database. Spanner deals with all this AND allows ACID updates (which are basically impossible with sharded databases). It accomplishes this with super-accurate clocks to manage conflicts.
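As a minimal sketch of what such an ACID update looks like from the client side, using the google-cloud-spanner Python library (the instance, database, and `Accounts` table are hypothetical):

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

def transfer(transaction):
    # Read the current balance...
    row = transaction.execute_sql(
        "SELECT Balance FROM Accounts WHERE AccountId = @id",
        params={"id": 1},
        param_types={"id": spanner.param_types.INT64},
    ).one()
    # ...and write the new one. Read and write commit atomically,
    # even though the data is distributed across splits (shards).
    transaction.update(
        "Accounts",
        columns=["AccountId", "Balance"],
        values=[(1, row[0] - 100)],
    )

database.run_in_transaction(transfer)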
In short, Spanner is not for CRM databases; it is more for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to Cloud SQL), the project should be large enough to justify its additional cost.
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/
Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.
I would add to the previous answers that Cloud SQL provides managed instances of MySQL or PostgreSQL or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.

Price aside, why ever choose Google Cloud Bigtable over Google Cloud Datastore?

If I have a use case for both huge data storage and searchability, why would I ever choose Google Cloud Bigtable over Google Cloud Datastore?
I've seen a few questions on SO and other sites "comparing" Bigtable and Datastore, but it seems to boil down to the same non-specific answers.
Here's my current knowledge and my thoughts:
Datastore is more expensive.
In the context of this question, let's forget entirely about pricing.
Bigtable is good for huge datasets.
It seems like Datastore is, too? I'm not seeing what specifically makes Bigtable objectively superior here.
Bigtable is better than Datastore for analytics.
How? Why? It seems like I can do analytics in Datastore as well, no problem. Why is Bigtable seemingly the unanimous decision industry-wide for analytics? What value do GMail, eBay, etc. get from Bigtable that Datastore can't provide?
Bigtable is integrated with Hadoop, Spark, etc.
Is Datastore not as well, considering it's built on Bigtable?
From this question, this statement was made in an answer:
Bigtable and Datastore are extremely different. Yes, Datastore is built on top of Bigtable, but that does not make it anything like it. That is kind of like saying a car is built on top of [car] wheels, and so a car is not much different from wheels.
However, this analogy seems nonsensical, since the car (including the wheels) intrinsically provides more value than just the wheels by themselves.
It seems at first glance that Bigtable is strictly worse than Datastore, only providing a single index and limiting quick searchability. What am I missing?
Bigtable and Datastore are optimized for slightly different use-cases, and offer different tradeoffs. The main ones are:
Data model:
Bigtable is a wide-column database -- think HBase and Cassandra
Datastore is a document database -- think MongoDB
Note that both of these can be used for key-value use cases
Cost model:
Bigtable charges per provisioned node
Datastore is serverless and charges per operation
In general, Bigtable is a good choice if you need:
Fast point-reads and range scans (especially at scale). Bigtable will offer lower latency for key-value lookups, as well as fast scans of contiguous rows - a powerful tool since rows are stored in lexicographic order. If you have simple, predictable query patterns and design your schema well, reading from Bigtable can be incredibly efficient (see the sketch after this list).
High throughput writes (again, especially at scale). This is possible in part because Bigtable is eventually consistent - in exchange you can see big wins in price/performance.
Example use-cases that are great for Bigtable include time series data (for IoT, monitoring, and more - think extremely write heavy workloads and massive amounts of data generated over x units of time), analytics (think fraud detection, personalization, recommendations), and ad-serving (every microsecond counts).
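Here is the sketch promised above: a contiguous row-range scan with the google-cloud-bigtable Python client. The instance/table IDs and the `device#<id>#<timestamp>` row-key scheme are illustrative assumptions:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

# Rows are stored in lexicographic key order, so all readings for
# device 42 form one contiguous range that scans very cheaply.
rows = table.read_rows(start_key=b"device#42#", end_key=b"device#42#\xff")
for row in rows:
    print(row.row_key, row.cells)
```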
Datastore (or Firestore) is a good choice if you need:
Query flexibility: Datastore offers document support and secondary indexes.
Strong consistency and/or transactions: Bigtable has eventually consistent replication and does not support multi-row transactions (see the sketch after this list).
Mobile SDKs: Datastore and Firestore are incredibly well-integrated with the Firebase ecosystem.
Example use-cases include mobile and web applications, game state, user profiles, and product catalogs.
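And the transaction sketch referenced above: a multi-entity atomic update with the google-cloud-datastore client, something Bigtable's single-row atomicity cannot express (the `Player` kind and key IDs are made up):

```python
from google.cloud import datastore

client = datastore.Client()

with client.transaction():
    alice = client.get(client.key("Player", 1))
    bob = client.get(client.key("Player", 2))
    alice["gold"] -= 50
    bob["gold"] += 50
    # Both entity writes commit atomically when the block exits.
    client.put_multi([alice, bob])
```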
To answer a few of your questions explicitly:
Why is Bigtable used for analytics? It's mostly about performance: analytics use-cases are more likely to have large datasets and require high write throughput. It's a lot easier to run into the limits of a database if you're storing clickstream data, as opposed to something like user account information. Fast scans are also important for analytics use-cases: Bigtable allows you to retrieve all of the information you need about a user or a device extremely quickly, which you can process in a batch job or use to create recommendations and analysis on the fly.
Is Bigtable strictly worse than Datastore? Datastore definitely provides more built-in functionality like secondary indexes and document support, and if you need those features, Datastore is a fantastic choice. But that functionality comes with tradeoffs. Bigtable provides perhaps lower-level, but incredibly performant APIs that allow users to make those tradeoffs for themselves: If a user values, say, write performance over secondary indexes, Bigtable is an excellent option. You can think of it as an extremely versatile and powerful infrastructural building block. I actually like the wheel/car analogy: sometimes you don't want the car -- if what you really need is a dirt bike, a set of solid wheels is much more useful :)

AWS Redshift vs Snowflake use cases

I was wondering if anyone has used both AWS Redshift and Snowflake and knows use cases where one is better. I have used Redshift, but recently someone suggested Snowflake as a good alternative. My use case is basically retail marketing data that will be used by a handful of analysts who are not terribly SQL savvy and will most likely have a reporting tool on top.
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
The admin console is brilliant; Redshift has none.
Scale-up/down happens in seconds to minutes; Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake is better laid out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift. Redshift assumes that your data is in S3 already. Snowflake supports S3, but has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process.
Snowflake has great support for in-database JSON and is rapidly enhancing its XML support. Redshift has a more complex approach to JSON, recommends against it for all but smaller use cases, and does not support XML.
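To make the JSON point concrete, here is a hedged sketch with snowflake-connector-python; the connection details and the `events` table with a VARIANT `payload` column are assumptions:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="MYDB", schema="PUBLIC",
)
cur = conn.cursor()

# Snowflake's path syntax reads straight out of the VARIANT column:
# no pre-flattening and no oversized VARCHARs to parse.
cur.execute("""
    SELECT payload:customer.name::string AS customer,
           payload:order.total::number   AS total
    FROM events
    WHERE payload:order.total > 100
""")
for customer, total in cur.fetchall():
    print(customer, total)
```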
I can only think of two cases in which Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent from your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift (Redshift Spectrum with S3) and Snowflake.
In my PoC, Snowflake was way, way better than Redshift. Snowflake integrates well with relational/NoSQL data. No upfront index or partition key is required. It works amazingly well without you having to worry about how you will access the data.
Redshift is very limited and has no JSON support. It's hard to understand the partitioning, and you have to do a lot of work to get something done. You can use Redshift Spectrum as a band-aid to access S3, but good luck with partitioning upfront: once you have created a partition in an S3 bucket, you are stuck with it, with no way to change it unless you reprocess all the data into a new structure. You will end up spending time fixing these issues instead of working on real business problems.
It's like comparing a smartphone with a Morse code machine. Redshift is a Morse-code kind of implementation, and it's not for modern development.
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. For Redshift, I could actually get a better price-to-performance ratio for a couple of reasons:
it allows me to choose a distribution key, which is huge for co-located joins (a sketch follows this list)
it allows for extreme discounts on three-year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
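Here is the sketch promised above of that distribution-key "secret sauce". Redshift speaks the PostgreSQL wire protocol, so psycopg2 works against it; the cluster endpoint and the `sales` table are hypothetical:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE sales (
            customer_id BIGINT,
            sale_ts     TIMESTAMP,
            amount      DECIMAL(12, 2)
        )
        DISTKEY (customer_id)   -- joins on customer_id stay node-local
        SORTKEY (sale_ts);      -- time-range filters scan less data
    """)
```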
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects
deep functionality in handling JSON data
Snowpipe for low-maintenance loading, auto-scaling loads, and trickle updates
streams & tasks for home-grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
One thing I would caution about with Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance with a bad schema design can be worked around by using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.

What is the difference between AWS Elasticsearch and AWS Redshift

I read the documentation saying that both are for data analysis and have a cluster structure, but I don't understand how the use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. Amazon Elasticsearch
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (AFAIK) mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster, and you don't need to run any commands in order to install any software - you just need to choose what to run your cluster on (number of nodes, disk, RAM, etc.).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
I agree with @IMSoP's assertions (see the answer below)...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - whether speaking of the AWS ElasticSearch Service or FOSS / Enterprise ElasticSearch (which have significant differences between them, even) - ElasticSearch is NOT a relational database (RDBMS), nor is it quite a NoSQL (document store) database, either...
ElasticSearch is a search engine / index. It does some things very well, for very specific use cases. Most significantly, unlike RDBMS data models, ElasticSearch and NoSQL are not going to provide you with FULL ACID compliance or transactional statement processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (even to a point in time), and normalization of the data model for performance and the least repetition of data - while providing deep cardinality and enforcing model constraints for optimal integrity - "NoSQL and Elastic are not the droids you're looking for...", and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift service is based on PostgreSQL - one of the most popular open-source RDBMS flavors out there - just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between the RDBMS and NoSQL categories, as it is a search engine / index that works most optimally with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. But perhaps the most important thing I could stress is that, in my experience, it typically does not scale very cost-effectively (even with managed cluster services) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - and for most it will likely become cost PROHIBITIVE VERY fast. That said, ElasticSearch DOES still have very optimal use cases, so it is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly, let's call NoSQL what it is: a document store that stores collections of documents (most often in JSON format). While such stores also do indexing, offer some semblance of an authentication and authorization model, and provide CRUD operability (or even SQL support nowadays - which makes the career enterprise data engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D) - a document store is still NOT a traditional database, and it likely won't provide you with much control over your data's integrity. BUT that is precisely what "NoSQL" document stores were designed to work best for: UNSTRUCTURED DATA, where you may not always know what your data model is going to look like from the start, or where your use case prioritizes data-model flexibility over enforcing data integrity in general (non-mission-critical data). Last - while most modern NoSQL document stores may have SOME features that appear on the surface to resemble an RDBMS, I am not aware of ANY at current that could claim to offer all that a relational database does, with Oracle MySQL's Document Store being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope developers with similar questions come across this thread and, after reading, are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves, everything we do in our profession is about data: generating it, transporting it, rendering it, transforming it... it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
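As a hedged sketch of that product-search use case, using the Python elasticsearch client (8.x-style API; the index name, mapping, and endpoint are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-domain.us-east-1.es.amazonaws.com")

# Fuzzy matching tolerates the user's typo and still ranks
# relevant products first - classic search-engine behaviour.
resp = es.search(index="products", query={
    "match": {"name": {"query": "electric toothbrsh", "fuzziness": "AUTO"}}
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```

That kind of typo-tolerant, relevance-ranked lookup is exactly what would take real effort to replicate in a data warehouse.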
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
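A hedged sketch of that kind of warehouse query, via psycopg2 since Redshift speaks the PostgreSQL protocol; the star-schema tables here are invented:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn.cursor() as cur:
    cur.execute("""
        SELECT city, yr, COUNT(*) AS buyers_of_both
        FROM (
            -- customers who bought BOTH products, per city and year
            SELECT c.city,
                   DATE_PART(year, s.sale_ts) AS yr,
                   s.customer_id
            FROM sales s
            JOIN customers c ON c.customer_id = s.customer_id
            JOIN products  p ON p.product_id  = s.product_id
            WHERE p.name IN ('toothbrush', 'toothpaste')
            GROUP BY 1, 2, 3
            HAVING COUNT(DISTINCT p.name) = 2
        ) AS both_buyers
        GROUP BY city, yr
        ORDER BY city, yr;
    """)
    for city, yr, buyers in cur.fetchall():
        print(city, int(yr), buyers)
```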
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.