I recently passed the Google Cloud PCA exam, but I would like to clarify one question I am unsure about.
" You are tasked with building online analytical processing (OLAP) marketing analytics and reporting tools. This requires a relational database that can operate on hundreds of terabytes of data. What is the Google-recommended tool for such applications?"
What is the answer: BigQuery or Cloud Spanner? The question has two parts. If we consider the OLAP part, the answer is BigQuery, but for the second part (a relational database) it should be Cloud Spanner.
Appreciate it if I can have some clarification.
Thanks
For Online Analytical Processing (OLAP) databases, consider using BigQuery.
When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs are possible with BigQuery and sometimes recommended on small tables.
You can check this documentation for further information.
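The JOIN-then-aggregate pattern on normalized tables can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module as a local stand-in, not BigQuery itself; the `campaigns` and `orders` tables and their columns are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create two small normalized tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE campaigns (campaign_id INTEGER PRIMARY KEY, channel TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER PRIMARY KEY, campaign_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO campaigns VALUES (?, ?)",
        [(1, "email"), (2, "search"), (3, "email")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 20.0), (11, 2, 5.0), (12, 3, 30.0), (13, 2, 7.5)],
    )
    return conn


def revenue_by_channel(conn):
    """Join the fact table to its dimension table, then aggregate."""
    query = """
        SELECT c.channel, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN campaigns AS c ON c.campaign_id = o.campaign_id
        GROUP BY c.channel
        ORDER BY c.channel
    """
    return conn.execute(query).fetchall()
```

Calling `revenue_by_channel(build_demo_db())` returns one row per channel with its summed revenue. The SQL has the same shape you would write in a warehouse; only the engine differs.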
BigQuery for OLAP and Google Cloud Spanner for OLTP.
Please check this other page for more information about it.
I agree that the question is confusing.
But according to the official documentation :
Other storage and database options
If you need interactive querying in an online analytical processing
(OLAP) system, consider BigQuery.
However, BigQuery is not considered a relational database.
BigQuery does not let you define relationships between tables, but you can join them freely.
If join performance degrades, cluster or partition the tables on the joining fields.
Is it possible to create relationships between tables?
Some more literature for anyone who wants to go into the details.
By using MapReduce, enterprises can cost-effectively apply parallel
data processing on their Big Data in a highly scalable manner, without
bearing the burden of designing a large distributed computing cluster
from scratch or purchasing expensive high-end relational database
solutions or appliances.
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Hence BigQuery.
Related
I am new to BQ. I have a table with around 200 columns, and when I wanted to get the DDL of this table there was no ready-made option available. CTAS is not always desirable: sometimes we don't have a reference table to create from with CTAS, and sometimes we just want a simple DDL statement to recreate a table.
I wanted to edit a BigQuery schema to change a column's mode: the previous mode is NULLABLE and now it should be REQUIRED (the column has only been loaded with non-null values so far).
Looking at all these scenarios and the lengthy solutions in the Google documentation, which rely on API calls, the UI, or scripts rather than plain SQL statements, I am not impressed with BigQuery and its many limitations. The BigQuery web UI is also so small that you need to scroll many times to see a whole query, among many other web UI issues.
I just wanted to know how you are all handling and coping with BQ.
I would like to elaborate a little on the comments from @Pentium10 and @guillaume blaquiere.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column-based data processing technology called Dremel and is able to run queries against up to 20 different data sources and 200 GB of data concurrently. The Prediction API allows users to create and train a model hosted within Google’s system. The API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything that has been used as a big data tool. Nothing seems to compare to the speed and the amount of data that can be fitted into BigQuery. Data views are possible and recommended with basic data visualization tools.
This product typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies but it complements them. Real-time streams representing sensor data, web server logs or social media graphs can be ingested into BigQuery to be queried in real time. After running the ETL jobs on traditional RDBMS, the resultant data set can be stored in BigQuery. Data can be ingested from the data sets stored in Google Cloud Storage, through direct file import or through streaming data
I recommend having a look at the book Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale, which includes a walkthrough of how to use the service and a deep dive into how it works.
Beyond that, I found a really interesting article on Medium for data engineers new to BigQuery, which covers considerations regarding DDL and the UI as well as best practices.
I hope you find the above pieces of information useful.
In this repository, its author mentions that we can stage OLAP cubes in Cassandra or S3:
Once the data is in Redshift, our chief goal is for the BI apps to be
able to connect to Redshift cluster and do some analysis. The BI apps
can either directly connect to the Redshift cluster or go through an
intermediate stage where data is in the form of aggregations
represented by OLAP cubes.
How is it possible? How would that work? Am I missing any essential concept? As I understand it, OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Key features of OLAP are:
pivoting
slicing
dicing
drilling
And Redshift can do this.
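As a rough illustration of what these operations mean, here is a minimal sketch in plain Python over a handful of hypothetical fact rows. It is not Redshift code; in a warehouse the same operations are expressed as `WHERE` filters and `GROUP BY` aggregations. Roll-up is shown as the counterpart of drilling down.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
FACTS = [
    (2023, "EU", "books", 10),
    (2023, "EU", "games", 4),
    (2023, "US", "books", 7),
    (2024, "EU", "books", 12),
    (2024, "US", "games", 9),
]


def slice_(rows, year):
    """Slice: fix one dimension (year) to a single value."""
    return [r for r in rows if r[0] == year]


def dice(rows, years, regions):
    """Dice: restrict several dimensions to sets of values."""
    return [r for r in rows if r[0] in years and r[1] in regions]


def roll_up(rows, dim_index):
    """Roll-up: sum sales along one dimension (drill-down goes the other way)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[3]
    return dict(totals)


def pivot(rows):
    """Pivot: regions as rows, years as columns, summed sales as cells."""
    table = defaultdict(dict)
    for year, region, _product, sales in rows:
        table[region][year] = table[region].get(year, 0) + sales
    return {region: dict(cols) for region, cols in table.items()}
```

For example, `roll_up(FACTS, 1)` totals sales per region, while `pivot(FACTS)` lays the same totals out by region and year.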
Its architecture is aimed at OLAP and BI tasks. See the Amazon Redshift Developer Guide:
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.
But the line between the terms is blurry.
As Diana Shealy said:
Stop Abusing OLTP as OLAP
There’s a lot of confusion in the market between OLTP and OLAP, and due to the high price of commercial OLAPs, startups and budget-constrained developers have gone on to abuse an OLTP database as an OLAP database. The abuse falls into two categories:
An often multi-shard MySQL database with application layer scripting to perform historical event data analysis. Although this setup is extremely common, it is one of the least productive ways to approach analytics. MySQL is not optimized in any way for reading large ranges of data and its support for analytic functions is weak. As there are multiple alternatives, avoid this “inexpensive” solution because you’ll be paying the price in other places eventually.
Using PostgreSQL as an OLAP layer. This is a more legitimate choice than above for starting an analytics platform because of Postgres’s solid analytic User Defined Functions (UDFs). Also, thanks to its c-store extension, PostgreSQL can be turned into a columnar database, making it an affordable alternative to commercial OLAPs.
Finally, if you are considering moving from OLTPs abused as OLAPs to “real” OLAPs like Redshift, I encourage you to learn how to use Redshift’s COPY Command so that you can start seeing your data inside Redshift.
As for your questions:
How is it possible?
It's possible due to Redshift's architecture (columnar database) and analytical features such as:
Window functions
Data Warehouse System Architecture
Performance
Columnar Storage
Internal Architecture and System Operation
Workload Management
Aggregate functions
How would that work?
See System and Architecture Overview for a detailed explanation of the Amazon Redshift data warehouse system architecture.
(Some links are already mentioned before in this post)
Essential concept?
Am I missing any essential concept?
I'd suggest relying more on the technical details of a specific solution than on marketing terms. In the end, practical tasks are solved not by software naming or marketing, but by the software's real functionality.
What's really important in the DB landscape is to consider two theorems:
CAP theorem
According to the CAP theorem, a distributed data store can guarantee at most two of the following three properties:
* consistency
* availability
* partition tolerance
PIE theorem
Rick Houlihan of Amazon gave a talk on choosing a DB architecture. In addition to the CAP theorem, he also presented the PIE theorem:
The PIE theorem posits that you can choose two out of three desirable features in a data system:
Pattern Flexibility
Efficiency
Infinite Scale
Redshift sits on the PI side of the PIE triangle (pattern flexibility and infinite scale).
Data structure
As I understand it, OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Both OLAP aggregated data structures and Redshift distribution styles share one goal: making queries faster.
Columnar storage, data distribution, parallel queries and other features are good for analytical tasks.
UPD
In the comments you asked whether Cassandra can work as an OLAP service.
Cassandra and S3 can be used as storage for pre-calculated aggregates over dimensions.
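To show what "pre-calculated aggregated data" could look like, here is a minimal sketch that computes a measure for every combination of dimensions and stores it under a flat key. A plain dict stands in for whatever key-value or object store holds the result; the key layout, the `"*"` wildcard, and the sample rows are assumptions for illustration, not something taken from the repository in question.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dims, measure):
    """Pre-aggregate `measure` for every subset of `dims`.

    Keys are tuples with one slot per dimension; "*" in a slot means
    "aggregated over this dimension".
    """
    cube = defaultdict(int)
    for row in rows:
        for size in range(len(dims) + 1):
            for kept in combinations(dims, size):
                key = tuple(row[d] if d in kept else "*" for d in dims)
                cube[key] += row[measure]
    return dict(cube)


# Hypothetical fact rows.
ROWS = [
    {"year": 2023, "region": "EU", "sales": 10},
    {"year": 2023, "region": "US", "sales": 7},
    {"year": 2024, "region": "EU", "sales": 12},
]
CUBE = build_cube(ROWS, dims=("year", "region"), measure="sales")
```

A BI tool can then answer "sales in 2023 across all regions" with a single key lookup, `CUBE[(2023, "*")]`, instead of scanning the fact rows. The trade-off is that the number of keys grows quickly with the number of dimensions.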
I am a novice in the GCP stack, so I am confused by the number of GCP technologies for storing data:
https://cloud.google.com/products/storage
Although Cloud Spanner is not mentioned in the article above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current view I don't see any significant difference between Cloud SQL (with Postgres under the hood) and Cloud Spanner. I found that it has a slightly different syntax, but that doesn't tell me when I should prefer this technology to Cloud SQL.
Could you please explain it?
P.S.
I think of Cloud SQL as a traditional database with automatic replication and horizontal scalability managed by Google.
There is not a big difference between them in terms of what they do (storing data in tables). The difference is in how they handle data at small and large scale.
Cloud Spanner is meant for handling massive amounts of data with strong consistency and high throughput (100,000+ reads/writes per second). Spanner gives much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store some customer data cheaply and don't want to deal with server configuration, Cloud SQL is the right choice.
If you are planning to create a big product or if you want to be ready for a huge increase in users for your application (viral games/applications) Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper
The main difference between Cloud Spanner and Cloud SQL is the horizontal scalability + global availability of data over 10TB.
Spanner isn’t for generic SQL needs, Spanner is best used for massive-scale opportunities. 1000s of writes per second, globally. 10,000s - 100,000s of reads per second, globally.
Above volume is extremely difficult to achieve with NORMAL SQL / MySQL without doing complex sharding of the database. Spanner deals with all this AND allows ACID updates (which is basically impossible with sharded databases). They accomplish this with super-accurate clocks to manage conflicts.
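To see why application-level sharding makes multi-row updates awkward, consider this minimal sketch. It uses simple modulo sharding on a hypothetical integer `user_id`; this illustrates manual sharding of an ordinary SQL database and is not how Spanner works internally.

```python
def shard_for(user_id, num_shards):
    """Route a row to a shard by taking the key modulo the shard count."""
    return user_id % num_shards


def shards_touched(user_ids, num_shards):
    """Return the set of shards an operation on these rows would hit."""
    return {shard_for(uid, num_shards) for uid in user_ids}


def needs_cross_shard_transaction(user_ids, num_shards):
    """True when the rows live on more than one shard."""
    return len(shards_touched(user_ids, num_shards)) > 1
```

With four shards, an update touching users 1 and 5 stays on one shard, but an update touching users 1 and 2 spans two independent databases, so a single local ACID transaction no longer covers it and the application has to coordinate the commit itself.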
In short, Spanner is not for CRM databases, it is more for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to cloud SQL), the project should be large enough to justify the additional cost of Spanner.
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/
Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.
I would add to the previous answers that Cloud SQL provides managed instances of MySQL or PostgreSQL or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.
We are currently migrating DynamoDB tables to Spanner. Since DynamoDB is a NoSQL database with indexing, migrating from NoSQL to a relational database has become a difficult task. The only reason we are migrating to Spanner is secondary indexes. But after migrating a few tables, we are seeing latency issues in Spanner. Initially we planned to migrate to Cloud Bigtable, but unfortunately it doesn't support secondary indexes. Now, because of the latency issue and high read/write traffic, Spanner performance is degrading. Is there any other data store in GCP that would be more suitable for this kind of use case, where we can have NoSQL as well as secondary indexes? We have around 200 TB of data in DynamoDB.
According to the Cloud Spanner Quotas & Limits documentation, for improved performance you should have one node for every 2 TB of data stored. Considering that, I would recommend checking how many nodes you currently have active and raising that number to improve the performance of your database.
This documentation page lists the best practices for configuring Spanner for the best possible performance.
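As a quick sanity check of that sizing rule, here is a minimal sketch of the arithmetic. The 2 TB per node figure is taken from the answer above and may not match the current Spanner limits, so treat it as a parameter rather than a fact.

```python
import math


def min_nodes_for_storage(data_tb, tb_per_node=2.0):
    """Smallest node count that keeps each node at or below tb_per_node.

    tb_per_node defaults to the figure quoted in this answer; check the
    current quotas before relying on it.
    """
    return math.ceil(data_tb / tb_per_node)
```

For the roughly 200 TB mentioned in the question, `min_nodes_for_storage(200)` gives 100 nodes on storage grounds alone; read/write throughput may require more.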
In case this doesn't help, could you please take a look at the documentation Troubleshooting performance regressions? This way, you can take a further look at what might be affecting the performance of your Spanner.
Let me know if the information helped you!
Go with Firestore in Datastore mode. It has secondary indexes, is serverless, and is practically unlimited in throughput. It is a NoSQL database as well.
Is there any reason why someone would use Bigtable instead of BigQuery? Both seem to support Read and Write operations with the latter offering also advanced 'Query' operations.
I need to develop an affiliate network (thus I need to track clicks and 'sales') so I'm quite confused by the difference because BigQuery seems to be just Bigtable with a better API.
The difference is basically this:
BigQuery is a query engine for datasets that don't change much, or change by appending. It's a great choice when your queries require a "table scan" or need to look across the entire database. Think sums, averages, counts, groupings. BigQuery is what you use when you have collected a large amount of data, and need to ask questions about it.
Bigtable is a database. It is designed to be the foundation for a large, scalable application. Use Bigtable when you are making any kind of app that needs to read and write data, and scale is a potential issue.
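The two access patterns can be contrasted with a minimal sketch. A plain dict stands in for the data store, and the row-key format and click events are hypothetical; neither function uses the real Bigtable or BigQuery client libraries.

```python
from collections import defaultdict

# Hypothetical click events keyed by "affiliate#click_id".
EVENTS = {
    "aff1#0001": {"affiliate": "aff1", "sale": 0.0},
    "aff1#0002": {"affiliate": "aff1", "sale": 25.0},
    "aff2#0001": {"affiliate": "aff2", "sale": 40.0},
}


def get_event(store, row_key):
    """Serving-style access: fetch one row by its key."""
    return store.get(row_key)


def sales_by_affiliate(store):
    """Analytics-style access: scan every row and aggregate."""
    totals = defaultdict(float)
    for event in store.values():
        totals[event["affiliate"]] += event["sale"]
    return dict(totals)
```

The first function is the kind of read an application does on every request, which is where a key-value database fits. The second touches every row, which is the kind of question a warehouse is built to answer.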
This may help a bit in deciding between different datastore solutions that Google cloud offers (Disclaimer! Copied from Google Cloud page)
If your requirement is a live database, BigTable is what you need (Not really an OLTP system though). If it is more of an analytics kind of purpose, then BigQuery is what you need!
Think of OLTP vs OLAP; Or if you are familiar with Cassandra vs Hadoop, BigTable roughly equates to Cassandra, BigQuery roughly equates to Hadoop (Agreed, it's not a fair comparison, but you get the idea)
https://cloud.google.com/images/storage-options/flowchart.svg
Note
Please keep in mind that Bigtable is not a relational database: it does not support SQL queries, JOINs, or multi-row transactions. Also, it is not a good solution for small amounts of data. If you want an RDBMS for OLTP, you might need to look at Cloud SQL (MySQL/PostgreSQL) or Spanner.
Cost Perspective
https://stackoverflow.com/a/34845073/6785908. Quoting the relevant parts here.
The overall cost boils down to how often you will 'query' the data. If
it's a backup and you don't replay events too often, it'll be dirt
cheap. However, if you need to replay it daily once, you will start
triggering the 5$/TB scanned very easily. We were surprised too how
cheap inserts and storage were, but this is ofc because Google expects
you to run expensive queries at some point in time on them. You'll
have to design around a few things though. E.g. AFAIK streaming
inserts have no guarantees of being written to the table and you have
to poll frequently on tail of list to see if it was really written.
Tailing can be done efficiently with time range table decorator,
though (not paying for scanning whole dataset).
If you don't care about order, you can even list a table for free. No
need to run a 'query' then.
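To get a feel for how quickly scan charges add up, here is a minimal sketch of the arithmetic. The $5 per TB scanned figure comes from the quoted answer and may be out of date, so it is a parameter here, and the function ignores storage and insert costs.

```python
def monthly_scan_cost_usd(tb_per_replay, replays_per_month, usd_per_tb=5.0):
    """Estimated on-demand query cost for repeatedly scanning a dataset.

    usd_per_tb defaults to the price quoted above; check current pricing.
    """
    return tb_per_replay * replays_per_month * usd_per_tb
```

Replaying a 2 TB dataset once a day for 30 days comes to `monthly_scan_cost_usd(2, 30)`, i.e. 300 USD, whereas replaying it once a month costs 10 USD.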
Edit 1
Cloud Spanner is relatively young, but it is powerful and promising. At least, Google marketing claims that its features are the best of both worlds (traditional RDBMS and NoSQL).
BigQuery and Cloud Bigtable are not the same. Bigtable is a NoSQL wide-column database (it exposes an HBase-compatible API), whereas BigQuery is a SQL-based data warehouse. They have specific usage scenarios.
In very short and simple terms:
If you don’t require support for ACID transactions or if your data is not highly structured, consider Cloud Bigtable.
If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.