Based on many sources, I understood that both BigQuery and Bigtable are considered data storage solutions.
For example, here it is written that we can consider BigQuery as data storage if we need statistics (for example sums, averages, counts) over a huge amount of data. In contrast, Bigtable can be considered a usual NoSQL store.
On the other hand, I've read a snippet from a book which mentions that:
From this snippet I understood that BigQuery is a tool for querying from anywhere, but not for data storage. Could you please clarify?
BigQuery is a data warehouse solution in Google Cloud Platform. In BigQuery you can have two kinds of persistent tables:
Native/internal: in this kind of table, you load data from some source (a file in GCS, a file that you upload in the load job, or you can even create an empty table). Once created, this table stores its data on BigQuery's own storage system.
External: this kind of table is basically a pointer to some external storage such as GCS, Google Drive, or Bigtable.
Furthermore, you can also create temporary tables that exist only during the query execution.
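As a rough illustration, here is a minimal sketch of creating both kinds of persistent table with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are hypothetical placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Native table: load a CSV from GCS; the data is copied into BigQuery's own storage
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",             # hypothetical source file
    "my-project.my_dataset.native_table",  # hypothetical destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load job to finish

# External table: only a pointer to files that stay in GCS
ext_config = bigquery.ExternalConfig("CSV")
ext_config.source_uris = ["gs://my-bucket/data.csv"]
ext_config.autodetect = True

table = bigquery.Table("my-project.my_dataset.external_table")
table.external_data_configuration = ext_config
client.create_table(table)
```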
Both BigQuery and BigTable can be used as storage for data.
The point is that BigQuery is more like an engine to run queries and make aggregations on huge amounts of data that can be on external sources or even inside BigQuery's own storage system. That makes it good for data analysis.
BigTable is more like a NoSQL database that can deal with petabytes of data and gives you mechanisms to perform data analysis as well. As Google's storage decision flowchart shows, BigTable should be chosen instead of BigQuery if you need low latency.
I started working on an ML Ops project on AWS SageMaker, and I have a question about how to store and process the data. I have a company DB with tens of millions of invoices and clients that should be cleaned and transformed a bit for some classification and regression jobs.
What would be the best approach: build a new DB and develop ETL jobs that take the data from the standard DB, clean and transform it, and put it into the "ML DB" (which I then use directly for my models), or build jobs that take the data from the standard DB, process it, and save it as huge CSV files in S3 buckets?
Intuitively, relational DB -> process -> NoSQL/relational DB seems like a better approach than relational DB -> process -> huge CSV file. I didn't find anything about this on Google, and all the AWS SageMaker docs use CSV files on S3 as examples and don't mention building ML pipelines directly on relationally stored data. What would be the best approach, and why?
Your first approach sounds fine:
The original data is left as-is in case you need to repeat the process later (perhaps with changes / improvements).
The system will work for any data source once you plumb it in; i.e. you can re-use the transformation and load parts.
I don't know anything about SageMaker or the whole CSV thing, but once you have the data in your ML DB you can obviously export it to any format you like later, such as CSV.
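As a rough sketch of that first approach (pandas and SQLAlchemy are my own choice here, not something from the question; connection strings, table names, and column names are all hypothetical):

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection strings -- replace with your own
source = sqlalchemy.create_engine("postgresql://user:pass@source-host/company_db")
ml_db = sqlalchemy.create_engine("postgresql://user:pass@ml-host/ml_db")

# Extract: pull a batch of invoices from the standard DB
invoices = pd.read_sql(
    "SELECT * FROM invoices WHERE invoice_date >= '2023-01-01'", source
)

# Transform: example cleaning steps (drop incomplete rows, normalize types)
invoices = invoices.dropna(subset=["client_id", "amount"])
invoices["amount"] = invoices["amount"].astype(float)

# Load: write the cleaned data into the "ML DB" used for training
invoices.to_sql("invoices_clean", ml_db, if_exists="append", index=False)

# Optional: also export the same frame as CSV to S3 for SageMaker
# (pandas can write directly to s3:// paths if s3fs is installed)
invoices.to_csv("s3://my-ml-bucket/invoices_clean.csv", index=False)
```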
I have a use case where a QLDB table contains this:
customer-id | customer-name | customer-item-count
I need to publish metrics per customer-id to CloudWatch every 5 minutes, and this data is available in the QLDB table.
Is there a way to do this?
QLDB has export jobs to export the content to S3; is there tooling to dump the contents to CloudWatch?
Many customers use periodic S3 exports (or Kinesis integration, if you signed up for the preview) to keep some sort of analytics DB up to date. For example, you might copy data into Redshift or ElasticSearch every minute. I don't have code examples to share with you right now. The tricky part is getting the data into the right shape for the destination. For example, QLDB supports nested content while Redshift does not.
Once the data is available and aggregated in the way you wish to query it, it should be a simple matter to run a report and write the results into CloudWatch.
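For the CloudWatch part, a minimal sketch with boto3 might look like this; the namespace, metric name, and per-customer counts are made up, and in practice you would compute them from your S3 export or analytics DB:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical aggregation result, e.g. computed from the latest S3 export
per_customer_counts = {"cust-001": 42, "cust-002": 7}

metric_data = [
    {
        "MetricName": "CustomerItemCount",
        "Dimensions": [{"Name": "CustomerId", "Value": customer_id}],
        "Value": float(count),
        "Unit": "Count",
    }
    for customer_id, count in per_customer_counts.items()
]

# CloudWatch limits how many metrics one PutMetricData call may carry,
# so chunk large batches into multiple calls.
cloudwatch.put_metric_data(Namespace="MyApp/QLDB", MetricData=metric_data)
```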
Does GCP provide a data catalog service like "Microsoft Azure Data Catalog"?
If it does not, what are the options for doing data cataloging?
They just announced it a few hours ago - it seems to be in Beta still: https://cloud.google.com/data-catalog/
Cloud Data Catalog was announced at Google Cloud Next '19. You can see the session here.
Google Cloud Data Catalog supports schematized tags (e.g., enum, bool, datetime) and not just simple text tags, and you don't need to define which data will be published; you just enable it and start using it (except for Cloud Storage).
Cloud Data Catalog is in beta today (May 1, 2019) and it works with:
BigQuery
Cloud Storage (you must make it indexable by creating a managed entry)
Cloud Pub/Sub
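As a quick illustration (not from the answer above), looking up the catalog entry for a BigQuery table with the Python client could look roughly like this; the project, dataset, and table names are placeholders:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the catalog entry for a BigQuery table by its linked resource name
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/my-project"
            "/datasets/my_dataset/tables/my_table"
        )
    }
)
print(entry.name, entry.type_)
```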
The two closest products from Google Cloud that match Microsoft's Azure Data Catalog are both still in beta:
Google Cloud Dataprep: a tool to visually explore your data in GCP, where you can clean and prepare data for analysis. I wouldn't say it is a complete replacement, though.
Google Data Studio: I think this is GCP's analog and it matches a lot of Azure Data Catalog's functionality.
We are researching on creating a Data Lake solution on AWS - similar to what's outlined here - https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
We will be storing all the "raw" data in S3 and loading it into EMR or Redshift as needed.
At this stage, I am looking for suggestions on whether to use the ETL or the ELT approach for loading data into Amazon Redshift. We will be using Talend for ETL/ELT.
Should we stage the "raw" data from S3 in Redshift first before transforming it or should we transform the data in S3 and load it into Redshift?
I would appreciate any suggestions/advice.
Thank you.
Definitely ELT.
The only case where ETL may be better is if you are simply taking one pass over your raw data, then using COPY to load it into Redshift, and then doing nothing transformational with it. Even then, because you'll be shifting data in and out of S3, I doubt this use case will be faster.
As soon as you need to filter, join, and otherwise transform information, it is much faster to do it in the DBMS. If you hit a case where the data transformation relies on data that is already in the DW, it will be orders of magnitude faster.
We run hundreds of ELT jobs a day on different DW platforms, performance testing alternative methods of ingesting and transforming data. In our experience the difference between ETL and ELT in an MPP DW can be 2000+ percent.
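A minimal sketch of that ELT pattern follows (psycopg2 is my own choice here; the cluster endpoint, IAM role, schemas, and column names are all hypothetical):

```python
import psycopg2

# Hypothetical connection details -- replace with your own
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)

with conn, conn.cursor() as cur:
    # L: load the raw files from S3 into a staging table as-is
    cur.execute("""
        COPY staging.raw_events
        FROM 's3://my-data-lake/raw/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """)
    # T: transform inside Redshift, where filters and joins run on the MPP engine
    cur.execute("""
        INSERT INTO analytics.fact_events
        SELECT e.event_id,
               e.event_ts::timestamp,
               c.customer_key,
               e.amount::decimal(12, 2)
        FROM staging.raw_events e
        JOIN analytics.dim_customer c ON c.customer_id = e.customer_id
        WHERE e.amount IS NOT NULL;
    """)
```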
It depends on the purpose of having Redshift. If your business case is for users to query the data in Redshift (or a front-end application using Redshift as the backend), then I would not recommend doing ETL in Redshift. In that case, it would be better to perform your business transformations ahead of time (e.g. S3 -> EMR -> S3) and then load the processed data into Redshift.
Is there any reason why someone would use Bigtable instead of BigQuery? Both seem to support read and write operations, with the latter also offering advanced 'query' operations.
I need to develop an affiliate network (so I need to track clicks and 'sales'), and I'm quite confused by the difference because BigQuery seems to be just Bigtable with a better API.
The difference is basically this:
BigQuery is a query Engine for datasets that don't change much, or change by appending. It's a great choice when your queries require a "table scan" or the need to look across the entire database. Think sums, averages, counts, groupings. BigQuery is what you use when you have collected a large amount of data, and need to ask questions about it.
BigTable is a database. It is designed to be the foundation for a large, scalable application. Use BigTable when you are making any kind of app that needs to read and write data, and scale is a potential issue.
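To make that concrete for the click-tracking use case, a rough sketch (instance, table, dataset, and column names are placeholders) could write each click to Bigtable for low-latency access and run the aggregations in BigQuery:

```python
import datetime
from google.cloud import bigquery, bigtable

# Bigtable: low-latency point writes, one row per click as it happens
bt_client = bigtable.Client(project="my-project")
clicks = bt_client.instance("clicks-instance").table("clicks")
row = clicks.direct_row(b"affiliate42#2024-05-01T12:00:00Z")  # hypothetical row key
row.set_cell(
    "stats",                           # column family (must already exist on the table)
    "url",
    b"https://example.com/offer",
    timestamp=datetime.datetime.now(datetime.timezone.utc),
)
row.commit()

# BigQuery: scan-heavy aggregations over everything collected so far
bq_client = bigquery.Client()
query = """
    SELECT affiliate_id, COUNT(*) AS clicks, SUM(sale_amount) AS revenue
    FROM `my-project.tracking.click_events`
    GROUP BY affiliate_id
"""
for r in bq_client.query(query).result():
    print(r.affiliate_id, r.clicks, r.revenue)
```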
This may help a bit in deciding between the different datastore solutions that Google Cloud offers (disclaimer: copied from the Google Cloud page).
If your requirement is a live database, BigTable is what you need (though it's not really an OLTP system). If the purpose is more analytical, then BigQuery is what you need!
Think of OLTP vs. OLAP; or, if you are familiar with Cassandra vs. Hadoop, BigTable roughly equates to Cassandra and BigQuery roughly equates to Hadoop (agreed, it's not a fair comparison, but you get the idea).
https://cloud.google.com/images/storage-options/flowchart.svg
Note
Please keep in mind that Bigtable is not a relational database: it does not support SQL queries or JOINs, nor does it support multi-row transactions. It is also not a good solution for small amounts of data. If you want an RDBMS for OLTP, you might need to look at Cloud SQL (MySQL/PostgreSQL) or Spanner.
Cost Perspective
https://stackoverflow.com/a/34845073/6785908. Quoting the relevant parts here.
The overall cost boils down to how often you will 'query' the data. If it's a backup and you don't replay events too often, it'll be dirt cheap. However, if you need to replay it once daily, you will start triggering the $5/TB-scanned charge very easily. We were surprised too how cheap inserts and storage were, but this is of course because Google expects you to run expensive queries on them at some point in time. You'll have to design around a few things, though. E.g., AFAIK streaming inserts have no guarantee of being written to the table, and you have to poll the tail of the list frequently to see if a row was really written. Tailing can be done efficiently with a time-range table decorator, though (so you're not paying for scanning the whole dataset).
If you don't care about order, you can even list a table for free; no need to run a 'query' then.
Edit 1
Cloud Spanner is relatively young, but it is powerful and promising. At least, Google's marketing claims that its features are the best of both worlds (traditional RDBMS and NoSQL).
BigQuery and Cloud Bigtable are not the same. Bigtable is a NoSQL wide-column database (accessible through the HBase API) whereas BigQuery is a SQL-based data warehouse. They have specific usage scenarios.
In very short and simple terms;
If you don’t require support for ACID transactions or if your data is not highly structured, consider Cloud Bigtable.
If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.