gcp BigQuery for Dimensional Star Schema Data Warehouse build performance - google-cloud-platform

Google states that BigQuery is intended for data warehouses that are largely append-only, with few updates.
For a star-schema DWH with optional fact table attributes that can be updated and dimensions that are historized, is BigQuery a viable option, or do we need the Redshift approach of small staging tables, populated with the new or updated data, that feed an UPSERT query?
Is this type of approach possible in BigQuery using Spark?
spark.sql(""" MERGE INTO CUSTOMERS_AT_REST
USING CUST_DELTA
ON CUSTOMERS_AT_REST.col_key = CUST_DELTA.col_key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
""")
This works fine with Delta on GCP Cloud Storage.
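For comparison, BigQuery's standard SQL has its own MERGE statement, so the same staging-table upsert can also be written natively rather than through Spark. The following is a minimal sketch that only builds the statement text. The dataset name `dwh`, the table names and the columns `name` and `city` are assumptions made for illustration and do not come from the question. Submitting the statement through a BigQuery client is left out.

```python
def build_merge_sql(target, staging, key, columns):
    """Build a BigQuery MERGE (upsert) statement fed by a staging table.

    All table and column names are supplied by the caller; the ones used
    in the example call below are hypothetical.
    """
    all_cols = [key] + list(columns)
    # Matched rows: overwrite every non-key column with the staged value.
    set_clause = ", ".join(f"{c} = S.{c}" for c in columns)
    # Unmatched rows: insert the full staged row.
    col_list = ", ".join(all_cols)
    val_list = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE `{target}` T\n"
        f"USING `{staging}` S\n"
        f"ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN\n"
        f"  UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({col_list}) VALUES ({val_list})"
    )


# Hypothetical names, for illustration only.
merge_sql = build_merge_sql("dwh.customers", "dwh.cust_delta", "col_key", ["name", "city"])
print(merge_sql)
```

Keeping the staging table small (only new or changed rows) follows the same idea as the Redshift staging approach mentioned above.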

Related

Data Fusion replication pipeline is not syncing data in Google Bigquery

Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We followed the method described in the Google docs, using a Data Fusion replication pipeline, as described at this link:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Brief of what we are doing:
Enabling the binary log in MySQL for CDC (change data capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We were able to set up the MySQL database on Compute Engine, enable the binary log for CDC, and grant the replication pipeline's user all the necessary permissions in MySQL.
We were able to create a Data Fusion instance and a replication pipeline.
The replication pipeline is able to fetch our MySQL database details, and the target BigQuery is also set.
On starting the pipeline, it tracks the changes (insert, update and delete) successfully, and the table schema is created in BigQuery automatically.
The problem is that no data is transferred to the BigQuery table. In the log, what I see is a message about loading a batch of 1 event into the staging bucket.
I am sharing a screenshot as well.
Every change is fetched from MySQL, but the data is not transferred to BigQuery.
The table schema was created, but no data was transferred.
The log shows a batch of 1 event being loaded into the staging bucket. We are using developer mode and waited for more than 90 minutes.
The issue might be caused by a schema or data type mismatch between the columns of the BigQuery table and those of the source MySQL table.
For example, a column might be of INT64 datatype with a length of 19 in BigQuery, while in the source database table it is an Integer type with a length of 10. In that case you need to update the column lengths to fit your data size.

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that some customers have 5M rows and others 200K, and as BQ does not have indexes, every customer's query ends up processing the other customers' data as well (and the costs are rising).
I intend to create a timestamp field where each customer has a different, fixed timestamp that is repeated on every insert into every customer-sensitive table. That way I can filter on the timestamp just as I would on a standard ID.
Does this make any sense? If BQ was an indexed database I'd be concerned about skewed data but as it is always full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a cluster field (clustering column) to your table, which plays a role roughly similar to an index in other databases.
This link provides the basics on how to use a cluster field.
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a cluster field, a BigQuery dry run doesn't show the cost improvement, which can only be seen post-execution.
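As a rough illustration of the clustering suggestion, the sketch below builds a CREATE TABLE statement that clusters on idClient. Only idClient comes from the question. The table name `mydataset.customer_events` and the other columns are assumptions, and the statement would still have to be run in BigQuery.

```python
def clustered_table_ddl(table, columns, cluster_by):
    """Build a BigQuery CREATE TABLE statement with a CLUSTER BY clause.

    `columns` is a list of (name, type) pairs; `cluster_by` lists the
    clustering columns in priority order.
    """
    col_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {col_defs}\n)\n"
        f"CLUSTER BY {', '.join(cluster_by)}"
    )


# Hypothetical table and columns, except idClient which the question mentions.
ddl = clustered_table_ddl(
    "mydataset.customer_events",
    [("idClient", "INT64"), ("event_ts", "TIMESTAMP"), ("payload", "STRING")],
    ["idClient"],
)
print(ddl)
```

With the table clustered this way, the existing per-customer views can keep their idClient = code predicate unchanged.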

How to monitor the number of records loaded into BQ table while using big query streaming?

We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way to keep track of the number of records inserted into BigQuery? We need this data for reconciliation purposes.
Add a step to your Dataflow pipeline that calls the BigQuery API method Tables.get, or run this query before and after the flow (both are equally good).
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
As an example, the query returns this
You may also be able to examine the "Elements added" count by clicking on the step that writes to BigQuery in the Dataflow UI.
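The before-and-after idea can be sketched as a small reconciliation helper. This is only an illustration: the function names and the sample counts are made up, and actually running the query against BigQuery is left to whatever client the pipeline already uses.

```python
def row_count_query(dataset, table_id):
    """Return the metadata query from the answer for one table."""
    return (
        f"select row_count, table_id from `{dataset}.__TABLES__` "
        f"where table_id = '{table_id}'"
    )


def reconcile(count_before, count_after, records_sent):
    """Compare the growth in row_count with the number of records sent."""
    loaded = count_after - count_before
    return {
        "loaded": loaded,
        "expected": records_sent,
        "missing": records_sent - loaded,
        "ok": loaded == records_sent,
    }


query = row_count_query("dataset", "audit")

# Hypothetical counts read before and after the flow.
result_ok = reconcile(count_before=1000, count_after=1500, records_sent=500)
result_short = reconcile(count_before=1000, count_after=1490, records_sent=500)
print(result_ok)
print(result_short)
```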

AWS Redshift purge policy automation

The AWS Redshift team recommends using TRUNCATE to clean up a large table.
I have a continuous EC2 service that keeps adding rows to a table. I would like to apply some purging mechanism, so that when the cluster is near full it will auto delete old rows (say using the index column).
Is there some best practice for doing that?
Do I need to write my own code to handle that? (if so is there already a Python script for that that I can use e.g. in a Lambda function?)
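A common practice when dealing with continuous data is to create a separate table for each month, e.g. Sales-2018-01, Sales-2018-02.
Then create a VIEW that combines the tables (the names need double quotes because they contain hyphens, and UNION ALL keeps every row, whereas a plain UNION would also remove duplicate rows):
CREATE VIEW sales AS
SELECT * FROM "Sales-2018-01"
UNION ALL
SELECT * FROM "Sales-2018-02"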
Then, create a new table each month and remove the oldest month from the View. This effectively gives a 12-month rolling view of the data.
The benefit is that data does not have to be deleted from tables (which would then require a VACUUM). Instead, the old table can simply be dropped, or kept around for historical reporting with a different View.
See: Using Time Series Tables - Amazon Redshift
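On the question of automating this (for example from a scheduled job), the monthly view rebuild could be generated along these lines. This is a sketch under assumptions: the helper names are hypothetical, the table naming follows the Sales-YYYY-MM pattern from the answer, and executing the statement against the cluster (plus dropping the table that rolls off) is not shown.

```python
from datetime import date


def last_n_months(latest, n):
    """Return (year, month) pairs for the n months ending at `latest`, oldest first."""
    months = []
    year, month = latest.year, latest.month
    for _ in range(n):
        months.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(months))


def rolling_view_sql(view, prefix, latest, n=12):
    """Build a CREATE OR REPLACE VIEW over the n most recent monthly tables."""
    selects = [
        f'SELECT * FROM "{prefix}-{y:04d}-{m:02d}"'
        for y, m in last_n_months(latest, n)
    ]
    return f"CREATE OR REPLACE VIEW {view} AS\n" + "\nUNION ALL\n".join(selects)


# Three-month window ending February 2018, to keep the output short.
view_sql = rolling_view_sql("sales", "Sales", date(2018, 2, 1), n=3)
print(view_sql)
```

Regenerating the view each month keeps the window rolling without any DELETE, which is the point the answer makes about avoiding VACUUM.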

Columnar database queries in Amazon Redshift

I'm learning Amazon Redshift. I've heard that it is a very powerful cloud data store and that it works very fast on data where aggregate operations are required, because it stores data column-wise.
I am not able to find any example queries. Could someone share some examples of aggregate queries running on Amazon Redshift? Are they different from normal relational database queries?
You are correct -- Amazon Redshift is a columnar database. This means that data is stored on disk per column, making operations on a column very fast. For example, summing the Sales column for a particular value in the Country column only requires accessing two columns rather than all columns in a table.
Other benefits are that data in Redshift is compressed (which works well with the columnar concept, because each column uses its own compression method based on the data stored) and the fact that it is a clustered database, so compute and storage can be scaled by adding additional nodes.
Amazon Redshift presents itself as a PostgreSQL database, so you just use industry-standard SQL to query data. No changes to queries are required.
However, you can optimize Redshift by wisely choosing a Distribution Key for each table, which determines how data is distributed amongst nodes, and by carefully selecting the Sort Key, which determines how data is stored on each node. Put simply, data should be distributed by how you JOIN tables and should be sorted by what you use in WHERE clauses.
As for sample queries... it totally depends upon your data! Queries look exactly the same as normal SQL.
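To make "exactly the same as normal SQL" concrete, here is a minimal sketch of an aggregate query of the kind described above (total Sales per Country). It runs against an in-memory SQLite database purely as a stand-in so that the example is self-contained. The table, columns and rows are invented, and on Redshift the same query text would be sent over a regular PostgreSQL-style connection instead.

```python
import sqlite3

# SQLite is only a local stand-in here; the query itself is plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("US", "A", 100), ("US", "B", 50), ("DE", "A", 70)],  # made-up rows
)

# Only the country and sales columns are touched, which is the access
# pattern a columnar store is good at.
AGGREGATE_SQL = """
SELECT country, SUM(sales) AS total_sales, COUNT(*) AS num_orders
FROM sales
GROUP BY country
ORDER BY country
"""

totals = conn.execute(AGGREGATE_SQL).fetchall()
print(totals)  # [('DE', 70, 1), ('US', 150, 2)]
```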