I have a BigQuery table, partitioned by date (one partition per day).
I would like to add various columns that are sometimes populated and sometimes missing, plus a column for a unique ID.
The data needs to be searchable by this unique ID. The other use case is to aggregate per column.
This unique id will have a cardinality of millions per day.
I would like to use the unique-id for clustering.
Is there any limitation on this? Has anyone tried it?
It's a valid use case to enable clustering on an ID column; the number of distinct values shouldn't cause any limitations.
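For reference, here is a minimal sketch of such a table in BigQuery DDL; the dataset, table and column names (my_dataset.events, event_date, unique_id, payload) are hypothetical:
CREATE TABLE `my_dataset.events`
(
  unique_id  STRING,
  event_date DATE,
  payload    STRING  -- one of the sometimes-populated columns
)
PARTITION BY event_date
CLUSTER BY unique_id;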
I have a database with the following tables:
Customers, Invoices, Salesman, Target.
The ones concerned about my question are Customers, Invoices.
There are customer IDs used in the Invoices table that don't exist in the Customers table.
If I used only the customers from Customers Table, my customer dimension would be incomplete.
My solution is to append these IDs from Invoices to Customers and fill other columns in the Customers table with nulls.
I don't know if this is the best approach according to Kimball.
Also, if it is a good solution, how can I accomplish it with Power BI Desktop?
Customers table: (generated data)
Invoices table: (just a sample; the full table has thousands of rows)
There are two points here:
Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.
Secondly, if you need to retain that granularity then your approach is the correct one. The way to do this in Power Query is as follows:
Create a new query which takes your customer dimension table and does a right anti join (rows only in the second table) on customer id with your invoice fact table.
Expand the newly joined table but retain only the new customer id column.
Remove all columns apart from the new customer id column.
Remove duplicates
You now have a list of missing customer ids. Ensure the column name is the same as the column name of your customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.
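For illustration only, this is roughly the equivalent logic in SQL rather than Power Query; the table and column names (Customers, Invoices, CustomerID) and the number of extra Customers columns are assumptions based on the question:
-- Append the CustomerIDs that appear in Invoices but not in Customers,
-- filling the remaining Customers columns (assumed here to be two) with NULLs.
SELECT c.CustomerID, c.CustomerName, c.City
FROM Customers AS c
UNION ALL
SELECT DISTINCT i.CustomerID, NULL, NULL
FROM Invoices AS i
LEFT JOIN Customers AS c
    ON c.CustomerID = i.CustomerID
WHERE c.CustomerID IS NULL;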
Please keep in mind that it is Kimball, not Kimble.
There are 4 steps in the Kimball DWH methodology:
1) Understand the business process (what is your process actually measuring?)
2) Decide the grain (what does every row in your fact table actually represent?)
3) Decide the dimensions (ask Who-What-Where-When-How-HowMany-HowMuch of the grain you declared together with the business process)
4) Define the facts (metrics)
According to this order, you define dimension tables before building your fact tables. If your dimension table (the Customer table in this case) is missing customers that appear in your fact table, my strongest advice according to DWH dimensional modeling is to set your Customer table right: define every customer in your dimension table, then populate your fact table with records:
[Customer ID] in Customer Table : PRIMARY KEY
[CustomerID] in Invoice Table : FOREIGN KEY
SQL and Power BI react very differently to this problem:
1) Power BI has no referential integrity concept: it adds a blank row to your dimension table in such a case.
2) SQL raises a referential integrity error, and you can't even add such rows to your fact table. I personally support the SQL behavior in this case.
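As a minimal sketch of that referential-integrity point (generic ANSI-style DDL; column names and types beyond CustomerID are illustrative):
CREATE TABLE Customers (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(100)
);

CREATE TABLE Invoices (
    InvoiceID  INT PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customers (CustomerID),
    Amount     DECIMAL(12, 2)
);

-- Inserting an invoice whose CustomerID does not exist in Customers
-- fails with a foreign key (referential integrity) error.
INSERT INTO Invoices (InvoiceID, CustomerID, Amount) VALUES (1, 999, 50.00);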
Finally: use an ETL tool (SSIS, Talend, ODI or even Power Query) to make your dimension table as accurate as possible.
For example:
Do not leave any column value as NULL!
If a date is unknown, put in a default date value like '1900-12-31'.
If a textual property is unknown, put in a keyword like 'unknown' or 'not available'.
Dimension tables are heavily queried in SQL statements, and different SQL vendors (SQL Server, Oracle, MySQL) handle NULL values differently, which can cause performance problems.
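A rough sketch of that rule while loading the dimension, using COALESCE to replace NULLs with defaults (the column names here are illustrative):
SELECT
    CustomerID,
    COALESCE(CustomerName, 'unknown')                       AS CustomerName,
    COALESCE(City, 'not available')                         AS City,
    COALESCE(FirstPurchaseDate, CAST('1900-12-31' AS DATE)) AS FirstPurchaseDate
FROM Customers;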
I can see my total BigQuery cost from the "billing" section.
However, I need to see data such as:
Which table costs me how much? I mean, I need to see the cost of each table individually.
How much cost has been created by the queries made to that table in the last month?
etc.
I would be very happy if you could help with this. I have too many tables to work out the cost from the size of each table individually.
I have published an article about Reducing your BigQuery bills with BI Engine capacity orchestration, which features a query like this:
DECLARE var_day STRING DEFAULT '2021-09-09';
SELECT
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime,
  ROUND(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes / POWER(2, 40)), 2) AS processedBytesCostProjection,
  ROUND(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / POWER(2, 40)), 2) AS billedBytesCostInUSD
FROM
  `<dataset_auditlogs>.cloudaudit_googleapis_com_data_access_*`
WHERE
  _TABLE_SUFFIX >= var_day
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime >= TIMESTAMP(var_day)
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes IS NOT NULL
ORDER BY
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes DESC
The query uses the on-demand rate of 5 USD per 1 TB processed to estimate the query cost, according to the GCP pricing table.
The output lists each completed query job with its createTime and its estimated processed/billed cost in USD.
By adding another column:
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query
you will get the raw query text, which you can use to optimize the query.
If you want to go further you can use
...job.jobStatistics.referencedTables, which lists ALL the tables the query touches, so you can see and filter on the tables you want.
The JSON view of the audit log entry helps you identify the right attribute to query and filter on.
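As an illustration of that idea (not taken from the article), the sketch below unnests referencedTables to attribute billed cost to each table a query touched. The projectId/datasetId/tableId subfields are assumed from the AuditData log format, and a query touching several tables will have its full cost counted once per table:
DECLARE var_day STRING DEFAULT '2021-09-09';
SELECT
  CONCAT(ref.datasetId, '.', ref.tableId) AS referenced_table,
  ROUND(SUM(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / POWER(2, 40))), 2) AS billedCostInUSD
FROM
  `<dataset_auditlogs>.cloudaudit_googleapis_com_data_access_*`,
  UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) AS ref
WHERE
  _TABLE_SUFFIX >= var_day
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes IS NOT NULL
GROUP BY referenced_table
ORDER BY billedCostInUSD DESC;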
My goal is to quickly & dynamically visualize a big data set (> 500 M rows) using QuickSight. To achieve quick query times, it's necessary to load all of the data into SPICE. However, AWS currently has a hard limit for the maximum number of rows that can be imported into SPICE for a single data set, which is 500 M rows. I currently don't see any option that could be used to visualize all of the data. Here are things that I already considered:
Splitting the full data set into individual QS datasets: the problem with this approach is that QuickSight requires that each visual has a single dataset as an input, so values from multiple datasets cannot be shown in the same visual. I'm aware that multiple datasets can be used within one dashboard but that would not suit the use-case of having a single plot visualizing the data.
Pivoting the table: the input table has a lot of rows, so changing the format from a long to a wide table would circumvent the SPICE row limitations. However, QuickSight doesn't seem to support using an array of columns as y-values to be plotted.
Creating a dataset per visualization: certain visualizations can theoretically be defined using fewer values than in the original data set. For example, to create a box plot over a set of groups, we mainly need the quartile values for each of the groups to be plotted, rather than the full data set, which would allow us to stay below the SPICE limitation. However, QuickSight doesn't allow creating custom plots, such as a box plot where the quartiles are already pre-computed.
Currently, the only viable approach I see is to create a dashboard per user, since most users would only be interested in a subset of rows from the full data set.
Irrespective of the approach taken, unfortunately, this limitation forces us to do some compromises.
Depending on the number of users, creating a dataset per user might become a headache to manage. So, I would suggest that if possible you use datasets that capture groups of users (example by user group, or user's country).
Pivoting the table might make it harder to build some visuals. As you said, if you pivot multiple values from different rows into an array field, then you would not be able to extract these easily in analyses (you could use string functions to extract them that way, but there are limitations around this approach too).
Also, creating a dataset per visualisation has maintenance overhead, in that you would need to update and re-ingest the dataset almost every time you change a visualisation.
Some other approaches you might consider:
Aggregate multiple rows together. For example, if your dataset has multiple rows for each user within the same minute, you could aggregate these into one row, summing up the values within that minute (see the sketch after this list). The aggregation period should be as large as possible, but keep in mind that this will affect the time granularity in your analyses/dashboards.
Prune old data. If you are more interested in recent data, then you could add a filter to only keep, say, 1 month of activity. You could then have other non-SPICE (Direct Query) datasets that do not have this restriction, but reports would be slower on older data.
Cache in an external database. You could load your data into a data warehousing database (such as AWS Redshift) and then not use SPICE in QuickSight. Of course, this will probably get more expensive.
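A minimal sketch of the per-minute aggregation idea, assuming a hypothetical events table with user_id, event_time and amount columns (DATE_TRUNC as in Redshift/PostgreSQL):
SELECT
    user_id,
    DATE_TRUNC('minute', event_time) AS event_minute,
    SUM(amount)                      AS total_amount,
    COUNT(*)                         AS event_count
FROM events
GROUP BY user_id, DATE_TRUNC('minute', event_time);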
I'm struggling to make a star schema from a set of tables with different origins (two SQL databases, Excel files and CSV reports); it's a bit of a puzzle.
The initial tables provided to me are set up like this:
The important points of this set of tables are:
In the Products table, IdProduct is not unique, because one product can be made with one type of machine in factory A and another type of machine in factory B, so there is one row for every Factory/Machine/IdProduct combination.
The OrderItems table has mixed rows with materials and products, so you have all the products in the Order and all the materials used in each product of the same Order.
The cost of the material changes daily and is updated in the system from where I get the OrderItems table.
The delivery cost is different for each order.
The packaging and fix costs are updated once a week.
The product price changes from order to order (it is set taking into account the client, day and size of the order).
I got to this model by splitting OrderItems into products and costs (materials) and joining the fixed costs and packaging costs to them. I haven't joined the delivery costs, but I end up with two fact tables and a snowflake schema:
I am thinking of Region, Factory, Machine, Date, Product, and a compound of cost concepts (materials, fixed costs, etc.) as dimensions, and the total amounts and quantities as facts. This is to compare total sales to total costs across the different dimensions.
I just want to know whether this is the correct path or there is a better way. I tried to search for more on the subject, but the case is too specific, so I found nothing.
Thanks in advance for your answers.
Choosing a star schema over a snowflake schema?
The star schema is more denormalized and can be better for performance. Along the same lines, the star schema uses fewer foreign keys, so query execution time is reduced. In almost all cases, the data retrieval speed of a star schema beats that of a snowflake schema.
But you can also split your work into data marts:
A data mart is a simple form of data warehouse focused on a single subject or line of business.
A data mart can contain star schemas and other tables for more than one warehouse pack. For example, a single data mart might contain the data for your reporting needs related to costs.
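Purely as an illustrative sketch (not a prescription), a single star-schema fact table built around the dimensions mentioned in the question could look like this; all names and types are hypothetical:
CREATE TABLE FactOrderLine (
    DateKey        INT NOT NULL,  -- points to DimDate
    RegionKey      INT NOT NULL,  -- points to DimRegion
    FactoryKey     INT NOT NULL,  -- points to DimFactory
    MachineKey     INT NOT NULL,  -- points to DimMachine
    ProductKey     INT NOT NULL,  -- points to DimProduct
    CostConceptKey INT NOT NULL,  -- points to DimCostConcept (materials, fixed, packaging, ...)
    Quantity       DECIMAL(12, 2),
    Amount         DECIMAL(12, 2)
);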
I have an Amazon redshift table with about 400M records and 100 columns - 80 dimensions and 20 metrics.
The table is distributed by one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.
A simple aggregate query:
SELECT dim1, dim2, ..., dim60, SUM(met1), ..., SUM(met15)
FROM my_table
GROUP BY dim1, ..., dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each), it is faster to group by the dimension keys only; if you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:
with groups as (
    select dim1_id, dim2_id, ..., dim20_id, sum(met1), sum(met2)
    from my_table
    group by 1, 2, ..., 20
)
select *
from groups
join dim1_table using (dim1_id)
join dim2_table using (dim2_id)
...
join dim20_table using (dim20_id)
If you don't want to normalize your table and you like that a single row has all the pieces of information, it's fine to keep it as is, since in a columnar database unused columns won't slow the queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by the city ID and then join the city's attributes, the country region and the country tables to the city ID of each group (see the sketch after this list).
3) You can store the combination of dimension IDs, with some delimiter like "-", in a separate varchar column and use that as a sort key.
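A rough sketch of point 2, assuming hypothetical city/region/country dimension tables where city is the lowest level of the hierarchy:
with groups as (
    select city_id, sum(met1) as met1, sum(met2) as met2
    from my_table
    group by city_id
)
select *
from groups
join city_table using (city_id)       -- brings in city attributes and region_id
join region_table using (region_id)   -- brings in region attributes and country_id
join country_table using (country_id)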
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.
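For example, a minimal sketch of that nightly pre-aggregation; only a few of the columns are shown, and the table and column names follow the question:
drop table if exists my_table_summary;

create table my_table_summary
  distkey (dim1)
  sortkey (dim1, dim2)
as
select dim1, dim2, dim3,
       sum(met1) as met1,
       sum(met2) as met2
from my_table
group by dim1, dim2, dim3;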