How to decide what table partitioning strategy to use in QuestDB? - questdb

I am designing a table of metrics in QuestDB. There will be a few million rows per day, spread evenly throughout the day. The rows are around 200 bytes, all numbers and a timestamp. Reads will usually span multiple days, up to a year in edge cases.
I cannot decide whether I should partition by DAY or MONTH (or even YEAR). I understand I have to make this decision upfront since there is no way to switch from one to the other.

Writing benefits from smaller partitions, reading from larger.
Small and large are relative to the RAM of the machine. You want a partition to fit fully in RAM, so as a rule of thumb keep it up to about 10 GB for effective writing.
If your record is 200 bytes, then 2 million rows per day is up to about 400 MB of data. This gives you the freedom to use monthly partitions, which will take around 12 GB per month, if you can allocate 32-64 GB of RAM for QuestDB.
Monthly partitions will be effective for multi-day scanning queries over long periods of time, like monthly or annual spans. Daily partitions will see latency from opening many files on "cold" runs.
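As a minimal sketch of the resulting DDL (the column names here are hypothetical; the partitioning clause is the part that matters):

CREATE TABLE metrics (
    ts        TIMESTAMP,
    sensor_id INT,
    value     DOUBLE
) TIMESTAMP(ts)           -- designated timestamp column
PARTITION BY MONTH;       -- decided upfront, per the sizing discussion above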

Related

How many rows per partition are required for good performance in BigQuery?

I receive 100 rows every day from an application. Good practice in my company suggests partitioning every table by day. I don't think it is good to do this on the new table that I will create to insert a hundred rows daily. I want to partition the data by year; is that good?
How many rows per partition are required for the best performance?
It really also depends on the queries that you are going to execute on this table, that is, what kind of date filters you are going to use and which columns you join on. Refer to the answers below, which will really help you decide on this.
Answer1
Answer2
Keep in mind that the number of partitions is limited (to 4000). Therefore partitioning is great for low cardinality. Per day is perfect (about 11 years -> 4000 days).
If you have higher cardinality, customer ID for example (and I hope you have more than 4000 customers!), clustering is the solution to speed up the request.
When you partition, and cluster, your data, you create small bags of data. The less data you have to process (load, read, store in cache, ...), the faster your query will be! Of course, on only 100 rows, you won't see any difference.
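As a rough sketch of what that could look like in BigQuery DDL (dataset, table and column names are hypothetical):

CREATE TABLE mydataset.daily_import (
  event_date  DATE,
  customer_id STRING,
  amount      NUMERIC
)
PARTITION BY DATE_TRUNC(event_date, YEAR)   -- yearly partitions stay far below the 4000-partition limit
CLUSTER BY customer_id;                     -- clustering handles the high-cardinality column

With yearly partitions plus clustering, queries that filter on event_date prune partitions, and queries that filter on customer_id read fewer blocks within each partition.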

How many partitions does my Athena table have?

I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
Do I have 5 partitions or 1, being the date? What counts towards Athena's 20,000 partition limit, 5 or 1?
I would say "it is partitioned by date, with 5 partitions", because ADD PARTITION is used to add a single directory. Therefore, it would count as 5 towards the 20,000 limit.
The partition is useful if you often use WHERE DT = '2018-xx-xx' statements in your queries, but will not particularly help otherwise. (Splitting data into multiple files does help to parallelize work, but it also comes with an overhead.)
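For illustration, each partition is registered individually and only helps queries that actually filter on it (the S3 bucket and path below are hypothetical):

-- registers one more directory, counting as one more partition
ALTER TABLE tablename ADD
  PARTITION (dt = '2018-05-22') LOCATION 's3://my-bucket/data/dt=2018-05-22/';

-- partition pruning applies only because dt appears in the WHERE clause
SELECT count(*) FROM tablename WHERE dt = '2018-05-21';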

AWS Redshift Distkey and Skew

I came across a situation where I am defining the distkey as the column which is used to join the table with other tables (to avoid re-distribution). But that column is not the highest-cardinality column, so it skews the data distribution.
Example:
Transaction Table (20M rows)
------------------------------
| user_id | int |
| transaction_id | int |
| transaction_date | date |
------------------------------
Let's say most of the joins performed on this table are on user_id, but transaction_id is the higher-cardinality column, as one user can have multiple transactions.
What should be done in this situation?
Distribute the table on the transaction_id column, even though it will require re-distributing the data when joined on user_id with another table?
Distribute on user_id and let the data be skewed? In my case the skew factor is ~15, which is way higher than the AWS Redshift recommended skew factor of 4.0.
As John rightly says, you LIKELY want to lean towards improving join performance over data skew, but this is based on a ton of likely-true assumptions. I'll itemize a few here:
The distribution (disk-based) skew is on a major fact table
The other tables are also distributed on the join-on key
The joins are usually on the raw tables or group-bys are performed on the dist key
Redshift is a networked cluster and the interconnect between nodes is the lowest-bandwidth aspect of the architecture (not low bandwidth, just lower than the other aspects). Moving very large amounts of data between nodes is an anti-pattern for Redshift and should be avoided whenever possible.
Disk skew is a measure of where the data is stored around the cluster and, without query-based information, only impacts how efficiently the data is stored. The bigger impact of disk skew is execution skew - the difference in the amount of work each CPU (slice) does when executing a query. Since the first step of every query is for each slice to work on the data it "owns", disk skew leads to some amount of execution skew. How much depends on many factors, but especially on the query in question. Disk skew can lead to issues and in some cases this CAN outweigh redistribution costs. Since slice performance of Redshift is high, execution skew OFTEN isn't the #1 factor driving performance.
Now (nearly) all queries have to perform some amount of data redistribution when executing. If you do a group-by of two tables by some non-dist-key column and then join them, there will be redistribution needed to perform the join. The good news is that (hopefully) the amount of data post-group-by will be small, so the cost of redistribution will be low. The amount of data being redistributed is what matters.
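A hedged sketch of that pattern (the refunds table and the column names are hypothetical): both sides are aggregated by a non-dist-key column before the join, so only the small post-group-by result sets need to be redistributed rather than the raw tables.

SELECT a.transaction_date, a.tx_count, b.refund_count
FROM (
    SELECT transaction_date, count(*) AS tx_count
    FROM transactions GROUP BY transaction_date     -- aggregate locally on each slice first
) a
JOIN (
    SELECT transaction_date, count(*) AS refund_count
    FROM refunds GROUP BY transaction_date          -- hypothetical second table
) b ON b.transaction_date = a.transaction_date;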
The dist-key of the tables is only one way to control how much data is redistributed. Some ways to do this:
If the dimension tables are dist-style ALL then it doesn't (in basic cases) matter that your fact table is distributed by user_id - the data to be joined already exists on the nodes it needs to be on (see the DDL sketch after this list).
You can also control how much data is redistributed by reducing how much data goes into the join. Having WHERE clauses at the earliest stage in the query can do this. Denormalizing your data so that needed WHERE-clause columns appear in your fact tables can be a huge win.
In extreme cases you can make derived dist-key columns that align perfectly to user_id but also have greatly reduced disk and execution skew. This is a deeper topic than can be covered in this answer, but it can be the answer when you need max performance and redistribution and skew are in conflict.
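A minimal DDL sketch of the first option, based on the example table above (the users dimension and its columns are hypothetical):

CREATE TABLE transactions (
    user_id          INT,
    transaction_id   INT,
    transaction_date DATE
)
DISTKEY (user_id)             -- co-locates with other large tables distributed on user_id
SORTKEY (transaction_date);

CREATE TABLE users (
    user_id INT,
    segment VARCHAR(32)
)
DISTSTYLE ALL;                -- full copy on every node, so joins to it need no redistribution

With users replicated via DISTSTYLE ALL, a join on user_id can run without shipping the fact table around the cluster, regardless of how transactions is distributed.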
A quick word on "cardinality". This is a rule-of-thumb metric that a lot of Redshift documents use as a way to keep new users out of trouble and that can also be explained quickly. It's a (somewhat useful) over-simplification. Higher cardinality is not always better and in the extreme is an anti-pattern - think of a table where each row of the dist-key has a unique value, now think about doing a group-by on some other column of this table. The data skew in this example is perfect, but the performance of the group-by will suck. You want to distribute the data to speed up what work needs to be done - not to improve a metric.
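As a side note, the ~15 skew factor mentioned in the question can be inspected through Redshift's svv_table_info system view; a rough sketch, using the hypothetical table name from the DDL sketch above:

SELECT "table", diststyle, skew_rows   -- skew_rows: ratio of rows on the fullest slice to the emptiest slice
FROM svv_table_info
WHERE "table" = 'transactions';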

Database Architecture Question - Dimension Creation for Non-Atomic Data

I am looking at an Excel file that will be imported into Power BI. I am not allowed to have access to the database itself due to employment reasons, so they gave me an Excel file to work with that I will then upload into Power BI.
On one of the fact "tables", they have data that looks like this
s-ID success% late% on-time% schedule
1 10% 2% 5% calculus-1;algebra-2
1 5% 10% 27% Calculus-1
1 5% 3% 80% algebra-2
2 33% 50% 3% null
5 5% 34% 8% English-1;English-10;theatre;art
I realize the numbers do not make any sense, but that's basically how the data is structured. There are also roughly 100,000 records in this fact "table".
I have a dimension for courses, but I'm not sure how to handle this schedule column. If I split the column vertically, the measure columns will be double counted.
How can I model this and put the schedule into a dimension intelligently in Power-BI?
My goal is to model the data as follows:
Be able to split the schedule into separate rows, but simultaneously not double count all of the values.
I also want to show that the s-ID records have the student taking a class that has both calculus-1 and algebra-2 together.
Sometimes the professors schedule 2 classes together into 1 class whenever they are talking about topics that apply to both. There could be 2 classes together, there could be as many as 8 classes together or anything in between.
Is this a scenario where a bridge table would be appropriate?
You can use a bridge table. In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table's grain. But there are a number of situations in which a dimension is legitimately multivalued - as in your example, where a student can enroll in many courses.
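A rough sketch of what the bridge could look like, expressed as SQL-style table definitions (all table and column names are hypothetical; in Power BI these would simply be three related tables built in Power Query):

-- fact keeps its original grain; the schedule string is replaced by a group key
CREATE TABLE fact_results (
    s_id              INT,
    schedule_group_id INT,          -- surrogate key for the course combination
    success_pct       DECIMAL(5,2),
    late_pct          DECIMAL(5,2),
    on_time_pct       DECIMAL(5,2)
);

-- bridge: one row per course in each schedule group
CREATE TABLE bridge_schedule_course (
    schedule_group_id INT,          -- e.g. the group for 'calculus-1;algebra-2'
    course_id         INT
);

CREATE TABLE dim_course (
    course_id   INT,
    course_name VARCHAR(50)         -- 'calculus-1', 'algebra-2', ...
);

Because the fact rows are never split, the measures are not duplicated; filtering by course flows from dim_course through the bridge to the fact. If a measure ever needs to be allocated across the courses in a group, the classic approach is to add a weighting factor column to the bridge.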

Does "limit" reduce the amount of scanned data on AWS Athena?

I have S3 with compressed JSON data partitioned by year/month/day.
I was thinking that it might reduce the amount of scanned data if I construct the query with filtering, looking something like this:
...
AND year = 2020
AND month = 10
AND day >= 1 "
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?
Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
It's also recommended to store the data in a column based format, like Parquet and additionally compress the data. If you store data like that you can optimize queries by just selecting columns you need (there is a difference between select * and select col1,col2,.. in this case).
ORDER BY definitely doesn't limit the data that is scanned, as you need to scan all of the columns in the ORDER BY clause to be able to order them. Since you have JSON as the underlying storage, it most likely reads all of the data.
LIMIT will potentially reduce the amount of data that is read, it depends on the overall size of the data - if limit is way smaller than the overall count of rows it will help.
In general I recommend testing queries in the Athena interface in AWS - it will tell you the amount of scanned data after a successful execution. I tested on one of my partitioned tables (based on compressed Parquet):
partition columns in WHERE clause reduces the amount of scanned data
LIMIT further reduces the amount of scanned data in some cases
ORDER BY leads to reading all the partitions again, because otherwise the data can't be sorted
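As a rough sketch of the difference (the table name and the specific day are hypothetical; the year/month/day partition columns follow the question), filtering directly on the partition columns and dropping the ORDER BY keeps the scan to a single partition, provided you already know which partition holds the latest data:

SELECT *
FROM my_table
WHERE year = 2020
  AND month = 10
  AND day = 26        -- pick the target partition directly instead of ORDER BY ... LIMIT 1
LIMIT 1;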