I know the difference between scanning a table with filters and querying a table by its sort key.
I store time-series data in DynamoDB tables; the primary key is formed by device_id and timestamp (partition and sort keys, respectively).
I have a table for each month.
I would like to retrieve all results of the past week.
How bad is scanning the current month's table to retrieve the past week's results? I'm thinking it's not that bad, since a week is roughly a quarter of a month, so about a quarter of the table consists of relevant results.
Given some smart indexing, could retrieving O(n) table results be done in o(n) (little o) time?
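For reference, the per-device alternative I'm comparing against is roughly this (a boto3 sketch; the table name is hypothetical and I'm assuming the sort key stores epoch milliseconds). Since a Query needs an equality condition on the partition key, fetching the past week for every device would still mean one Query per device_id, or a Scan:

import time
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical per-month table name
table = boto3.resource("dynamodb").Table("events_2018_08")

# Assuming the sort key stores epoch milliseconds
week_ago_ms = int((time.time() - 7 * 24 * 3600) * 1000)

response = table.query(
    KeyConditionExpression=Key("device_id").eq("device-123")
    & Key("timestamp").gte(week_ago_ms)
)
items = response["Items"]  # follow LastEvaluatedKey to paginate larger result sets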
I'm trying to figure out the best way to build a relationship from a table that has records in a daily format (one record represents a single date) to a table that contains records in a date-range format (one record has a start date and an end date, consequently representing a period or range of dates).
Since my actual datafiles contain work-related information, I created 2 demo tables that contain dummy data that reflects the date columns in question.
Here is my DailyDate table
Here is my DateRanges table
Here is the current model view
I would like to build a relationship between the tables so that, with two tables/matrices in the Report view (one showing the Daily Date data and the other showing the Ranged Date data), selecting a record in the Daily table would trigger Power BI's highlight functionality to filter the Ranged table down to only the date ranges containing the selected date, and vice versa (if possible).
For example, referencing this screenshot, if I were to select index 0 in the 'Daily Date Data' table, the 'Ranged Date Data' table should be filtered to show only the record with index 0. If I were to select index 2 (01/03/2022) in the Daily Date Data table, then the Ranged Date Data table should be filtered to show only indices 0 and 1.
In the model view, when trying to build this relationship, I can create a relationship from DailyDates.Date to DateRanges.StartDate and then from DailyDates.Date to DateRanges.EndDate; however, only a single relationship can be active so the highlight and slicer functionality will not give me the results I'm looking for.
As you can see from this demo, the datasets are small; however, my actual datasets contain around 50 million records in the Daily table and 10+ million records in the Ranged table, so I'm hoping there is an efficient way to get this functionality without too much of a memory load.
Any advice on how to accomplish this would be greatly appreciated.
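To illustrate the filtering logic I'm after, here is a rough DAX sketch using the column names above (I realize a measure like this, used as a visual-level filter on the Ranged table, would be a workaround rather than a real relationship):

Ranges Containing Selected Date =
VAR SelectedDate = SELECTEDVALUE ( DailyDates[Date] )   // assumes a single date is selected
RETURN
    CALCULATE (
        COUNTROWS ( DateRanges ),
        DateRanges[StartDate] <= SelectedDate,
        DateRanges[EndDate] >= SelectedDate
    )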
I have two tables: Table 1 has daily data and Table 2 has weekly data. I've created a start-of-week column in Table 1 to roll it up weekly. The data is as shown below:
I want to create a table where I can divide these two measures. Both measures are weekly counts, and I want to present the result in a line/bar chart with time on the x-axis. Right now, when I use the date from Table 1 on the axis, Measure 2 shows the overall count because Table 2's date is not present, and vice versa. I was thinking of creating a new calendar table, but I'm unable to get these measure values into that table.
I tried creating a custom calendar table but am not getting the desired result: I get correct values from Table 2 but no values from Table 1. I suspect the problem is that Table 1 has duplicate date values.
Table 1 actual data before consolidation (the measure is the count of case numbers):
I think you may need a slight shift in approach here.
Rather than looking for a way to create a third table from the two existing tables, create relationships between them that rationally describe how you want the tables to work together, and then write the DAX on top of that.
So, in your case, you describe one table having daily data, and the other having weekly. The intermediary calendar table would be a daily calendar, where each day (row) knows the end of week date.
You would then create a relationship from your daily table to the calendar table based on day, and a second relationship from the calendar table to your weekly table based on end-of-week date (assuming bi-directional filtering).
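A minimal sketch of such a calendar table as a calculated table (assuming weeks end on Sunday; adjust the WEEKDAY offset to your own convention, and CALENDARAUTO() could be replaced with an explicit CALENDAR() range):

calendarTable =
ADDCOLUMNS (
    CALENDARAUTO (),   // spans all dates referenced in the model
    "End of Week Date", [Date] + 7 - WEEKDAY ( [Date], 2 )   // next (or same) Sunday
)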
You could then create a measure:
myRatio = DIVIDE(SUM(DailyTable[value]), SUM(weeklyTable[value]))
In your chart, you can then show the daily value as a fraction of the weekly value by using the 'Day' field from the calendar table, or show the ratio of the full week's total from the daily table to the weekly total in the weekly table by using the end-of-week date on the axis.
If what you truly need is a third table, you could use the SUMMARIZE() function over this three-table set to build it using the same principle.
myNewTable =
SUMMARIZE(calendarTable
,calendarTable[End of Week Date]
,"My Ratio" //the name of the field you want to create
,[My Ratio] //the formula to describe what goes in the field
)
I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been struggling to find the most cost-effective way to apply these transforms to a small selection of the data, typically only the last X days from the current max date in the staging table. For example, we need to calculate the max available date in the staging data and then retrieve all rows within the past 3 days of that date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third-party APIs of varying quality and reliability).
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF, and then discarding rows more than 3 days older than this 'max date'. This proved to be very time-consuming and cost-inefficient.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (
  SELECT _PARTITIONTIME AS pt
  FROM `{project}.{dataset}.{table}`
  GROUP BY _PARTITIONTIME
  ORDER BY _PARTITIONTIME DESC
  LIMIT 1
)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But when previewing the GB processed and estimated cost of the query, it seems very inefficient and expensive.
The next thing we tried was hard-coding the date, which for some reason is a lot cheaper/quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table, and update the hard coded date in the view SQL via the Python SDK each time the staging data successfully completes (https://cloud.google.com/bigquery/docs/managing-views).
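For illustration, the view update we have in mind would look something like this (a sketch with the google-cloud-bigquery client; the names are placeholders, and in practice the cutoff would come from the max date computed in the pipeline rather than today's date):

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder cutoff; the real pipeline would derive this from the staging table's max date
cutoff = (date.today() - timedelta(days=3)).isoformat()

view = client.get_table("project.dataset.staging_view")
view.view_query = (
    "SELECT * "
    "FROM `project.dataset.staging_table` "
    f"WHERE _PARTITIONTIME >= TIMESTAMP('{cutoff}')"
)
client.update_table(view, ["view_query"])  # rewrite only the view's SQL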
It feels like we are potentially missing a much easier/more efficient solution to this problem. So I wanted to ask:
Is it more cost-effective to carry out this initial date filtering in Dataprep or in BigQuery?
What is the most cost-effective way of filtering the data in the chosen product?
Are you familiar with the MERGE statement in standard SQL and the recently released clustering feature? MERGE can actually merge your data, and you can further customize it to read only certain partitions.
Example from the manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
Hint: you can partition by NULL and leverage only the 'clustering level'.
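To illustrate that hint (my reading of it: partition on an always-NULL DATE column so every row lands in the single __NULL__ partition and only the clustering matters; the table and clustering columns here are placeholders):

CREATE TABLE dataset.staging_clustered
PARTITION BY fake_partition_date
CLUSTER BY device_id, event_date AS
SELECT
  CAST(NULL AS DATE) AS fake_partition_date,  -- always NULL, so all rows go to the __NULL__ partition
  *
FROM dataset.staging_table;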
In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
    date_trunc('day', ti) AS date,
    count(DISTINCT deviceID) AS COUNT
FROM live_events
WHERE brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY 1
Primary key
brandID
Interleaved Sort Keys
We have set the following columns as interleaved sort keys:
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here's some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the queries you run, use a compound sort key of (brandId, ti) or (ti, brandId). It will be much more efficient.
A SORTKEY is typically a date column, since dates often appear in WHERE clauses and the table stays sorted automatically if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.
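Putting the SORTKEY and DISTKEY suggestions together, a rebuild could look roughly like this (the column types are guesses; take the real definitions from your existing DDL):

CREATE TABLE live_events_new (
    deviceid   VARCHAR(64),
    brandid    BIGINT,
    ti         TIMESTAMP,
    event_name VARCHAR(128)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (brandid, ti);

INSERT INTO live_events_new
SELECT deviceid, brandid, ti, event_name
FROM live_events;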
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.
Sometimes the issue could be due to locks acquired by other processes. See: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
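A quick way to check is to look at the lock and transaction system views, for example:

-- Current locks and the sessions holding them
SELECT * FROM stv_locks;

-- Open transactions that may be blocking your query
SELECT * FROM svv_transactions;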
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury, you should store the date in the format you need in an additional column.
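For example, something along these lines (assuming you can modify the table; in Redshift, TRUNC() on a timestamp returns its date), after which the query can group on the new column directly instead of calling date_trunc('day', ti):

ALTER TABLE live_events ADD COLUMN ti_date DATE;

-- Backfill once; for ongoing loads, populate ti_date at load time instead
UPDATE live_events SET ti_date = TRUNC(ti);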
Does Redshift efficiently (i.e. with something like a binary search) find the blocks of a table that is sorted on a column A for a query with a condition A = <val>?
As an example, let there be a table T with ~500m rows and ~50 fields, distributed and sorted on field A. Field A has high cardinality: there are ~4.5m distinct A values, each with the same number of rows in T (~100 rows per value).
Assume a Redshift cluster with a single XL node.
Field A is not compressed. All other fields have some form of compression, as suggested by ANALYZE COMPRESSION. A compression ratio of 1:20 was reported compared to an uncompressed table.
Given a trivial query:
select avg(B),avg(C) from
(select B,C from T where A = <val>)
After VACUUM and ANALYZE the following explain plan is given:
XN Aggregate (cost=1.73..1.73 rows=1 width=8)
-> XN Seq Scan on T (cost=0.00..1.23 rows=99 width=8)
Filter: (A = <val>::numeric)
This query takes 39 seconds to complete.
The main question is: Is this the expected behavior of redshift?
According to the documentation at Choosing the best sortkey:
"If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range."
In Choosing sort keys:
"Another optimization that depends on sorted data is the efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned. For more information about these optimizations, see Choosing distribution keys."
Secondary questions:
What is the complexity of the aforementioned skipping scan on a sort key? Is it linear (O(n)) or some variant of binary search (O(log n))?
If a key is sorted - is skipping the only optimization available?
What would this "skipping" optimization look like in the explain plan?
Is the above explain the best one possible for this query?
What is the fastest result redshift can be expected to provide given this scenario?
Does vanilla ParAccel have different behavior in this use case?
This question is answered on the Amazon forum: https://forums.aws.amazon.com/thread.jspa?threadID=137610