I observe a linear increase in query time with the number of rows in the table when I run a LATEST BY symbol query in QuestDB, as if it does a full scan to find the values. Here is my table:
CREATE TABLE metric(
ObjectType SYMBOL capacity 2 cache,
ObjectId SYMBOL capacity 20000 cache,
Group SYMBOL capacity 4000 cache,
Region SYMBOL capacity 20 cache,
CC SYMBOL capacity 50 cache,
value DOUBLE,
timestamp TIMESTAMP
)
timestamp(timestamp)
PARTITION BY DAY;
And the query is:
select value from metric
LATEST BY ObjectType
where ObjectType = 'queue'
I'd expect constant or, at worst, logarithmic time growth for it rather than linear.
An index on the symbol column is needed to avoid the full table scan, so that LATEST BY query time does not grow with the size of the table.
Try
ALTER TABLE metric ALTER COLUMN ObjectType ADD INDEX;
or create the table with the index from the start:
CREATE TABLE metric(
ObjectType SYMBOL capacity 2 cache index,
ObjectId SYMBOL capacity 20000 cache,
Group SYMBOL capacity 4000 cache,
Region SYMBOL capacity 20 cache,
CC SYMBOL capacity 50 cache,
value DOUBLE,
timestamp TIMESTAMP
)
timestamp(timestamp)
PARTITION BY DAY;
A SYMBOL column by itself is not indexed; it merely means that repeated string values are stored as integers in each row, with a separate lookup table (the symbol dictionary) translating from integer to string.
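After adding the index you can check that it took effect with the table_columns() meta function (a quick sketch; the exact output columns depend on your QuestDB version):

select * from table_columns('metric');

The row for ObjectType should now be reported as indexed.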
I ran this:
EXPLAIN select id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id from user_profile;
and I see this:
WindowAgg (cost=0.00..133833424.40 rows=30901176 width=36)
-> Seq Scan on user_profile (cost=0.00..133369906.76 rows=30901176 width=28)
What does this query plan mean?
The query plan is the execution plan that the PostgreSQL planner (Amazon Redshift is based on PostgreSQL) has generated for your SQL statement.
The first node is a window aggregation (WindowAgg) over the data as you're using the OVER window function to calculate a row number.
The second node is a sequential scan (Seq Scan) on the user_profile table, as you're doing a full select of the table without any filtering.
A sequential scan reads the entire table as stored on disk since your query requires a full traversal of it. Even if there were a multi-column index on id & birth_date, the query engine would pretty much always go for a sequential scan here because you need every row (depending on the random_page_cost & enable_seqscan parameters in PostgreSQL).
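If you were running this on plain PostgreSQL rather than Redshift and wanted to see how those two parameters sway the planner, a quick session-level experiment could look like the sketch below (it assumes an index covering id & birth_date already exists, and it only changes settings for the current session):

-- PostgreSQL only: discourage sequential scans and make random I/O look cheap
SET enable_seqscan = off;
SET random_page_cost = 1.1;

EXPLAIN
SELECT id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id
FROM user_profile;

-- put the planner settings back to their defaults
RESET enable_seqscan;
RESET random_page_cost;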
The cost number is in arbitrary units, but conventionally corresponds to the number of disk page fetches; it's split into 2 values separated by the ".." delimiter.
The first value shows the startup cost - this is the estimated cost to return the first row. The second value shows the total cost - this is the estimated cost to return all rows.
For example, for the Seq Scan, the startup cost is 0 and the total cost is estimated to be 133369906.76.
For sequential scans, the startup cost is usually 0 - there's nothing to prepare, so the node can start returning data right away. The total cost of a node also includes the cost of all its child nodes - in this case, the final total cost of both operations is 133833424.40, which is the sum of the scan cost and the aggregation's own cost.
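As a quick check on that arithmetic: 133833424.40 - 133369906.76 = 463517.64, so the window aggregation step itself is estimated to add roughly 463,518 cost units on top of the scan.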
The rows value shows the estimated number of rows that will be returned. In this case, both operations have the same value because the aggregation applies to all rows & no filtering is carried out that would reduce the number of final rows.
The width value shows the estimated size in bytes of each returned row, i.e. each row will most likely be 28 bytes long before the aggregation and 36 bytes after it (the extra 8 bytes being the bigint produced by ROW_NUMBER()).
Putting that all together, you could read the query plan as such:
Sequential Scan on table user_profile
- will most likely start returning rows immediately
- estimated disk page fetch count of 133369906.76
- estimated 30,901,176 rows to be returned
- estimated row width of 28 bytes
Window Aggregation on data from the above operation
- will most likely start returning rows immediately
- estimated disk page fetch count of 133833424.40 (cumulative, including the scan)
- estimated 30,901,176 rows to be returned
- estimated row width of 36 bytes
When I try to create a table with an index
CREATE TABLE NEW AS (SELECT DISTINCT * FROM OLD),
index(RIC capacity 1000000)
PARTITION BY MONTH
I get back the error
io.questdb.cairo.CairoException: [2] No space left on device [need=99472113664]
I have 800+ GB free on the filesystem and table OLD is not particularly big, a few GB on disk. Any idea why I get the error?
Index capacity is how many rows you expect per symbol value on average. If you specify a capacity of 1 million, QuestDB will allocate 7MB of data per symbol value. If you then try to insert 150k distinct symbol values, the table will try to allocate about 1TB of space.
If you have few distinct symbol values and many rows for each of them, increase the index capacity. If you have many distinct symbol values, increase the symbol capacity instead.
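For example, if you expect on the order of a few hundred rows per RIC value, a much smaller index capacity keeps the up-front allocation tiny. A sketch reusing the statement from the question (256 is just an assumed rows-per-symbol estimate, adjust it to your data):

CREATE TABLE NEW AS (SELECT DISTINCT * FROM OLD),
index(RIC capacity 256)
PARTITION BY MONTH;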
I have a 100GB data file with sensor readings spanning a few weeks. The timestamps are not in strict chronological order and I'd like to bulk load the data into QuestDB. The order is not completely random; some rows arrive up to three minutes late.
Is there an efficient way to do bulk loading like this and ensure that the data is ordered chronologically at the same time?
The most efficient way to do this is a 3-step process:
Import the unordered dataset; you can do this via curl:
curl -F data=@unordered-data.csv 'http://localhost:9000/imp'
Create a table with the schema of the imported data and apply a partitioning strategy. The timestamp column may be cast as a timestamp if auto-detection of the timestamp failed (the WHERE 1 != 1 below copies the schema without copying any rows):
CREATE TABLE ordered AS (
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv' WHERE 1 != 1
) timestamp(timestamp) PARTITION BY DAY;
Insert the unordered records into the partitioned table, providing a lag and batch size:
INSERT batch 100000 lag 180000000 INTO ordered
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv';
To confirm that the table is ordered, the isOrdered() function may be used:
select isOrdered(timestamp) from ordered
isOrdered
true
There is more info on loading data in this way in the CSV import documentation.
lag can be about 3 minutes in your case (180000000 microseconds, as in the example above); it's the expected lateness of records
batch is the number of records to batch process at one time
I know the differences between Scanning a table with some filters and Querying a table by its sort-key.
I store time-series data in DynamoDB tables; the primary key is formed by device_id and timestamp (partition and sort keys respectively).
I have a table for each month.
I would like to retrieve all results of the past week.
How bad is scanning the current month's table and retrieving all results of the past week? I'm thinking it's not that bad, since roughly a quarter of the table is relevant (a week ≈ 1/4 of a month).
Given some smart indexing, could retrieving O(n) table results be done in o(n) (little o)?
Does Redshift efficiently (i.e. with something like a binary search) find the blocks of a table that is sorted on a column A for a query with an equality condition on A?
As an example, let there be a table T with ~500m rows and ~50 fields, distributed and sorted on field A. Field A has high cardinality, so there are ~4.5m distinct A values, each appearing in roughly the same number of rows in T: ~100 rows per value.
Assume a redshift cluster with a single XL node.
Field A is not compressed. All other fields have some form of compression, as suggested by ANALYZE COMPRESSION. A compression ratio of 1:20 was achieved compared to an uncompressed table.
Given a trivial query:
select avg(B),avg(C) from
(select B,C from T where A = <val>)
After VACUUM and ANALYZE the following explain plan is given:
XN Aggregate (cost=1.73..1.73 rows=1 width=8)
-> XN Seq Scan on T (cost=0.00..1.23 rows=99 width=8)
Filter: (A = <val>::numeric)
This query takes 39 seconds to complete.
The main question is: Is this the expected behavior of redshift?
According to the documentation at Choosing the best sortkey:
"If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range."
In Choosing sort keys:
"Another optimization that depends on sorted data is the efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned. For more information about these optimizations, see Choosing distribution keys."
Secondary questions:
What is the complexity of the aforementioned skipping scan on a sort key? Is it linear (O(n)) or some variant of binary search (O(log n))?
If a key is sorted - is skipping the only optimization available?
What would this "skipping" optimization look like in the explain plan?
Is the above explain the best one possible for this query?
What is the fastest result redshift can be expected to provide given this scenario?
Does vanilla ParAccel have different behavior in this use case?
This question is answered on the Amazon forums: https://forums.aws.amazon.com/thread.jspa?threadID=137610