Is there a good way to import unsorted data into QuestDB in bulk? - questdb

I have a 100GB data file with sensor readings spanning a few weeks. The timestamps are not in strict chronological order and I'd like to bulk load the data into QuestDB. The order is not completely random, but some rows arrive up to three minutes late.
Is there an efficient way to do bulk loading like this and ensure that the data is ordered chronologically at the same time?

The most efficient way to do this is a three-step process:
Import the unordered dataset; you can do this via curl:
curl -F data=@unordered-data.csv 'http://localhost:9000/imp'
Create a table with the schema of the imported data and apply a partitioning strategy. The timestamp column may be cast as a timestamp if auto-detection of the timestamp failed:
CREATE TABLE ordered AS (
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv' WHERE 1 != 1
) timestamp(timestamp) PARTITION BY DAY;
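If the plain cast fails because the timestamps are stored in a non-standard string format, to_timestamp() with an explicit pattern can be used instead; the format string below is an assumption about the file and should be adjusted to match the actual data:
CREATE TABLE ordered AS (
SELECT
  -- format string is an assumption; change it to match the CSV contents
  to_timestamp(timestamp, 'yyyy-MM-dd HH:mm:ss') timestamp,
  col1,
  col2
FROM 'unordered-data.csv' WHERE 1 != 1
) timestamp(timestamp) PARTITION BY DAY;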
Insert the unordered records into the partitioned table and provide a lag and batch size:
INSERT batch 100000 lag 180000000 INTO ordered
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv';
To confirm that the table is ordered, the isOrdered() function may be used:
select isOrdered(timestamp) from ordered
isOrdered
---------
true
There is more info on loading data in this way in the CSV import documentation.
lag can be about 3 minutes in your case; it is the expected lateness of records.
batch is the number of records to process in one batch.
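For reference, lag is given in microseconds, so the 180000000 above corresponds exactly to the three minutes of lateness from the question. A sketch with a little headroom (the five-minute value is just an assumption, tune it to your data):
-- 3 minutes = 3 * 60 * 1000000 = 180000000 microseconds (value used above)
-- 5 minutes = 300000000 microseconds, leaving headroom for stragglers
INSERT batch 100000 lag 300000000 INTO ordered
SELECT
  cast(timestamp AS timestamp) timestamp,
  col1,
  col2
FROM 'unordered-data.csv';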

Related

How to backfill partitioned data in BigQuery?

I am trying to backfill data from GCP billing export table to another table say T1.
Both tables are partitioned.
The scheduled query below runs every day to get yesterday's data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data, say for 15th May. How do I do that?
I tried the backfill feature with the below query, expecting the backfill utility to take the past date (i.e. May 15th) as the value of the @run_date parameter, but that didn't help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = @run_date
The data is pulled for 15th May from the source table (gcp_billing_export_v1) but is populated against the current date in the destination table, i.e. May 15th data is populated against June 22nd in the destination table T1. Where am I going wrong?
Any guidance?
Looks like you're using ingestion partitioning.
You would need to create a new table with the partitioning you want (i.e. on EventDate) and populate that new table with historical and new daily data, as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table
As @Lemon already pointed out, you're using ingestion-time partitioned tables (both source and destination), so you need to understand how they work. Ingestion-time partitioned tables are different from regular partitioned tables.
From the documentation:
When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudo-column named _PARTITIONTIME. The value of this column is the ingestion time for each row.
Since you are using SELECT * FROM gcp_billing_export_v1, you are getting all the data but without the _PARTITIONTIME column. And when you save that result into the destination table, the _PARTITIONTIME column is set according to the destination table's own ingestion time.
Thus you have old data with the current date in _PARTITIONTIME.
To avoid this, your destination table needs to be either a normal table or a regular partitioned table.
You also need an extra column to hold the datetime value from the source's _PARTITIONTIME column; you can create a regular partition on this new column.
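A minimal sketch of creating such a destination table, assuming it can be (re)created from scratch; my_dataset is a placeholder for the actual dataset name:
-- empty copy of the source schema plus the new ingestionTime column,
-- partitioned on a regular column instead of ingestion time
CREATE TABLE my_dataset.T1
PARTITION BY DATE(ingestionTime) AS
SELECT *, _PARTITIONTIME AS ingestionTime
FROM my_dataset.gcp_billing_export_v1
WHERE FALSE;  -- schema only, no rows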
Then, to get _PARTITIONTIME into your result set, you need to mention the column explicitly in your query.
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = @run_date
The above query will return all the data from the gcp_billing_export_v1 table with one extra column, ingestionTime.
Now you can backfill the data for 15th May and save it to the new table.
You can also tweak the query below to achieve the same result:
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(@run_date, INTERVAL -1 DAY)
It will run daily as per your need. Now if you want to pull data for 15th May, you have to schedule the backfill for 16th May (as per the WHERE clause).

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data(stored as parquet files in partitions) in S3.
One of the things we are interested to find out is how adding additional columns to the WHERE clause of a query affects the query run time.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date').
All the queries below return the same volume, 10 million rows. Would they all take the same time, or does the run time drop when we add additional columns to the WHERE clause (since Parquet is a columnar format)?
I tried to test this, but the results were not consistent, as there was some queuing time as well, I guess.
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min/max statistics and bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
       type,
       subtype,
       dt
  from source_table  -- the original omitted the source; the table name here is a placeholder
distribute by dt
sort by type, subtype
then these indexes can work efficiently, because data with the same type and subtype will be written into the same pages, and data pages will be selected using the indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Also switch on predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html
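The exact property names are listed on that page; in stock Hive the switches look roughly like this (they are usually already enabled by default, so this is just a sanity check):
-- push filters down into the storage layer so Parquet row groups and pages
-- that cannot satisfy the WHERE clause are skipped
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;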

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in parquet format on s3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (small one) which maps timeid to date.
When the smaller table is also partitioned on timeid and we join the two on their partition key (timeid) and put the date into the WHERE clause, only those records are scanned from the large table whose timeids correspond to that date. The entire table is not scanned.
However, if the smaller table is not partitioned on timeid, a full data scan takes place even in the presence of the condition on the date column.
Is there a way to avoid the full data scan even when the large partitioned table is joined with an unpartitioned small table? This matters because the small table contains only one record per timeid, and it may not be reasonable to create a separate file for each.
That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large_table
JOIN small_table
  ON large_table.timeid = small_table.timeid
WHERE small_table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large_table
WHERE large_table.timeid IN
  (SELECT timeid FROM small_table
   WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.
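If the subquery form still scans everything (Athena may not be able to prune partitions from values that are only known at run time), a two-step fallback is to run the small query first and inline its results as literals; the timeid values below are placeholders:
-- step 1: look up the partition keys that match the date filter
SELECT timeid FROM small_table WHERE date > '2017-08-03';

-- step 2: inline the returned keys as literals so the large table's
-- partitions can be pruned (values here are placeholders)
SELECT *
FROM large_table
WHERE timeid IN (20170804, 20170805, 20170806);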

Should I use a column both as distkey and sortkey

I have a table in Redshift containing a billion records (log file entries). It has a timestamp column ts which is both the distkey and the sortkey. The following query:
select ts from apilogs where date(ts) = '2016-09-08' limit 10;
runs super fast when I query for an old date, but not for the latest date! Not sure why. Any help is appreciated.
How I load logs: I loaded all the old log files into this table in one shot, while incremental log files are loaded hourly.
When I checked the detailed plan in the AWS console, I can see that the slow query is scanning all billion rows, while the query that takes a few milliseconds scans only a few thousand rows (i.e. the rows corresponding to that date).
So the question is: why is it scanning the whole table for the latest timestamp?
Dist key and sort key can be on the same column. No Problem!
Your latest data load in your log table, was it sorted according to the sort key? If not, you will have to run vacuum on your log table, so that your sort key column is sorted in that order and redshift does not have to scan unnecessary rows.
Run the below query to check if you have any unsorted region in your table.
select trim(pgdb.datname) as Database,
       trim(a.name) as Table,
       ((b.mbytes/part.total::decimal)*100)::decimal(5,2) as pct_of_total,
       b.mbytes,
       b.unsorted_mbytes,
       (unsorted_mbytes/mbytes::decimal)*100 as unsorted_pct
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes
      from stv_blocklist group by tbl) b on a.id = b.tbl
join (select sum(capacity) as total
      from stv_partitions where part_begin = 0) as part on 1 = 1
where a.slice = 0 and a.name in ('apilogs')
order by 3 desc, db_id, name;
If you have an unsorted region, run:
Vacuum apilogs to 100 percent
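A lighter-weight check (not from the original answer) is the SVV_TABLE_INFO system view, whose unsorted column reports the percentage of unsorted rows directly:
-- unsorted is the percentage of rows sitting in the unsorted region;
-- a high value for apilogs means a VACUUM is due
SELECT "table", tbl_rows, unsorted
FROM svv_table_info
WHERE "table" = 'apilogs';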
It seems like you haven't run vacuum on your table after you added the rows for the latest timestamp.
Here's the part, most relevant to your use case, from the Redshift documentation:
When data is initially loaded into a table that has a sort key, the data is sorted according to the SORTKEY specification in the CREATE TABLE statement. However, when you update the table, using COPY, INSERT, or UPDATE statements, new rows are stored in a separate unsorted region on disk, then sorted on demand for queries as required. If large numbers of rows remain unsorted on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. The VACUUM command merges new rows with existing sorted rows, so range-restricted scans are more efficient and the execution engine doesn't need to sort rows on demand during query execution.
P.S. You shouldn't be worried about your distribution key here, as it comes into the picture only during joins.

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets this criteria, so the result set is just one row. However, if I run:
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:
Each row that will be changed must be first marked for deletion
Then a new version of the data must be written for each column in the table
If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
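A rough sketch of that deep-copy approach, reusing the filter and value from the question; note that CREATE TABLE AS does not carry over sort/dist keys or compression settings, so a production version would spell those out:
-- build the "updated" copy in one pass, applying the change in the SELECT
CREATE TABLE sometable_new AS
SELECT
  col1,
  col2,
  -- ...plus any remaining columns of the table
  CASE WHEN col1 = 'a value of col1' AND col2 = 12 THEN 20 ELSE col3 END AS col3
FROM sometable;

-- swap the tables once the copy has been verified
DROP TABLE sometable;
ALTER TABLE sometable_new RENAME TO sometable;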
Actually, I don't think Redshift is designed for bulk updates; Redshift is designed for OLAP rather than OLTP, so update operations are inefficient by nature.
In this use case, I would suggest doing an INSERT instead of an UPDATE and adding an extra TIMESTAMP column; then, when you do analysis on Redshift, you'll need extra logic to pick the latest TIMESTAMP for each row to eliminate possibly duplicated entries.
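A sketch of that append-only pattern; updated_at is an assumed version column, and (col1, col2) is assumed to identify a logical row:
-- append a new version of the row instead of updating it in place
INSERT INTO sometable (col1, col2, col3, updated_at)
VALUES ('a value of col1', 12, 20, GETDATE());

-- at analysis time, keep only the newest version of each logical row
SELECT col1, col2, col3
FROM (
  SELECT col1, col2, col3,
         ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY updated_at DESC) AS rn
  FROM sometable
) t
WHERE rn = 1;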