How to upload a CSV file to QuestDB with different time precision?

I have large CSVs with trade data in them and I want to upload them to QuestDB. One of the columns is the transaction timestamp, and I want it to be the table's designated timestamp. The transaction time in the CSV sometimes has seconds precision and sometimes microseconds, e.g.
Transaction Timestamp
2021-01-03 12:45:56.234567
2021-01-03 12:45:57
Is there any way to upload this CSV without rewriting the rows? I found how to set a column pattern on upload here: https://questdb.io/docs/reference/api/rest, but it seems impossible to supply two slightly different patterns, or to make part of a pattern optional (the microseconds, in my case, are optional).
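For reference, a single-pattern import through the REST endpoint mentioned above might look roughly like the sketch below. This assumes a local QuestDB instance on port 9000; the table name, file name and pattern string are illustrative, and it still only covers the microsecond case, which is exactly the limitation being asked about.

import json
import requests

# Column schema passed alongside the CSV; only one timestamp pattern can be
# given here, so rows with plain seconds precision would not match it.
schema = json.dumps([
    {"name": "Transaction Timestamp", "type": "TIMESTAMP",
     "pattern": "yyyy-MM-dd HH:mm:ss.SSSUUU"},
])

with open("trades.csv", "rb") as csv_file:
    resp = requests.post(
        "http://localhost:9000/imp",                      # QuestDB CSV import endpoint
        params={"name": "trades",                         # target table (placeholder)
                "timestamp": "Transaction Timestamp"},    # designated timestamp column
        files={"schema": (None, schema),
               "data": ("trades.csv", csv_file)},
    )
print(resp.text)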

Related

How can I speed up this Athena Query?

I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.
Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 Snappy-compressed Parquet files per partition, each between 1 and 10 MB in size. When I run the following query:
select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'
It scans 700 MB but takes over 3 minutes to display the results in Athena. I feel that we have already optimized the file format and partitioning for this data, so I am unsure how else we can improve performance if we are just trying to select this data and display it in a tool like QuickSight.
The select * performance was impacted by the number of files that needed to be scanned, which were all relatively small. Repartitioning and removing the hour partition improved both runtime (14% reduction) and data scanned (26% reduction), because Snappy compression gets better gains on larger files.
Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
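One way to do that repartitioning is a CTAS statement; below is a hedged sketch using boto3, where the new table name, database and bucket locations are placeholders rather than values from the question.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS: rewrite the data as larger Snappy Parquet files partitioned by day only.
ctas = """
CREATE TABLE tableA_daily
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/tableA_daily/',
    partitioned_by = ARRAY['year', 'month', 'day']
)
AS SELECT stringA, stringB, timestampA, timestampB, bigintA, bigintB,
          year, month, day
FROM tableA
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)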

Best partitioning method for multiple devices and timestamps

In my organisation we have multiple devices sending data every second. The data is processed and partitioned in AWS S3 like this: /year=YYYY/month=MM/day=DD/file.csv.
Using AWS Athena, we run queries like this: SELECT col1, col2, coln FROM data WHERE year = 'YYYY' AND month = 'MM' AND day = 'dd' AND device_id = 123 to retrieve data from one device for some time window within a day. Sometimes we also need to get data from multiple devices (device_id IN (...)) and at different times. Note that the columns device_id and ts both exist in the dataset, and only ts is used to generate the partitions.
Here's my question:
Will this method of partitioning be efficient in the long term? At the moment we only have about 150 active devices, but we plan to scale to 1,000 and more. Considering that the query pattern would stay the same (get data for some device at a certain time), is it better to partition by device_id and then by date (/device_id/year=YYYY/month=MM/day=DD/file.csv)?
The partitioning is very good for your supplied query -- it will only need to look in one subdirectory for that single day of data.
However, if you were querying for a specific device across all time (without specifying a month/day), then it would not be efficient.
You will need to decide what is going to be more common:
If a specific device will always be queried, then partition by device, then date (a sketch of that layout follows below)
If a specific day/month will always be queried, then your current method is fine (possibly with an additional device partition after day)
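A sketch of the device-first layout, expressed as an Athena DDL submitted through boto3. The column list, serde and S3 location are placeholders and would need to match your actual files.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# External table whose partitions are keyed by device first, then date,
# i.e. s3://.../device_id=123/year=YYYY/month=MM/day=DD/file.csv
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS data_by_device (
    ts timestamp,
    col1 string,
    col2 string
)
PARTITIONED BY (device_id bigint, year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/data_by_device/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# A query for a single device then only has to look under that device's prefix.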

Does "limit" reduce the amount of scanned data on AWS Athena?

I have S3 with compressed JSON data partitioned by year/month/day.
I was thinking that it might reduce the amount of scanned data if I construct the query with filtering that looks something like this:
...
AND year = 2020
AND month = 10
AND day >= 1
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?
Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
It's also recommended to store the data in a column-based format such as Parquet and, additionally, to compress the data. If you store data like that, you can optimize queries by selecting only the columns you need (there is a difference between select * and select col1, col2, ... in this case).
ORDER BY definitely doesn't limit the data that is scanned, as all the values of the columns in the ORDER BY clause have to be read before they can be sorted. Since you have JSON as the underlying storage, it most likely reads all the data.
LIMIT will potentially reduce the amount of data that is read; it depends on the overall size of the data. If the limit is much smaller than the overall row count, it will help.
In general I recommend testing queries in the Athena console in AWS; it shows the amount of scanned data after a successful execution (the same figure can also be read programmatically; see the sketch after the list below). I tested on one of my partitioned tables (based on compressed Parquet):
partition columns in WHERE clause reduces the amount of scanned data
LIMIT further reduces the amount of scanned data in some cases
ORDER BY leads to reading all the partitions again, because the data otherwise can't be sorted
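If it helps, here is a small sketch of reading that scanned-bytes figure programmatically with boto3; the query, database and output location are placeholders.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

start = athena.start_query_execution(
    QueryString="SELECT * FROM my_table "
                "WHERE year = 2020 AND month = 10 AND day >= 1 LIMIT 1",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Poll until the query finishes, then print its state and bytes scanned.
query_id = start["QueryExecutionId"]
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    state = execution["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state, execution["Statistics"].get("DataScannedInBytes"))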

Athena Partition Projection Not Working As Expected

I am moving from registered partitions to partition projection.
Previously my data was partitioned by p_year={yyyy}/p_month={MM}/p_day={dd}/p_hour={HH}/... and I am moving this to p_date={yyyy}-{MM}-{dd} {HH}:00:00/..
I have a recent-events table that stores the last 2 days' worth of events, so my p_date range is NOW-2DAYS,NOW. The full table parameters are:
projection.enabled: 'True'
projection.p_date.type: 'date'
projection.p_date.range: NOW-2DAYS,NOW
projection.p_date.format: 'yyyy-MM-dd HH:mm:ss'
projection.p_date.interval: 1
projection.p_date.interval.unit: 'HOURS'
But when I try to query this, I get no results.
SELECT COUNT(*) FROM recent_events_2d_v2
> 0
However, if I change the date range to 2020-09-01 00:00:00,NOW I do get results.
Something seems off with the relative date ranges with partition projection. Can anyone see what I may be doing wrong, or is this a bug?
You need to change your date format to 'yyyy-MM-dd HH:\'00:00\'' (i.e. literal "00:00" instead of minutes and seconds placeholders).
The way partition projection deals with dates leaves some things to be desired. It seems reasonable that if you say the interval is one hour, the timestamps would get rounded to the nearest hour, but that's not what happens. Athena uses the actual "now" to generate the partition values, and if your date format contains fields for minutes and seconds, those get filled in too.
I assume the reason it worked when you used a hard-coded timestamp is that Athena uses that value as the seed for the sequence, so all the other timestamps end up aligned to the hour as well.
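A sketch of applying the corrected format, assuming the table properties can be changed with ALTER TABLE ... SET TBLPROPERTIES (they can equally be edited on the Glue table directly); note the doubled single quotes needed inside the SQL string literal.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Same projection settings as before, but the format now carries a literal
# "00:00" instead of minute/second placeholders.
alter = """
ALTER TABLE recent_events_2d_v2 SET TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.p_date.type' = 'date',
    'projection.p_date.range' = 'NOW-2DAYS,NOW',
    'projection.p_date.format' = 'yyyy-MM-dd HH:''00:00''',
    'projection.p_date.interval' = '1',
    'projection.p_date.interval.unit' = 'HOURS'
)
"""

athena.start_query_execution(
    QueryString=alter,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)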
If you are sure your bucket p_date={yyyy}-{MM}-{dd} {HH}:00:00/.. contains data, then you need to make sure that partitions are correctly loaded. Try running
MSCK REPAIR TABLE recent_events_2d_v2
and rerun the query.

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use a query while retrieving data from a dataset in AWS IoT Analytics? I want data between two timestamps. I'm using boto3 to fetch the data, but I didn't see any option to pass a query in get dataset content. Below is the boto3 code:
response = client.get_dataset_content(
    datasetName='string',
    versionId='string'
)
Does anyone have suggestions on how to use a query, or how to retrieve the data between two timestamps, in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is; if you have a few more details, that would be helpful.
Possible approaches are:
1) Create a scheduled query to run every hour (for example), where the query looks something like this:
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit how you are storing it (epoch seconds, epoch milliseconds, ISO 8601, etc.). If you set this to run every hour, each time it executes you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window of the dataset, which makes it easier to get just the data that has arrived since the query last ran. You could then simplify your query to look like this:
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general-purpose requirement to fetch the data between two timestamps that you calculate programmatically, and which may not be of the form now() - some interval, the way you could do this is to create a dataset and then update it with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need on each execution. If this is of interest, the Python involved looks roughly like the sketch after this list.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example, the last day) and then filter down to the timestamps you want in code. This is particularly easy if you are using pandas dataframes, and there are plenty of related questions such as this one that have good answers.
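A rough sketch of approach 3, assuming boto3, a dataset called my_dataset backed by a datastore called my_datastore, an action called my_action and a timestamp column called my_timestamp (all placeholders).

import boto3

iot = boto3.client("iotanalytics")

start_ts = "2021-01-03 00:00:00"   # calculated programmatically in practice
end_ts = "2021-01-03 12:00:00"

sql = (
    "SELECT * FROM my_datastore "
    "WHERE __dt >= current_date - interval '1' day "
    f"AND my_timestamp BETWEEN timestamp '{start_ts}' AND timestamp '{end_ts}'"
)

# Replace the dataset's SQL action with the revised time range ...
iot.update_dataset(
    datasetName="my_dataset",
    actions=[{"actionName": "my_action", "queryAction": {"sqlQuery": sql}}],
)

# ... then materialise a new content version. Content creation is asynchronous,
# so in practice you would poll get_dataset_content until status['state'] is SUCCEEDED.
iot.create_dataset_content(datasetName="my_dataset")
content = iot.get_dataset_content(datasetName="my_dataset", versionId="$LATEST")
print(content["entries"][0]["dataURI"])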
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible, the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). Note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.
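And a sketch of the "filter it yourself" approach from the answers above, assuming the dataset content has a my_timestamp column and that pandas is available (the dataset and column names are placeholders).

import boto3
import pandas as pd

iot = boto3.client("iotanalytics")

content = iot.get_dataset_content(datasetName="my_dataset", versionId="$LATEST")
data_uri = content["entries"][0]["dataURI"]   # presigned URL to the CSV result

# Load the CSV straight from the presigned URL and keep only the rows
# between the two timestamps of interest.
df = pd.read_csv(data_uri, parse_dates=["my_timestamp"])
mask = (df["my_timestamp"] >= "2021-01-03 00:00:00") & (
    df["my_timestamp"] <= "2021-01-03 12:00:00")
print(df[mask])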