AWS IoT Analytics queries for retrieving data from dataset using boto3 - amazon-web-services

Can we use query while retrieving the data from the dataset in AWS IoT Analytics, I want data between 2 timestamps. Im using boto3 to fetch the data. I didn't see any option to use query in get dataset content Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions how to use query or how rerieve the data between 2 timestamp in AWS IoT Analytics?
Thanks,
Pankaj

There could be a few ways to do this depending on what your workflow is, if you have a few more details, that would be helpful.
Possible approaches are;
1) Create a scheduled query to run every hour (for example) where the query looks something like this;
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit depending on how you are storing it (epoch seconds, epoch milliseconds, ISO8601 etc. If you set this to run every hour, each time it executes, you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window function of the dataset which lets you get the data that has arrived since the query last ran more easily. You could then simplify your query to look like;
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general purpose requirement to fetch the data between 2 timestamps that you are calculating programatically, and may not be of the form now() - some interval, the way you could do this is to create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need with each execution. If this is of interest, I can expand upon the actual python required.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using panda dataframes for example and there are plenty of related questions such as this one that have good answers.

Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.

Related

How to fetch data for a specific time period in BigQuery

Please, I created my table using hour time partition. Please, I would like would like to fetch data that was stored in my table in the last X minutes, eg last 5 minutes.
I tried using this command
SELECT *
FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE);
But, it returns a lot more rows than what is expected. I typically store 500 rows every 2 minutes, but this query is returning more than 30000 rows
As #Samuel mentioned in the comment, below example query can be considered to fetch data for a specific time period in BigQuery.
Select * from `dataset.table`
WHERE col_timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(),
INTERVAL 5 MINUTE)
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

How would one go around creating a due by attribute in redshift

I am currently trying to calculate due by dates in a table by adding the sla time to the time the request was created. From what I am able to understand, the way to go around this is to create a table with the work days and hours and query that table to find the due date. However, redshift does not allow one to declare variables. I was wondering how I would go around creating a work hour table in redshift and if that is not possible, how I would calculate the due date by other means. Thanks!
It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by Creating a scalar Python UDF - Amazon Redshift that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that Scalar UDFs cannot access tables or 'call outside' of Redshift, so it would need to be self-contained.
There is code on the web that shows How to find the number of hours between two dates excluding weekends and certain holidays in Python? BusinessHours package - Stack Overflow. You would need to modify such code to specify the duration rather than finding the duration.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps but would be a bit harder when trying to add workhours to a timestamp.

BigQuery very slow on (seemingly) a very simple query

We using GCP logs which being exported into BigQuery using log sink.
We don't have a huge amount of logs but each record seems to be fairly large.
Running a simple query seem to take a lot of time with BigQuery. We wonder is it normal or are we doing anything wrong... And is there anything we can do to make it a bit more practical to analize...
For example, query
SELECT
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
jsonPayload.lg.a,
jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!
Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It will filter to get tables with a TABLE_SUFFIX from within the last 3 days only - instead of having BigQuery look at each table from maybe many years of history.

Most efficient way to filter BigQuery rows by latest date

I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been experiencing issues finding the most cost effective way to apply these transforms on a small selection of the data, typically only the last X number of days from the current max date in the staging data table. For example, we need to calculate the max available date in the staging data, and then retrieve all rows within the past 3 days from this date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third party APIs of varying quality and reliability).
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF and then discarding rows more than 3 days older than this 'max date'. This proved to be very time consuming and inefficient in terms of cost.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (SELECT _PARTITIONTIME as pt FROM
`{project}.{dataset}.{table}`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME DESC
LIMIT 1)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But upon preview of the GB/estimated cost of the query, it seems very inefficient and expensive.
The next thing we tried was hard coding the date, which for some reason is a lot cheaper/quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table, and update the hard coded date in the view SQL via the Python SDK each time the staging data successfully completes (https://cloud.google.com/bigquery/docs/managing-views).
It feels like we are potentially missing a much easier/more efficient solution to this problem. So I wanted to ask:
Is it more cost effective carrying out this initial filtering by date in Dataprep or in BigQuery?
What is the most cost effective way of filtering the data in the chosen product?
Are you familiar with the MERGE statement of standard SQL and the clustering feature released? that could actually merge your data and you can further customize it to read only some partitions.
Example from manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
hint: you can partition by null, and leverage only the 'clustering level'

Add column with difference in days

i'm trying the new Power BI (Desktop) to create a barchart that shows me the duration in days for the delivery of an order.
I have 2 files. 1 with the delivery data (date, barcode) and another file with the deliverystatusses (date, barcode).
I Created a relation in the powerBI relations tab on the left side to create a relation on barcode. 1 Delivery to many DeliveryStatusses.
Now I want to add a column/measure to calculate the number of days before a package is delivered. I searched a few blogs but with no succes.
The function DATEDIFF is only recognized in a measure, and measures seem to work on table date, not rowdata. So adding a column using the DATEDIFF function doesn't work.
Adding a column using a formula :
Duration = [DeliveryDate] - Delivery[OrderDate]
results in an error that the right side is a list (It seems the relationship isn't in place)?
What am I doing wrong?
You might try doing this in the Query window instead since I think each barcode has just one delivery date and one delivery status. You could merge the two queries into a single table. Then you wouldn't need to worry about the relationships... If on the other hand you can have multiple lines for each delivery in the delivery status table, then you need to get more fancy. If you're only interested in the last status (as opposed to the history of status) you could again use the Query windows to group the data. If you need the full flexibility, you'd probably need to create a Measure that expresses the logic you want.
The RELATED keyword is used to reference another table. Update your query as follows and it should work.
Like this:
Duration = [DeliveryDate] - RELATED(Delivery[OrderDate])