How to fetch data for a specific time period in BigQuery - google-cloud-platform

I created my table using hourly time partitioning. I would like to fetch the data that was stored in my table in the last X minutes, e.g. the last 5 minutes.
I tried using this query:
SELECT *
FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE);
But it returns far more rows than expected. I typically store 500 rows every 2 minutes, yet this query returns more than 30,000 rows.

FOR SYSTEM_TIME AS OF is time travel: it returns the whole table as it existed at that point in time, not just the rows added since then. As Samuel mentioned in the comments, to fetch data for a specific time period in BigQuery you can instead filter on a timestamp column, as in the example query below.
SELECT *
FROM `dataset.table`
WHERE col_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
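If you are running this from Python, a minimal sketch with the google-cloud-bigquery client might look like the following; the table and column names follow the examples above and are assumptions, so adjust them to your schema.
from google.cloud import bigquery

# Assumes application default credentials and a TIMESTAMP column named
# col_timestamp, as in the query above; adjust names to match your schema.
client = bigquery.Client()

query = """
    SELECT *
    FROM `mydataset.mytable`
    WHERE col_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
"""

for row in client.query(query).result():
    print(dict(row))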

Related

Power BI: Relative Time Under 5 Hours returns no data

I have a Power BI Desktop dashboard I've created to pull machine data from a local SQL Server. I'm using a relative date/time filter on one of the pages to drill down into data for a live feed; however, with anything under 5 hours on the relative time filter, the data goes blank.
I use 4 log tables for the raw data, each with its own time stamp for each instance. They are related through an ID table containing other general information. In addition, time is related using a calculated table that creates a timeframe of all instances:
[Screenshot: relationship model]
DateTable =
DISTINCT (
    UNION (
        SUMMARIZE ( LogFault, LogFault[Time] ),
        SUMMARIZE ( LogGood, LogGood[Time] ),
        SUMMARIZE ( LogReject, LogReject[Time] ),
        SUMMARIZE ( LogState, LogState[Time] )
    )
)
[Screenshots: relative time filter set to 5 hours vs. 4 hours]
As you can see from the top right of the images, not even the times are pulled onto the page. Is there a limitation in Power BI on the relative time function? That wouldn't make sense to me, given there is a "minutes" option under relative time. Any feedback on this would be appreciated.
For those looking at this in the future: unfortunately Power BI Desktop, along with the service, appears to work only in UTC. The relative date/time filter was therefore filtering on UTC, not my time zone (EST). To resolve this, I had to create a new calculated column next to my distinct time stamps to correct for the time zone. I then used the adjusted time for the relative time filtering, while the charts remained on the original time stamps.
UTC to EST time zone adjust
UTC_AdjustTZ = FORMAT(DateTable[Time]+TIME(4,0,0),"General Date")
[Screenshot: chart after the time zone fix was implemented]
Probably your filter on the Date table doesn't reach the destination table. Normally a filter propagates from the one side to the many side of each relationship in the chain, but in your case, for example:
the filter goes from the Date table to LogReject, and then it can't move on to RejectDefinitions because of the filter direction. You have 2 options here:
1) Change the model relationships: make LogReject the one side and RejectDefinitions the many side, if possible.
OR
2) Set the cross-filter direction to Both in the model.
You need to do this for all the remaining log tables (LogFault-FaultDefinitions, LogState-StateDefinitions).
I hope this solves your problem. Please check that your model is not ambiguous after making those changes.

BigQuery very slow on (seemingly) a very simple query

We use GCP logs, which are exported into BigQuery via a log sink.
We don't have a huge amount of logs, but each record seems to be fairly large.
Running a simple query seems to take a long time in BigQuery. We wonder whether this is normal or whether we are doing something wrong, and whether there is anything we can do to make it a bit more practical to analyze.
For example, the query
SELECT
  FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) AS Melb_time,
  jsonPayload.lg.a,
  jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!
Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It filters to the tables whose _TABLE_SUFFIX falls within the last 3 days only, instead of having BigQuery scan every table from possibly many years of history.
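As a rough illustration, this is how the combined query might look when run from Python with the google-cloud-bigquery client; a dry run reports the bytes that would be scanned, which makes it easy to check that the _TABLE_SUFFIX filter is actually pruning old tables (dataset and field names are copied from the question).
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
      FORMAT_DATETIME("%Y-%m-%d %H:%M:%S",
                      DATETIME(timestamp, "Australia/Melbourne")) AS Melb_time,
      jsonPayload.lg.a,
      jsonPayload.lg.p
    FROM `XXX.webapp_usg_logs.webapp_*`
    WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
    ORDER BY timestamp DESC
    LIMIT 100
"""

# dry_run=True estimates the scan without actually running the query.
dry = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Would process {dry.total_bytes_processed} bytes")

# Run it for real once the estimate looks sensible.
for row in client.query(query).result():
    print(row.Melb_time, row.a, row.p)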

What Calculation can I use to count the number of records per day over a timeline?

I want to show the difference between connections and disconnections over a timeline. If I can create a measure that counts the number of connects or disconnects each day, I can subtract one from the other to get the difference.
As an example, I'm counting how many records share the same disconnect date. I've tried working with the COUNT function, but I'm not sure how to count the number of records for each day.
At this point I'm not sure I have meaningful code to share. :(
A solution would hopefully be able to produce a table that shows the connect or disconnect date along with how many records have that same date.
The meaningful fields would be the Acct and the Disc_dt or Conn_dt.
All accounts will have a connect date but may not have a disconnect date.
ex.
DATE - COUNT
01/01/2018 - 3
02/01/2018 - 5
03/01/2018 - 4
Please have a look at the following question:
How to calculate daily population in DAX
If I understand you correctly, it will give you a solution you can use.

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use a query when retrieving data from a dataset in AWS IoT Analytics? I want data between two timestamps. I'm using boto3 to fetch the data, but I didn't see any option to pass a query to get_dataset_content. Below is the boto3 code:
import boto3

client = boto3.client('iotanalytics')
response = client.get_dataset_content(
    datasetName='string',
    versionId='string'
)
Does anyone have suggestions on how to use a query, or how to retrieve the data between two timestamps, in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is; a few more details would be helpful.
Possible approaches are:
1) Create a scheduled query that runs every hour (for example), where the query looks something like this:
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp depending on how you are storing it (epoch seconds, epoch milliseconds, ISO 8601, etc.). If you set this to run every hour, each time it executes you will get the last hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window feature of the dataset, which makes it easier to get only the data that has arrived since the query last ran. You could then simplify your query to look like:
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general-purpose requirement to fetch the data between 2 timestamps that you are calculating programmatically, and which may not be of the form now() - some interval, you could create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need on each execution. If this is of interest, I can expand upon the actual Python required; a rough sketch follows after this list.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamps you want in code. This is particularly easy if you are using pandas dataframes, and there are plenty of related questions, such as this one, with good answers.
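As a rough illustration of option 3 (not the author's exact code), something along these lines should work with boto3; the dataset name is a placeholder, and the datastore name and timestamp column are carried over from the queries above.
import boto3

iota = boto3.client('iotanalytics')

def refresh_dataset_for_window(dataset_name, start_iso, end_iso):
    """Point the dataset's SQL action at an arbitrary time window, then rerun it."""
    sql = (
        "SELECT * FROM my_datastore "
        f"WHERE my_timestamp BETWEEN '{start_iso}' AND '{end_iso}'"
    )
    # Replace the dataset's SQL action with the new time window.
    iota.update_dataset(
        datasetName=dataset_name,
        actions=[{
            'actionName': 'sqlAction',
            'queryAction': {'sqlQuery': sql},
        }],
    )
    # Equivalent of create-dataset-content: materialise a new version of the results.
    return iota.create_dataset_content(datasetName=dataset_name)['versionId']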
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
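For illustration only (the dataset name, column name, and window are placeholders), that client-side time filtering might look like this with pandas; the dataURI returned by get_dataset_content is a presigned URL that pandas can read directly.
import boto3
import pandas as pd

iota = boto3.client('iotanalytics')

# Omitting versionId returns the most recent dataset content.
content = iota.get_dataset_content(datasetName='my_dataset')
data_uri = content['entries'][0]['dataURI']  # presigned URL to the CSV result

df = pd.read_csv(data_uri, parse_dates=['my_timestamp'])

# Keep only the rows between the two timestamps of interest.
start = pd.Timestamp('2021-01-01T00:00:00')
end = pd.Timestamp('2021-01-02T00:00:00')
window = df[(df['my_timestamp'] >= start) & (df['my_timestamp'] < end)]
print(window.head())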
If this isn't feasible, the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). Note that while it's tempting to use the startTime and endTime parameters of StartPipelineReprocessing, these are only approximate to the nearest hour.
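If you do go down the reprocessing route, a minimal sketch of kicking it off with boto3 follows; the pipeline name and window are placeholders, and the hour-granularity caveat above still applies.
import datetime
import boto3

iota = boto3.client('iotanalytics')

# startTime/endTime are only honoured to roughly the nearest hour.
resp = iota.start_pipeline_reprocessing(
    pipelineName='my_pipeline',
    startTime=datetime.datetime(2021, 1, 1, 0, 0),
    endTime=datetime.datetime(2021, 1, 2, 0, 0),
)
print(resp['reprocessingId'])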

Can I append data to the powerBI dataset rather than replace the whole dataset?

I have 40 million rows in my dataset. Each day I may get an extra 100 rows. Obviously I don't want to have to import the whole 40 million each time I do a data refresh. Is it possible to do an incremental refresh where only the new rows are added?
I don't think an incremental refresh as you describe it is possible yet.
It looks like you can push rows with the Power BI REST API, if you're happy to switch to that.
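For illustration, a minimal sketch of pushing rows with that API, assuming a push dataset (this does not work against a regular imported model), an Azure AD access token, and placeholder IDs and column names:
import requests

dataset_id = "YOUR_DATASET_ID"      # placeholder: must be a push dataset
table_name = "MyTable"              # placeholder table name
access_token = "YOUR_ACCESS_TOKEN"  # Azure AD token authorised for the Power BI API

url = (
    "https://api.powerbi.com/v1.0/myorg/"
    f"datasets/{dataset_id}/tables/{table_name}/rows"
)

# Append only the day's new rows instead of re-importing all 40 million.
payload = {"rows": [{"Date": "2018-01-01T00:00:00", "Value": 100}]}

resp = requests.post(url, json=payload,
                     headers={"Authorization": f"Bearer {access_token}"})
resp.raise_for_status()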
However, you might find this workaround useful:
Split your table and query into two: where date <= 'somedate' and where date > 'somedate'.
Add an "empty query" and use Table.Combine to join your two subtables. Use this as your main table.
Whenever you need to refresh, only refresh the second query (the one with where date > 'somedate').
Every once in a while, when that second query starts taking a long time, change somedate to the current date and do a full refresh.
The feature has now been implemented and is called incremental refresh. Currently it is a Premium-only feature.