BigQuery very slow on (seemingly) a very simple query - google-cloud-platform

We using GCP logs which being exported into BigQuery using log sink.
We don't have a huge amount of logs but each record seems to be fairly large.
Running a simple query seem to take a lot of time with BigQuery. We wonder is it normal or are we doing anything wrong... And is there anything we can do to make it a bit more practical to analize...
For example, query
SELECT
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
jsonPayload.lg.a,
jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!

Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It will filter to get tables with a TABLE_SUFFIX from within the last 3 days only - instead of having BigQuery look at each table from maybe many years of history.

Related

How to fetch data for a specific time period in BigQuery

Please, I created my table using hour time partition. Please, I would like would like to fetch data that was stored in my table in the last X minutes, eg last 5 minutes.
I tried using this command
SELECT *
FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE);
But, it returns a lot more rows than what is expected. I typically store 500 rows every 2 minutes, but this query is returning more than 30000 rows
As #Samuel mentioned in the comment, below example query can be considered to fetch data for a specific time period in BigQuery.
Select * from `dataset.table`
WHERE col_timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(),
INTERVAL 5 MINUTE)
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

How would one go around creating a due by attribute in redshift

I am currently trying to calculate due by dates in a table by adding the sla time to the time the request was created. From what I am able to understand, the way to go around this is to create a table with the work days and hours and query that table to find the due date. However, redshift does not allow one to declare variables. I was wondering how I would go around creating a work hour table in redshift and if that is not possible, how I would calculate the due date by other means. Thanks!
It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by Creating a scalar Python UDF - Amazon Redshift that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that Scalar UDFs cannot access tables or 'call outside' of Redshift, so it would need to be self-contained.
There is code on the web that shows How to find the number of hours between two dates excluding weekends and certain holidays in Python? BusinessHours package - Stack Overflow. You would need to modify such code to specify the duration rather than finding the duration.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps but would be a bit harder when trying to add workhours to a timestamp.

How can I speed up this Athena Query?

I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.
Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 snappy Parquet files per partition that are anywhere between 1 - 10 MB in size each. When I run the following query:
select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'
It scans 700MB but takes over 3 minutes to display the Athena results. I feel that we have already optimized the file format and partitioning for this data, and so I am unsure how else we can improve the performance if we're just trying to select this data out and display it in a tool like QuickSight.
The select * performance was impacted by the number of files that needed to be scanned, which were all relatively small. Repartitioning and removing the hour partition resulted in an improvement in both runtime (14% reduction) and also data scanned (26% reduction) due to snappy compression getting more gains on larger files.
Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Athena Partition Projection Not Working As Expected

I am moving from registered partitions to partition projection.
Previously my data was partitioned by p_year={yyyy}/p_month={MM}/p_day={dd}/p_hour={HH}/... and I am moving these to p_date={yyyy}-{MM}-{dd} {HH}:00:00/..
I have a recent events table that stores the last 2 days worth of events. And so my p_date range is NOW-2DAYS,NOW. The full table parameters are-
projection.enabled: 'True'
projection.p_date.type: 'date'
projection.p_date.range: NOW-2DAYS,NOW
projection.p_date.format: 'yyyy-MM-dd HH:mm:ss'
projection.p_date.interval: 1
projection.p_date.interval.unit: 'HOURS'
But when I try to query this, I get no results.
SELECT COUNT(*) FROM recent_events_2d_v2
> 0
However, If I change the date range to 2020-09-01 00:00:00,NOW I do get results.
Something seems off with the relative date ranges with partition projection. Can anyone see what I may be doing wrong, or is this a bug?
You need to change your date format to 'yyyy-MM-dd HH:\'00:00\'' (i.e. literal "00:00" instead of minutes and seconds placeholders).
The way partition projection deals with dates leaves some things to be desired. It seems reasonable that if you say the interval is one hour that the timestamps get rounded to the nearest hour, but that's not what happens. Athena will use the actual "now" to generate the partition values, and if your date format contains fields for minutes and seconds, those will be filled in too.
I assume the reason why it worked when you used a hard coded timestamp is that Athena uses that value as the seed for the sequence, and all other timestamps will also be aligned to the hour.
If you are sure your bucket p_date={yyyy}-{MM}-{dd} {HH}:00:00/.. contains data, then you need to make sure that partitions are correctly loaded. Try running
MSCK REPAIR TABLE recent_events_2d_v2
and rerun the query.

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use query while retrieving the data from the dataset in AWS IoT Analytics, I want data between 2 timestamps. Im using boto3 to fetch the data. I didn't see any option to use query in get dataset content Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions how to use query or how rerieve the data between 2 timestamp in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is, if you have a few more details, that would be helpful.
Possible approaches are;
1) Create a scheduled query to run every hour (for example) where the query looks something like this;
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit depending on how you are storing it (epoch seconds, epoch milliseconds, ISO8601 etc. If you set this to run every hour, each time it executes, you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window function of the dataset which lets you get the data that has arrived since the query last ran more easily. You could then simplify your query to look like;
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general purpose requirement to fetch the data between 2 timestamps that you are calculating programatically, and may not be of the form now() - some interval, the way you could do this is to create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need with each execution. If this is of interest, I can expand upon the actual python required.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using panda dataframes for example and there are plenty of related questions such as this one that have good answers.
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.