Estimate execution time to run BigQuery if table volume is increased

Estimate execution time to run BigQuery if table volume is increased - google-cloud-platform

I have been asked a requirement to estimate query execution time to run a query in GCP BigQuery. The current table has size of around 1 GB, and the query is straight-forward with simple aggregation. For example:-
select a, sum(b) from table1 group by a;
I can see the execution time for this query, also, I monitored the slots consumed during process which is around 5.
Now I am asked to find the execution time of same query if the table size is increased to 100 GB. I do not have that data now, but have been asked to provide estimate for execution time. Can we derive any method to extrapolate beyond given table to estimate execution time?
I am also assuming this is the only query which I will be executing in a given time. So all slots will be available for this query.

Related

Great Expectation profiling on SparkDF takes a long time when there are many columns

I need to profile data coming from snowflake in Databricks. The data is just a sample of 100 rows but containing 3k+ columns, and will eventually have more rows. When I reduce the number of columns, the profiling is done very fast but the more columns there are, the longer it gets. I tried profiling the sample and after more than 10h and I had to cancel the job.
Here is the code I use
df = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}')
df_ge = ge.dataset.SparkDFDataset(df_sf)
BasicDatasetProfiler.profile(df_ge)
You can test this with any data having a lot of columns. Is this normal or am I doing something wrong?

Basically, GE computes metrics for each columns individually, hence, it make an action (probably a collect) for each column and each metric it computes. collects are the most expensive operations you can have on spark so that is almost normal that, the more columns you have, the longer it takes.

Does "limit" reduce the amount of scanned data on AWS Athena?

I have S3 with compressed JSON data partitioned by year/month/day.
I was thinking that it might reduce the amount of scanned data if construct query with filtering looking something like this:
...
AND year = 2020
AND month = 10
AND day >= 1 "
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?

Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
It's also recommended to store the data in a column based format, like Parquet and additionally compress the data. If you store data like that you can optimize queries by just selecting columns you need (there is a difference between select * and select col1,col2,.. in this case).
ORDER BY definitely doesn't limit the data that is scanned, as you need to scan all of the columns in the order by clause to be able to order them. As you have JSON as underlying storage it most likely reads all data.
LIMIT will potentially reduce the amount of data that is read, it depends on the overall size of the data - if limit is way smaller than the overall count of rows it will help.
In general I can recommend to test queries in the Athena interface in AWS - it will tell you the amount of scanned data after a successful execution. I tested on one of my partitioned tables (based on compressed parquet):
partition columns in WHERE clause reduces the amount of scanned data
LIMIT further reduces the amount of scanned data in some cases
ORDER BY leads to reading the all partitions again because it otherwise can't be sorted

BigQuery very slow on (seemingly) a very simple query

We using GCP logs which being exported into BigQuery using log sink.
We don't have a huge amount of logs but each record seems to be fairly large.
Running a simple query seem to take a lot of time with BigQuery. We wonder is it normal or are we doing anything wrong... And is there anything we can do to make it a bit more practical to analize...
For example, query
SELECT
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
jsonPayload.lg.a,
jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!

Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It will filter to get tables with a TABLE_SUFFIX from within the last 3 days only - instead of having BigQuery look at each table from maybe many years of history.

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use query while retrieving the data from the dataset in AWS IoT Analytics, I want data between 2 timestamps. Im using boto3 to fetch the data. I didn't see any option to use query in get dataset content Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions how to use query or how rerieve the data between 2 timestamp in AWS IoT Analytics?
Thanks,
Pankaj

There could be a few ways to do this depending on what your workflow is, if you have a few more details, that would be helpful.
Possible approaches are;
1) Create a scheduled query to run every hour (for example) where the query looks something like this;
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit depending on how you are storing it (epoch seconds, epoch milliseconds, ISO8601 etc. If you set this to run every hour, each time it executes, you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window function of the dataset which lets you get the data that has arrived since the query last ran more easily. You could then simplify your query to look like;
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general purpose requirement to fetch the data between 2 timestamps that you are calculating programatically, and may not be of the form now() - some interval, the way you could do this is to create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need with each execution. If this is of interest, I can expand upon the actual python required.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using panda dataframes for example and there are plenty of related questions such as this one that have good answers.

Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.

Redshift Query taking too much time

In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
date_trunc('day',
ti) as date,
count(distinct deviceID) AS COUNT
FROM
live_events
WHERE
brandID = 3927
AND ti >= '2017-08-02T00:00:00+00:00'
AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
1
Primary key
brandID
Interleaved Sort Keys
we have set following columns as interleaved sort keys -
brandID, ti, event_name
QUERY PLAN

You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here's some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows does not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.

Some time the issue could be due to locks being acquired by other processes. You can refer: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/

I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury you should store the date in the format you need in an additional column.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js