Most efficient way to filter BigQuery rows by latest date - google-cloud-platform

I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been experiencing issues finding the most cost effective way to apply these transforms on a small selection of the data, typically only the last X number of days from the current max date in the staging data table. For example, we need to calculate the max available date in the staging data, and then retrieve all rows within the past 3 days from this date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third party APIs of varying quality and reliability).
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF and then discarding rows more than 3 days older than this 'max date'. This proved to be very time consuming and inefficient in terms of cost.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (SELECT _PARTITIONTIME as pt FROM
`{project}.{dataset}.{table}`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME DESC
LIMIT 1)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But upon preview of the GB/estimated cost of the query, it seems very inefficient and expensive.
The next thing we tried was hard coding the date, which for some reason is a lot cheaper/quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table, and update the hard coded date in the view SQL via the Python SDK each time the staging data successfully completes (https://cloud.google.com/bigquery/docs/managing-views).
It feels like we are potentially missing a much easier/more efficient solution to this problem. So I wanted to ask:
Is it more cost effective carrying out this initial filtering by date in Dataprep or in BigQuery?
What is the most cost effective way of filtering the data in the chosen product?

Are you familiar with the MERGE statement of standard SQL and the clustering feature released? that could actually merge your data and you can further customize it to read only some partitions.
Example from manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
hint: you can partition by null, and leverage only the 'clustering level'

Related

Does Power BI support Incremental refresh for expanded table?

I have made table as follows (https://apexinsights.net/blog/convert-date-range-to-list):
In this scenario suppose I configure Incremental refresh on the Start Date column, then will Power BI support this correctly. I am asking because - say the refresh is for last 2 days or last 2 months, then it will fetch the source rows, and apply the transform to the partition. But my concern is that I will have to put the date param filter on Start date prior to non folding steps so that the query folds (alternatively power query will auto apply date filter so that query can fold).
So when it pulls the data based on start date and apply the transforms then I'm not able to think clearly about what kind of partitions it will create. Whether it is for Start date or for expabded date. Is query folding supported in this scenario?
This is a quite complicated scenario, where I would probably just avoid adding incremental refresh.
You would have to use the RangeStart/RangeEnd parameters twice in this query. Once that gets folded to the data source to retrieve ranges that overlap with the [RangeStart,RangeEnd) interval and a second time after expanding the ranges to filter out individual rows that fall outside [RangeStart,RangeEnd).

Is there a query or a page where I can see the details of my cost specific to Bigquery?

I can see my total BigQuery cost from the "billing" section.
However, I need to see data such as,
Which table costs me how much? I mean, I need to see the cost of each table individually.
How much cost has been created by the queries made to that table in the last month?
etc.
I would be very happy if you could help with this. I have too many tables to calculate the cost based on the dimensions of the individual tables.
I have published an article about Reducing your BigQuery bills with BI Engine capacity orchestration
which features a query like:
DECLARE var_day STRING DEFAULT '2021-09-09';
SELECT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime,
round(5* (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes/POWER(2,40) ),2) AS processedBytesCostProjection,
round(5* (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes/POWER(2,40) ),2) AS billedBytesCostInUSD
FROM
`<dataset_auditlogs>.cloudaudit_googleapis_com_data_access_*`
WHERE
_TABLE_SUFFIX >= var_day and protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime>=TIMESTAMP(var_day)
AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName="query_job_completed"
AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes IS NOT NULL
ORDER BY
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics. totalProcessedBytes DESC
The query uses a flat rate of 5 USD to calculate the cost of a 1TB on-demand query according to the GCP costs table.
The output is this:
by adding another column:
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query
you will get the raw query that you can use to optimize your query.
If you want to go further you can use
...job.jobStatistics.referencedTables that lists ALL the tables the query touches to actually see and do some filtering on the tables you want.
the json view helps you to identify the right attribute to query and filter on

How to circumvent SPICE limitations (500 M rows) to create a QuickSight dashboard for a big data set?

My goal is to quickly & dynamically visualize a big data set (> 500 M rows) using QuickSight. To achieve quick query times, it's necessary to load all of the data into SPICE. However, AWS currently has a hard limit for the maximum number of rows that can be imported into SPICE for a single data set, which is 500 M rows. I currently don't see any option that could be used to visualize all of the data. Here are things that I already considered:
Splitting the full data set into individual QS datasets: the problem with this approach is that QuickSight requires that each visual has a single dataset as an input, so values from multiple datasets cannot be shown in the same visual. I'm aware that multiple datasets can be used within one dashboard but that would not suit the use-case of having a single plot visualizing the data.
Pivoting the table: the input table has a lot of rows, so changing the format from long to wide table would circumvent the SPICE row limitations. However, QuickSight doesn't seem to support using an array of columns a y-values to be plotted.
Creating a dataset per visualization: Certain visualizations can theoretically be defined using fewer values than in the original data set. For example, to create a box plot over a set of groups, we mainly need the quartile values for each of the groups to be plotted, rather than the full data set, which would allow us to be below the SPICE limitation. However, QuickSight doesn't allow creating custom plots such as creation of a box plot where quartiles are already pre-processed.
Currently, the only viable approach I see is to create a dashboard per user, since most users would only be interested in a subset of rows from the full data set.
Irrespective of the approach taken, unfortunately, this limitation forces us to do some compromises.
Depending on the number of users, creating a dataset per user might become a headache to manage. So, I would suggest that if possible you use datasets that capture groups of users (example by user group, or user's country).
Pivoting the table might make it harder to build some visuals. As you said, if you pivot multiple values from different rows into an array field, then you would not be able to extract these easily in analyses (you could use string functions and to to extract them that way but there are limitations around this approach too).
Also creating a dataset per visualisation has maintenance overhead in that you would need to update and re-ingest the dataset most times when changing visualisations.
Some other approaches you might consider:
Aggregate multiple rows together Example if your dataset has multiple rows for each user within the same minute, you could aggregate all these into 1 row and summing up values within that minute. The aggregation period should be as large as possible but keep in mind that this will affect the time granularity in your analyses/dashboards
Prune old data If you are more interested in recent data, then you could add a filter to only keep say 1 month of activity. You could then have other non-SPICE (Direct Query) datasets that do not have this restriction but reports would be slower on older data.
Cache in an external database You could load your data into some data warehousing database (such as AWS Redshift) and then not use SPICE in QuickSight. Of course, this will probably get more expensive.

Create a pivot table in Power BI with dates instead of aggregate values?

I have a table of companies with descriptive data about where we are in the sales stage with the company, and the date we entered that specific stage. As can be seen below, the stages are rows in a Process Step column
My objective is to pivot this column so each Process Step is a column, with a date below it, as shown in excel:
I tried to edit the query and pivot the column upon loading it, but for the "aggregate value" column, no matter which column I use as the values column, it results in some form of error
My advice would be not to pivot the table in the query and use measures to get dates that you want. The benefit of not doing so is that you are able to perform all sorts of other analytics. For instance, Sankey chart would be hard to do properly with pivoted table.
To get the pivot table you are showing from Excel, it's as simple as using matrix visual in Power BI and putting Client code in rows and Process Step in Columns, then Effective date in values.
If you need to perform calculations between stages, it's also not too difficult. For instance, you can create a meausure that shows only dates at certain stages, another measure for another stage, and so on. For example:
Date uploaded = CALCULATE(MAX(Table[Effective Date]), FILTER(Table, Table[Process Step] = "Upload"))
Date exported = CALCULATE(MAX(Table[Effective Date]), FILTER(Table, Table[Process Step] = "Export"))
Time upload to export = DATEDIFF([Date uploaded], [Date exported], DAY)
These measures will work in the context of client and assuming there is only one date for the upload step (but no Process step in rows or columns). If another scenario is needed, perhaps a different approach could be taken.
Please let me know if that solves your problem.

Is there any other approach for updating a row in Big Query apart from overwriting the table?

I have a package data with some of its fields as following:
packageid-->string
status--->string
status_type--->string
scans--->record(repeated)
scanid--->string
status--->string
scannedby--->string
Per day, I have a data of 100 000 packages. Total package data size per day becomes 100 MB(approx) and for 1 month it becomes 3GB. For each package, 3-4 updates can come. So do I have to overwrite the package table, every time a package update (e.g. just a change in status field) comes?
Suppose I have data of 3 packages in the table and now the update for 2nd package comes, do I have to overwrite the whole table (deleting and adding the whole data takes 2 transaction per package update)? For 100 000 packages, total transactions will be 10^5 * 10^5 * 2/2.
Is there any other approach for atomic updates without overwriting the table? (as if the table contains 1 million entries and then a package update comes, then overwriting the whole table will be an overhead.)
Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.
Your view would look something like this:
SELECT *
FROM (
SELECT
*,
MAX(<timestamp_column>)
OVER (PARTITION BY <id_column>)
AS max_timestamp,
FROM <table>
)
WHERE <timestamp_column> = max_timestamp
(cribbed from here Return only the newest rows from a BigQuery table with a duplicate items)
If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.
e.g.
Add Data to TABLE_RAW.
Create view TABLE that performs the above query over TABLE_RAW
At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
You cannot update row in BigQuery table. You can only add one
Overwriting table on each and every transaction - kind of doesn't make sense at all from any prospective
I would suggest just adding each and every transaction as new row.
Meantime, if for any reason (storage cost, query cost, query performance etc.) you want to dedup - you can do batch dedup periodically - let's say daily. In this case, having original data partitioned in daily tables will be beneficial. As at each moment you will need only latest Deduped Table and recent Daily table to query latest transaction. And previous days daily table can be deleted if you worry of storage cost
Biquery supports updates now here, and supports transactions also.