Is there any other approach for updating a row in Big Query apart from overwriting the table? - sql-update

I have a package data with some of its fields as following:
packageid-->string
status--->string
status_type--->string
scans--->record(repeated)
scanid--->string
status--->string
scannedby--->string
Per day, I have a data of 100 000 packages. Total package data size per day becomes 100 MB(approx) and for 1 month it becomes 3GB. For each package, 3-4 updates can come. So do I have to overwrite the package table, every time a package update (e.g. just a change in status field) comes?
Suppose I have data of 3 packages in the table and now the update for 2nd package comes, do I have to overwrite the whole table (deleting and adding the whole data takes 2 transaction per package update)? For 100 000 packages, total transactions will be 10^5 * 10^5 * 2/2.
Is there any other approach for atomic updates without overwriting the table? (as if the table contains 1 million entries and then a package update comes, then overwriting the whole table will be an overhead.)

Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.
Your view would look something like this:
SELECT *
FROM (
SELECT
*,
MAX(<timestamp_column>)
OVER (PARTITION BY <id_column>)
AS max_timestamp,
FROM <table>
)
WHERE <timestamp_column> = max_timestamp
(cribbed from here Return only the newest rows from a BigQuery table with a duplicate items)
If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.
e.g.
Add Data to TABLE_RAW.
Create view TABLE that performs the above query over TABLE_RAW
At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.

You cannot update row in BigQuery table. You can only add one
Overwriting table on each and every transaction - kind of doesn't make sense at all from any prospective
I would suggest just adding each and every transaction as new row.
Meantime, if for any reason (storage cost, query cost, query performance etc.) you want to dedup - you can do batch dedup periodically - let's say daily. In this case, having original data partitioned in daily tables will be beneficial. As at each moment you will need only latest Deduped Table and recent Daily table to query latest transaction. And previous days daily table can be deleted if you worry of storage cost

Biquery supports updates now here, and supports transactions also.

Related

How to backfill partitioned data in BigQuery?

I am trying to backfill data from GCP billing export table to another table say T1.
Both tables are partitioned.
Below scheduled query runs everyday to get yesterday’s data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data , say for 15th May - how do I do that ?
I tried the backfill feature with the below query - expecting the backfill utility to take the past date i.e. May 15th as a param for the #run_date but that didn’t help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = #run_date
The data is pulled for 15th May from the source table(gcp_billing_export_v1) but is populated against current date in the destination table i.e May 15th May data is populated against June 22nd in the destination table T1. Where am I going wrong ?
Any guidance ?
Looks like you're using ingestion partitioning.
You would need to create a new table with the partitioning you want ie EventDate and populate that new table with historical and new daily data - as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table
As #Lemon already pointed out that you're using Ingestion time partitioned tables(both source and dest), you need to understand how it works. Ingestion time partitioned tables are different from the Regular partitioned tables.
From the Documentation-
When you create a table partitioned by ingestion time,BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudo-column named _PARTITIONTIME.The value of this column is the ingestion time for each row.
Since you are using the SELECT * FROM gcp_billing_export_v1 you are getting all the data but without any _PARTITIONTIME column. And when you are saving the same result into the destination table , it is updating the _PARTITIONTIME column as per destintaion table's data ingestion-time.
Thus you have old data with the current date in _PARTITIONTIME
To avoid this your destination table needs to be either a normal table or regular partitioned table.
Also you need to have an extra column to hold the Datetime value from the source's_PARTITIONTIME column.You can create a regular partition on this new column.
Then to get _PARTITIONTIME in your result set , in your quer you need to mention the column name specifically in your query.
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = #run_date
The above query will return all the data from the gcp_billing_export_v1 table with 1 extra column ingestionTime.
Now you can backfill the data for 15th,May and save it to the new table.
You can also tweak around this below query to achieve the same
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(#run_date, INTERVAL -1 DAY)
It will run daily as per your need .Now if you want to pull data for 15th,May then you have to schedule the backfill for 16th,May(as per the where clause)

Most efficient way to filter BigQuery rows by latest date

I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been experiencing issues finding the most cost effective way to apply these transforms on a small selection of the data, typically only the last X number of days from the current max date in the staging data table. For example, we need to calculate the max available date in the staging data, and then retrieve all rows within the past 3 days from this date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third party APIs of varying quality and reliability).
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF and then discarding rows more than 3 days older than this 'max date'. This proved to be very time consuming and inefficient in terms of cost.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (SELECT _PARTITIONTIME as pt FROM
`{project}.{dataset}.{table}`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME DESC
LIMIT 1)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But upon preview of the GB/estimated cost of the query, it seems very inefficient and expensive.
The next thing we tried was hard coding the date, which for some reason is a lot cheaper/quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table, and update the hard coded date in the view SQL via the Python SDK each time the staging data successfully completes (https://cloud.google.com/bigquery/docs/managing-views).
It feels like we are potentially missing a much easier/more efficient solution to this problem. So I wanted to ask:
Is it more cost effective carrying out this initial filtering by date in Dataprep or in BigQuery?
What is the most cost effective way of filtering the data in the chosen product?
Are you familiar with the MERGE statement of standard SQL and the clustering feature released? that could actually merge your data and you can further customize it to read only some partitions.
Example from manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
hint: you can partition by null, and leverage only the 'clustering level'

Prevent duplicates on insert with billions of rows in SQL Data Warehouse?

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other rows
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other rows
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other rows
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added some statistics and even created a secondary index on RowHash and inserts of any real size (10 million rows, for example) won't run successfully without erroring due to transaction sizes. I've also tried batches of 50,000 and those too are simply too slow.
Two things I can think of that wouldn't have the singleton records you have in your query would be to
Outer join your staging table with the fact table and filter on some NULL values. Assuming You're using Clustered Column Store in your fact table this should be a lot more inexpensive than the above.
Do a CTAS with a Select Distinct from the existing fact table, and a Select Distinct from the staging table joined together with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
Can you partition the 'main' table by EventDate and, assuming new data has a recent EventDate, CTAS out only the partitions that include the EventDate's of the new data, then 'Merge' the data with CTAS / UNION of the 'old' and 'new' data into a table with the same partition schema (UNION will remove the duplicates) or use the INSERT method you developed against the smaller table, then swap the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".

Can I append data to the powerBI dataset rather than replace the whole dataset?

I have 40 million rows in my dataset. Each day I may get an extra 100 rows. Obviously I don't want to have to import the whole 40 million each time I do a data refresh. Is it possible to do an incremental refresh where only the new rows are added?
I don't think incremental update as you describe it is possible yet.
It looks like you can push rows with Power BI REST API, if you're happy to switch to that.
However, you might find this workaround useful:
Split your table and query into two: where date <= 'somedate' and where date >'somedate'
Add an "empty query", use Table.Combine to join your two subtables. Use this as your main table.
Whenever you need to refresh, only refresh the second query (the one with where date >'somedate').
Every once in a while, when that second query starts taking a long time, change somedate to the current date and do a full refresh.
The feature has now been implemented and is called Incremental refresh. Currently it is a premium only feature.

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets this criteria, so the result set is just one row. However, if I run:
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating but it's important to note that an UPDATE in Redshift is a 2 step process:
Each row that will be changed must be first marked for deletion
Then a new version of the data must be written for each column in the table
If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
Actually I don't think RedShift is designed for bulk updates, RedShift is designed for OLAP instead of OLTP, update operations are inefficient on RedShift by nature.
In this use case, I would suggest to do INSERT instead of UPDATE, while add another column of the TIMESTAMP, and when you do analysis on RedShift, you'll need extra logic to get the latest TIMESTAMP to eliminate possible duplicated data entries.