I have a requirement to load data and then, a few minutes later, update the record. How can I achieve that? I am getting:
google.api_core.exceptions.BadRequest: 400 UPDATE or DELETE statement over table dataset.tablename would affect rows in the streaming buffer, which is not supported
Is there any way to flush the data from the streaming buffer to permanent storage?
I tried the option below, but this query hits the same error.
UPDATE dataset.tablename
SET _PARTITIONTIME = CURRENT_TIMESTAMP()
WHERE _PARTITIONTIME IS NULL
Streamed data is not immediately available for operations other than analysis (SELECT) for up to 90 minutes (typically much less). You can check streamingBuffer.oldestEntryTime in the tables.get response to see the age of the oldest row still in the streaming buffer.
https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#dataavailability
As a potential workaround, you could create an independent table holding the desired changes and join it, in a query or view, with the table you're streaming into, so that query results reflect the newer values. Later, once the affected rows have left the streaming buffer, you could use that "change" table to merge the changes into the original table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
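A minimal sketch of such a merge, assuming a hypothetical change table dataset.changes and hypothetical id/status columns (and run only after the affected rows have left the streaming buffer):
MERGE dataset.tablename AS t
USING dataset.changes AS c
ON t.id = c.id
WHEN MATCHED THEN
  UPDATE SET status = c.status   -- apply the queued change
WHEN NOT MATCHED THEN
  INSERT (id, status) VALUES (c.id, c.status)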
Related
Yesterday I scheduled a daily overwrite of a table. The new table is partitioned, as is the table it overwrites... It did not run at the scheduled time, nor did it give an error... It just did not start.
My feeling is that it has to do with the partitioning option. For what it's worth, the cast of the field date_formatted that will be used as the partition field works fine.
As far as I know, when scheduling a query you can't use create or replace table T partitioned by column C as select...
You start from the select... clause, as shown in the image, and I don't know if the problem comes from there.
PS: I had no trouble scheduling an append to a table partitioned by day with this same procedure.
The destination table is in the same dataset.
If the very same query is scheduled to deliver the results to a table with the same name, but in a different dataset (located in the same project), it works.
By the way, the input table and the output table were never the same.
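For reference, this is the kind of statement referred to above; a minimal sketch, assuming a hypothetical source table my_dataset.source_table with a STRING column date_string that gets cast to the partition column:
CREATE OR REPLACE TABLE my_dataset.my_table
PARTITION BY date_formatted
AS
SELECT
  CAST(date_string AS DATE) AS date_formatted,  -- the cast used as the partition field
  *
FROM my_dataset.source_table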
I am trying to backfill data from the GCP billing export table to another table, say T1.
Both tables are partitioned.
The scheduled query below runs every day to get yesterday's data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data, say for May 15th - how do I do that?
I tried the backfill feature with the query below, expecting the backfill utility to pass the past date, i.e. May 15th, as the value of @run_date, but that didn't help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = @run_date
The data for May 15th is pulled from the source table (gcp_billing_export_v1) but is stored against the current date in the destination table, i.e. May 15th's data ends up under June 22nd in the destination table T1. Where am I going wrong?
Any guidance ?
Looks like you're using ingestion-time partitioning.
You would need to create a new table with the partitioning you want, e.g. on EventDate, and populate that new table with historical and new daily data, as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table
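A minimal sketch of that, assuming the export's usage_start_time is the timestamp you want to treat as the event date (dataset and destination table names here are illustrative):
CREATE TABLE my_dataset.T1_new
PARTITION BY EventDate
AS
SELECT
  DATE(usage_start_time) AS EventDate,  -- the partition column you actually want
  *
FROM my_dataset.gcp_billing_export_v1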
As @Lemon already pointed out, you're using ingestion-time partitioned tables (both source and destination), so you need to understand how they work. Ingestion-time partitioned tables behave differently from regular (column-) partitioned tables.
From the documentation:
When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudo-column named _PARTITIONTIME. The value of this column is the ingestion time for each row.
Since you are using SELECT * FROM gcp_billing_export_v1, you are getting all the data but without the _PARTITIONTIME column. When you save that result into the destination table, BigQuery sets _PARTITIONTIME to the destination table's own ingestion time.
Thus you end up with old data carrying the current date in _PARTITIONTIME.
To avoid this, your destination table needs to be either a normal table or a regular (column-) partitioned table.
You also need an extra column to hold the timestamp value from the source's _PARTITIONTIME column. You can then create a regular partition on this new column.
To get _PARTITIONTIME into your result set, you need to name the pseudo-column explicitly in your query:
SELECT *, _PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = @run_date
The above query returns all the data from the gcp_billing_export_v1 table with one extra column, ingestionTime.
Now you can backfill the data for May 15th and save it to the new table.
You can also tweak the query below to achieve the same:
SELECT *, _PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(@run_date, INTERVAL -1 DAY)
It will run daily as per your need. Now if you want to pull the data for May 15th, you have to schedule the backfill for May 16th (as per the WHERE clause).
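A minimal sketch of such a destination table, seeded with the existing history and partitioned on the new column (dataset and table names are illustrative); the scheduled query above can then keep appending into it:
CREATE TABLE my_dataset.T1_partitioned
PARTITION BY DATE(ingestionTime)
AS
SELECT *, _PARTITIONTIME AS ingestionTime
FROM my_dataset.gcp_billing_export_v1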
I have set up a Power BI dataset with incremental refresh following this guide https://learn.microsoft.com/en-us/power-bi/connect-data/incremental-refresh-configure and ensured that all tables filter on RangeStart > x and RangeEnd <= x, so that only one side has the =. I continued to investigate https://learn.microsoft.com/en-us/power-bi/connect-data/incremental-refresh-troubleshoot and noticed there is a comment:
With a refresh operation, only data that has changed at the data source is refreshed in the dataset. As the data is divided by a date, it’s recommended post (transaction) dates are not changed.
Which to me sounds extremely limiting. Our data has two date fields, LastModified and RowCreatedAt, which are both date/time columns. LastModified is the real date/time of the last modification to the data in the row. RowCreatedAt is the real date/time of when that modification was persisted to the database. These can be very different (e.g., if a customer is new but has legacy data, the LastModified date may be very old while RowCreatedAt is very recent).
I decided to go with the RowCreatedAt value since that is something we control (e.g., if we were to use LastModifiedDate and load in historical data, it would never be imported to Power BI after the initial refresh). Both the LastModifiedDate and RowCreatedAt fields are updated when data changes in the system (e.g., a sales order gets a new line item added to it).
My expectation was that when data changed and the partition date was updated, it would properly update the data in the dataset (e.g., remove the old row and insert the new row, since the primary key is the same but the other data has changed). This seems like completely normal and expected behavior, but from the documentation it sounds like you can only import data that will never change, or you have to refresh your history back to the point where the change occurred. This seems like a crazy limitation (e.g., who has data that never changes?), so I'm hopefully just misunderstanding something.
As far as I understand, UPDATEs and DELETEs work on partitioned tables with a streaming buffer as long as the query does not touch any records in the streaming buffer. Otherwise, the following error is reported:
UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
The issue is similar to the one discussed in this question; however, it concerns column-partitioned tables, not ingestion-time partitioned tables.
The problem is that while ingestion-time partitioned tables have a way to ignore data in the streaming buffer via conditions on _PARTITIONTIME, this is not available for column-partitioned tables. Are there any other approaches that would allow DML statements to ignore streaming-buffer data?
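For context, this is the kind of condition that works on an ingestion-time partitioned table, because rows still in the streaming buffer have a NULL _PARTITIONTIME; the table and column names below are hypothetical:
UPDATE my_dataset.my_ingestion_partitioned_table
SET status = 'processed'
WHERE status = 'pending'
  AND _PARTITIONTIME IS NOT NULL  -- skips rows still sitting in the streaming buffer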
At the moment you can only use Legacy SQL to get information about the streaming buffer.
Inspect the data sitting in the streaming buffer like this:
#legacySQL
SELECT MIN(partitioned_tstamp) AS min_tstamp,
       MAX(partitioned_tstamp) AS max_tstamp,
       COUNT(1) AS lines
FROM [my_dataset_id.mystreaming_data_table$__UNPARTITIONED__]
And get a summary of all partitions in the table like this:
#legacySQL
SELECT *
FROM [my_dataset_id.mystreaming_data_table$__PARTITIONS_SUMMARY__]
I have no idea why this isn't supported yet in standard SQL or when it will be.
I have package data with some of its fields as follows:
packageid   --> string
status      --> string
status_type --> string
scans       --> record (repeated)
    scanid    --> string
    status    --> string
    scannedby --> string
Per day, I have data for 100,000 packages. The total package data size per day is about 100 MB, and for one month it is about 3 GB. For each package, 3-4 updates can come in. So do I have to overwrite the package table every time a package update (e.g. just a change in the status field) arrives?
Suppose I have data for 3 packages in the table and an update for the 2nd package comes in: do I have to overwrite the whole table (deleting and re-adding all the data takes 2 transactions per package update)? For 100,000 packages, the total transactions would be on the order of 10^5 * 10^5 * 2 / 2.
Is there any other approach for atomic updates without overwriting the table? (If the table contains 1 million entries and a package update comes in, overwriting the whole table is a lot of overhead.)
Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.
Your view would look something like this:
SELECT *
FROM (
  SELECT
    *,
    MAX(<timestamp_column>)
      OVER (PARTITION BY <id_column>) AS max_timestamp
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp
(cribbed from here: Return only the newest rows from a BigQuery table with duplicate items)
If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.
e.g.
Add data to TABLE_RAW.
Create a view TABLE that performs the above query over TABLE_RAW.
At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
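Expressed as SQL rather than a job write disposition, that last step might look like this sketch (names are hypothetical; TABLE_VIEW stands for the view above, and the result goes to a separate stable table since a view cannot be overwritten by a table of the same name):
CREATE OR REPLACE TABLE my_dataset.TABLE_STABLE AS
SELECT *
FROM my_dataset.TABLE_VIEW  -- materializes the deduplicated rows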
Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
You cannot update a row in a BigQuery table. You can only add one.
Overwriting the table on each and every transaction doesn't really make sense from any perspective.
I would suggest just adding each and every transaction as a new row.
Meantime, if for any reason (storage cost, query cost, query performance, etc.) you want to dedup, you can do a batch dedup periodically, let's say daily. In this case, having the original data partitioned into daily tables will be beneficial, as at any moment you will only need the latest deduped table and the recent daily table to query the latest transactions. The previous days' daily tables can be deleted if you worry about storage cost.
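A sketch of what that periodic query might look like, assuming a hypothetical deduped_table from the last batch, a hypothetical most recent daily table, and id/ts columns identifying the logical row and its update time:
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn  -- newest row per id wins
  FROM (
    SELECT * FROM my_dataset.deduped_table           -- result of the last batch dedup
    UNION ALL
    SELECT * FROM my_dataset.transactions_20240101   -- illustrative latest daily table
  )
)
WHERE rn = 1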
BigQuery now supports updates via DML statements, and supports transactions as well.