Timezone-related issues in BigQuery (for partitioning and queries)

We have a campaign management system. We create and run campaigns on various channels. When a user clicks or accesses any of the ads (as part of a campaign), the system generates a log. Our system is hosted in GCP, and using the 'Exports' feature the logs are exported to BigQuery.
In BigQuery the log table is partitioned on the 'timestamp' field (the time when the log is generated). We understand that BigQuery stores timestamps in UTC, so the partitions are also based on UTC time.
Using this log table, we need to generate reports per day, for example the number of impressions per day per campaign, and we need to show these reports in ET (Eastern Time).
Because the BigQuery table is partitioned by UTC time, a query for an ET day would potentially need to scan multiple partitions. Has anyone addressed this issue, or does anyone have suggestions for optimising storage and queries so that they take full advantage of BigQuery's partitioning feature?
We are planning to use GCP Data Studio for the reports.

BigQuery should be smart enough to prune to the right partitions even when you filter in a different timezone.
For example:
SELECT MIN(datehour) time_start, MAX(datehour) time_end, ANY_VALUE(title) title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE DATE(datehour) = '2018-01-03'
5.0s elapsed, 4.56 GB processed
For this query we processed the 4.56 GB in the 2018-01-03 partition. What if we want to adjust for a day in the US? Let's change the WHERE clause to:
WHERE DATE(datehour, "America/Los_Angeles") = '2018-01-03'
4.4s elapsed, 9.04 GB processed
Now this query is automatically scanning 2 partitions, as it needs to go across days. For me this is good enough, as BigQuery is able to automatically figure this out.
But what if you wanted to permanently optimize for one timezone? You could create a pre-computed, shifted DATE column and use that column to PARTITION BY.
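For example, a minimal sketch of that approach, assuming the log table lives in a dataset called mydataset, the event time column is named event_timestamp, and Eastern Time is the reporting zone (all of these names are placeholders):

-- Materialize a date column already shifted to the reporting timezone,
-- and partition the new table by it so a local-day filter touches one partition.
CREATE OR REPLACE TABLE mydataset.log_table_et
PARTITION BY local_date
AS
SELECT
  *,
  DATE(event_timestamp, "America/New_York") AS local_date
FROM mydataset.log_table;

-- A report for one ET day now prunes to a single partition:
SELECT COUNT(*) AS impressions
FROM mydataset.log_table_et
WHERE local_date = '2018-01-03';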

Related

What are BigQuery data storage strategies?

I have a problem related to storing data in GCP BigQuery. I have a partitioned table that is 10 terabytes in size and growing day by day. How can I store this data with minimum cost and maximum performance?
First option: I can store the last month's data in BigQuery and the rest of the data in GCS.
Second option: Delete everything older than the last month, but this option seems illogical to me.
What do you think about this issue?
BigQuery Table
The best solution is to use a BigQuery table that is partitioned by a useful date column. A large part of this table will be charged at the lower long-term storage rate. If it is possible for your organisation, also consider a lower-cost region for your whole project, because all the data involved needs to be in the same region.
For each query, only the needed time range (partitions) and the needed columns are charged.
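For illustration, a query against a hypothetical table partitioned on event_date (all names are placeholders) is billed only for one month of partitions and the two referenced columns:

SELECT user_id, COUNT(*) AS events
FROM mydataset.big_table
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'  -- only these partitions are scanned
GROUP BY user_id;                                       -- and only user_id + event_date are billed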
GCS files
There is an option to use external tables for files stored in GCS. These have some drawbacks: for each query, the complete data is read and charged. There are some partitioning possibilities using Hive partition keys (https://cloud.google.com/bigquery/docs/hive-partitioned-queries). It is also not possible to precalculate the cost of a query, which is very bad for testing and debugging.
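A rough sketch of such an external table over Hive-partitioned files, with a made-up bucket path:

-- Files are laid out like gs://my-archive-bucket/logs/dt=2023-01-01/part-*.parquet
CREATE EXTERNAL TABLE mydataset.archive_logs
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-archive-bucket/logs/*'],
  hive_partition_uri_prefix = 'gs://my-archive-bucket/logs'
);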
Use cases
If you only need the last month's data for daily reports, it is enough to store that data in BigQuery and the rest in GCS. If you only need to run a query over a longer time range once a month, you can load the data from GCS into BigQuery and delete the table after your queries.
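A sketch of that occasional workflow using the LOAD DATA statement (bucket and table names are placeholders):

-- Pull the archived files back into BigQuery for the monthly long-range analysis...
LOAD DATA INTO mydataset.tmp_history
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-archive-bucket/logs/*']
);

-- ...run the long-range queries, then drop the temporary table to stop paying for its storage.
DROP TABLE mydataset.tmp_history;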

Google BigQuery analysis pricing

I want to analyze about 50 GB of data (constantly growing) using Google BigQuery, but I'm wondering about two things regarding BigQuery pricing and analysis.
My data content (each row)
COLUMN | EXAMPLE VALUE
USER_ID --> unique user ID (e.g. zc5zta5h7a6sr)
BUY_COUNT --> INT (e.g. 35)
TOTAL_CURRENCY --> USD (e.g. $500)
etc.
What I want to show in the chart: the number of unique users with TOTAL_CURRENCY of $1-999 and $1,000-10,000+.
I know that there is $5 pricing for each 1 TB processed in analysis, but:
1) 1 GB of new data will be added to the BigQuery table every day, and I want a live chart that updates with each new batch of data. Will BigQuery bill only for the 1 GB added each day, or will it re-analyze the full 50 GB and bill 50+1 GB every time?
2) Data with the same ID can be added to my constantly updated data set. Is it possible to combine them automatically? For example:
Can I update the BUY_COUNT column for the row where the user id is zc5zta5h7a6sr when that user makes a new purchase? If possible, how will I be billed for it?
Thank you.
BigQuery analysis billing occurs every time you run a query. About your points:
If your query scans the whole table every time, you will be billed for the current size of the table each time the query runs. There are some ways to optimize this, such as materialized views, partitioned tables, building an aggregated table, etc.
If your aggregation is not very complex, a materialized view can help with this point. For example, you can have a raw table with unaggregated data and a materialized view that aggregates BUY_COUNT by user. You will be billed for the bytes scanned during the automatic maintenance plus the bytes scanned every time a query runs on the view.
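A minimal sketch of that idea, assuming a raw table mydataset.purchases_raw with the columns described in the question (all names are illustrative):

-- Incrementally maintained per-user aggregate; queries on it avoid rescanning the 50 GB raw table.
CREATE MATERIALIZED VIEW mydataset.user_totals AS
SELECT
  user_id,
  SUM(buy_count) AS buy_count,
  SUM(total_currency) AS total_currency
FROM mydataset.purchases_raw
GROUP BY user_id;

-- The chart query then reads the small view instead of the raw table.
SELECT
  IF(total_currency < 1000, '$1-999', '$1,000-10,000+') AS bucket,
  COUNT(DISTINCT user_id) AS users
FROM mydataset.user_totals
GROUP BY bucket;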
More info about the pricing: https://cloud.google.com/bigquery/pricing

Why is this happening in the BigQuery Sandbox environment? It only creates empty partitioned tables in BigQuery

It's the same for me as well: when running the DDL statement below to create a partitioned table, it creates empty partitioned tables, and I'm not sure why. This is happening in the BigQuery Sandbox environment.
Could someone please tell me why this is happening?
CREATE OR REPLACE TABLE stack_it.questions_2018_clustered
PARTITION BY DATE(creation_date)
CLUSTER BY tags
AS
SELECT id, title, accepted_answer_id, creation_date, answer_count, comment_count, favorite_count, view_count, tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date BETWEEN '2018-01-01' AND '2019-01-01';
As mentioned in this documentation, in a BigQuery Sandbox environment the tables and partitions expire after 60 days, and this limit cannot be increased unless you upgrade from the Sandbox. The reason for the empty partitioned table is this 60-day limit.
The partition expiration is calculated independently for each partition, based on the partition's time. Since the new table is created with dates that are more than 60 days old (years 2018 and 2019), any partitions with those dates are dropped. I tested this behaviour by creating a table with recent dates (within 60 days), and the new partitioned table was populated with the data as expected. For more information on partition expiration, refer here.
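If you want to confirm this on your side, you can inspect the table's options and its surviving partitions via INFORMATION_SCHEMA (assuming the dataset is stack_it, as in the DDL above):

-- Shows the expiration the Sandbox applied to the table (if any is set on it).
SELECT option_name, option_value
FROM stack_it.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE table_name = 'questions_2018_clustered';

-- Lists whichever partitions survived; for 2018 data this should come back empty.
SELECT partition_id, total_rows
FROM stack_it.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'questions_2018_clustered'
ORDER BY partition_id;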
To remove the BigQuery Sandbox limits, I would suggest you upgrade your project. After you upgrade from the Sandbox, you can still use the free tier, but you can incur charges. To manage BigQuery costs, consider setting up cost controls. More info on upgrading from the Sandbox and updating the expiration periods can be found here.

How best to cache a BigQuery table for fast lookup of individual rows?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6 GB) that may be expected to slowly grow to approximately double its current size.
I need a way to get quick, one-row-at-a-time lookups by id against that aggregate table from a separate event-driven pipeline. I.e., a process is notified that person A just took an action: what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple of strategies:
1) Schedule a dump of the agg table to GCS. Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function that listens to the Pub/Sub topic and inserts rows into Firestore.
2) A long-running script on Compute Engine that just streams the table directly from BQ and runs the inserts. (Seems slower than strategy 1.)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of Dataflow job that performs lookups directly against the BigQuery table? I haven't played with this approach before, so no idea how costly/performant it would be.
5) Some other option I've not considered?
The ideal solution would give me millisecond access to an agg row so that I can append the data to the real-time event.
Is there a clear winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and much less data-consuming. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
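A rough sketch of the clustered aggregate table plus a point lookup (table and column names are placeholders):

-- Cluster the daily aggregate by the lookup key so point queries read only the relevant blocks.
CREATE OR REPLACE TABLE mydataset.user_aggregates
CLUSTER BY user_id
AS
SELECT user_id, COUNT(*) AS actions, MAX(event_time) AS last_seen
FROM mydataset.raw_events
GROUP BY user_id;

-- A lookup for one id scans far less data than on an unclustered table,
-- but still carries BigQuery's per-query latency (roughly a second or more).
SELECT *
FROM mydataset.user_aggregates
WHERE user_id = 'person_a';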
You could also set up exports from BigQuery to Cloud SQL, for sub-second results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, BigQuery can now read straight out of Cloud SQL if you'd like it to be your source of truth for "hot" data:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
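For the federated-read direction, a sketch using EXTERNAL_QUERY, assuming a Cloud SQL connection has already been created (the connection id, table, and inner MySQL query are placeholders):

-- BigQuery reads the hot rows straight out of Cloud SQL through the connection.
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT user_id, actions, last_seen FROM user_aggregates WHERE last_seen > NOW() - INTERVAL 1 DAY'
);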

Read Spanner data from a table that is simultaneously being written to

I'm copying Spanner data to BigQuery through a Dataflow job. The job is scheduled to run every 15 minutes. The problem is that if the data is read from a Spanner table that is also being written to at the same time, some of the records get missed while copying to BigQuery.
I'm using readOnlyTransaction() while reading the Spanner data. Is there any other precaution that I must take while doing this?
It is recommended to use Cloud Spanner commit timestamps to populate columns like update_date. Commit timestamps allow applications to determine the exact ordering of mutations.
Using commit timestamps for update_date and specifying an exact timestamp read, the Dataflow job will be able to find all existing records written/committed since the previous run.
https://cloud.google.com/spanner/docs/commit-timestamp
https://cloud.google.com/spanner/docs/timestamp-bounds
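A sketch of the incremental read, assuming a commit-timestamp column named update_date and a placeholder table name Logs; @last_export and @this_export are query parameters the pipeline tracks between runs:

-- Run inside a read-only transaction pinned to @this_export (an exact timestamp bound),
-- so the next run can safely start from that same point.
SELECT *
FROM Logs
WHERE update_date > @last_export
  AND update_date <= @this_export;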
"if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery"
This is how transactions work. They present a 'snapshot view' of the database at the time the transaction was created, so any rows written after this snapshot is taken will not be included.
As @rose-liu mentioned, using commit timestamps on your rows and keeping track of the timestamp of your last export (available from the ReadOnlyTransaction object) will allow you to accurately select 'new/updated rows since the last export'.