I have a BigQuery table, partitioned by date (there is one partition for every day).
I would like to add various columns that are sometimes populated and sometimes missing, as well as a column for a unique id.
The data needs to be searchable by that unique id. The other use case is to aggregate per column.
This unique id will have a cardinality of millions per day.
I would like to use the unique-id for clustering.
Is there any limitation on this? Has anyone tried it?
It's a valid use case to enable clustering on an id column; the number of distinct values shouldn't cause any limitations.
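For illustration, a minimal DDL sketch of such a table (project, dataset, table, and column names are placeholders):
CREATE TABLE `my_project.my_dataset.events`
(
  event_date DATE,
  unique_id STRING,
  some_optional_value STRING
)
PARTITION BY event_date
CLUSTER BY unique_id;
BigQuery allows up to four clustering columns, so you could also add one or two of the columns you aggregate on after unique_id if that matches your query patterns.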
I can see my total BigQuery cost from the "billing" section.
However, I need to see data such as:
Which table costs me how much? That is, I need to see the cost of each table individually.
How much cost has been created by the queries made to that table in the last month?
etc.
I would be very happy if you could help with this. I have too many tables to work out the cost from the size of each individual table.
I have published an article about Reducing your BigQuery bills with BI Engine capacity orchestration
which features a query like:
DECLARE var_day STRING DEFAULT '2021-09-09';
SELECT
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime,
  ROUND(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes / POWER(2, 40)), 2) AS processedBytesCostProjection,
  ROUND(5 * (protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / POWER(2, 40)), 2) AS billedBytesCostInUSD
FROM
  `<dataset_auditlogs>.cloudaudit_googleapis_com_data_access_*`
WHERE
  _TABLE_SUFFIX >= var_day
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.createTime >= TIMESTAMP(var_day)
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes IS NOT NULL
ORDER BY
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalProcessedBytes DESC
The query uses a flat rate of 5 USD per TiB processed to estimate the cost of an on-demand query, in line with the GCP pricing table.
The output lists each completed query job with its create time and the projected and billed cost columns. By adding another column:
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query
you also get the raw query text, which you can use to optimize the query.
If you want to go further, you can use
...job.jobStatistics.referencedTables, which lists ALL the tables a query touches, so you can see and filter on the specific tables you want.
The JSON view of the audit log entries helps you identify the right attribute to query and filter on.
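As a rough sketch, a per-table cost breakdown could look like the query below (same placeholder audit-log dataset as above; verify the field names against your own export in the JSON view). Note that it attributes a job's full billed bytes to every table the job references, so queries touching several tables are counted more than once:
DECLARE var_day STRING DEFAULT '2021-09-09';
SELECT
  ref.projectId,
  ref.datasetId,
  ref.tableId,
  ROUND(5 * SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POWER(2, 40), 2) AS billedBytesCostInUSD
FROM
  `<dataset_auditlogs>.cloudaudit_googleapis_com_data_access_*`,
  UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) AS ref
WHERE
  _TABLE_SUFFIX >= var_day
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = "query_job_completed"
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes IS NOT NULL
GROUP BY
  ref.projectId, ref.datasetId, ref.tableId
ORDER BY
  billedBytesCostInUSD DESC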
I'm asking about the best practice / industry standard for these types of jobs. This is what I've been doing:
The end goal is to have a replica of the data in BigQuery
Get the data from an API (incrementally, using the previous watermark on a field like updated_at)
Batch load into a native BigQuery table (the main table)
Run an update-ish query, like this:
SELECT * EXCEPT (_rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY <id> ORDER BY updated_at DESC) AS _rn
  FROM <main_table>
)
WHERE _rn = 1
Essentially, I only keep the rows that are most up to date. I'm opting for a table instead of a view to facilitate downstream usage.
This method works for small tables, but as the volume increases it runs into some issues:
The whole table gets recreated, whether it is partitioned or not
If partitioned, I could easily run into quota limits
I've also looked at other methods, including loading into a staging table and then performing a MERGE between the staging table and the main table (sketched below).
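For reference, a minimal sketch of what I mean by the staging MERGE (all table and column names other than updated_at are placeholders):
MERGE `my_project.my_dataset.main_table` AS main
USING `my_project.my_dataset.staging_table` AS staging  -- the freshly loaded increment
ON main.id = staging.id
WHEN MATCHED AND staging.updated_at > main.updated_at THEN
  UPDATE SET updated_at = staging.updated_at, payload = staging.payload
WHEN NOT MATCHED THEN
  INSERT (id, updated_at, payload)
  VALUES (staging.id, staging.updated_at, staging.payload)
The staging table would need to be deduplicated per id first (e.g. with the same ROW_NUMBER trick), since MERGE errors out when more than one source row matches the same target row.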
So I'm asking for advice on what your preferred methods/patterns/architecture are to achieve the end goals.
Thanks
Is there a way to calculate the final size of a BigQuery table based on the size of the Cloud Storage data?
For example, an 80 GB bucket might turn into a 100 GB table.
I want an approximation so I can tell whether the data in a Cloud Storage bucket would come out at less than 100 GB in BQ.
Thanks!
The answer to your question is hard to pin down. It will vary as a function of how the data in the GCS files is stored. If you have 80GB of data and it is stored as CSV, the BQ size will be one value; if it is stored as JSON it will be another value, if it's Avro yet another, and so on. It will also be a function of the schema types of your columns and how many columns you have. Google has documented how much storage (in BQ) is required for each of the data types:
In the docs on BQ Storage Pricing there is a table showing the amount of data required to store different column types.
If I needed to know the resulting BQ size from a file of data, I would determine each of my resulting columns and the average data size for each column; that gives the approximate size of a row in the BQ table. From there, I would multiply that by the number of rows in my source files.
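As a purely illustrative, made-up example: a row with one INT64 (8 bytes), one TIMESTAMP (8 bytes), one FLOAT64 (8 bytes), and one STRING averaging 20 characters (2 bytes + 20 bytes) comes to roughly 46 bytes, so 100 million such rows would land at around 4.6 GB of BigQuery storage.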
Another approach you might want to try is to load some existing files one at a time and see what the "apparent" multiplier is. In theory, that might be a good enough indication for a given set of file/table pairs.
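If you go that route, the loaded table's size can be read directly from BigQuery, for example with something like this (project, dataset, and table names are placeholders):
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POWER(2, 30), 2) AS size_gib
FROM
  `my_project.my_dataset.__TABLES__`
WHERE
  table_id = 'my_loaded_sample'
Dividing size_bytes by the total size of the corresponding source files in the bucket gives you the multiplier.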
I have two tables, both containing GCP billing data, in two different regions. I want to insert one table into the other. Both tables are partitioned by day, and the larger one is being written to by GCP for billing exports, which is why I want to insert the data into the larger table.
I am attempting the following:
Export the smaller table to Google Cloud Storage (GCS) so it can be imported into the other region.
Import the table from GCS into BigQuery.
Use BigQuery SQL to run INSERT INTO dataset.big_billing_table SELECT * FROM dataset.small_billing_table
However, I am running into a lot of issues, as it won't just let me insert (there are repeated fields in the schema, etc.). An example of the dataset can be found here: https://bigquery.cloud.google.com/table/data-analytics-pocs:public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1
Thanks :)
## Update ##
So the issue was exporting and importing the data in Avro format while using schema auto-detect when importing the table back in (timestamps were being detected as integer types).
Solution
Export the small table in JSON format to GCS, use GCS to transfer the files to the other region, then import the JSON files into a BigQuery table and don't use schema auto-detect (i.e. specify the schema manually). After that you can use INSERT INTO with no problems.
I was able to reproduce your case with the example data set you provided. In order to verify both approaches, I used dummy tables generated from the queries below:
Table 1: billing_bigquery
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
WHERE service.description = 'BigQuery' LIMIT 1000
Table 2: billing_pubsub
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
WHERE service.description = 'Cloud Pub/Sub' LIMIT 1000
I will propose two methods for performing this task. However, I must point out that the target and the source table must have the same column names, at least for the columns you are going to insert.
First, I used the INSERT INTO method. I would like to stress that, according to the documentation, if your table is partitioned you must specify the column names you are inserting into. Therefore, using the dummy data shown above, it will be as follows:
INSERT INTO `billing_bigquery`
  (billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits)  # invoice and cost_type omitted
SELECT
  billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits
FROM
  `billing_pubsub`
Notice that for nested fields I only write the top-level field name, for instance service and not service.description, because the whole record is inserted. Furthermore, I did not select all the columns of the target table, but every column I list for the target must also appear in the source's SELECT.
For the second method, you can simply use the Query settings button to append the small_billing_table to the big_billing_table. In the BigQuery console, click More >> Query settings. When the settings window appears, go to Destination table, check Set a destination table for query results, and fill in the fields Project name, Dataset name, and Table name (the destination table's information). Subsequently, under Destination table write preference check Append to table, which according to the documentation:
Append to table — Appends the query results to an existing table
Then you run the following query:
SELECT * FROM <project.dataset.source_table>
After running it, the source table's data should be appended to the target table.