I am running a sequence of queries in BigQuery that take data from a table, enrich/transform it, and load it into other tables within the same project.
At a very high level, the query structure looks like this:
WITH query_1 AS (
  SELECT columns FROM Table_A
  WHERE some_condition = some_value
),
query_2 AS (
  SELECT processing_function(columns)
  FROM query_1
)
SELECT * FROM query_2
I am calling this query through Python and specifying the destination table in the query job config.
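Roughly, the Python side looks like this (project, dataset, and table names here are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

sql = """
WITH query_1 AS (
  SELECT columns FROM Table_A
  WHERE some_condition = some_value
),
query_2 AS (
  SELECT processing_function(columns) FROM query_1
)
SELECT * FROM query_2
"""

# Write the query result straight into a destination table via the job config.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.enriched_table",  # placeholder
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.query(sql, job_config=job_config).result()  # wait for the job to finish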
The Table_A mentioned above has about 2 TB of data per day, and I am looping this operation over 10 days. After processing just one day of data, BigQuery gave the following error:
403 Quota exceeded: Your usage exceeded quota for ExtractBytesPerDay. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
I checked the quota and I am indeed pushing against the 10 TB limit I have, but I can't figure out what the ExtractBytesPerDay quota actually covers. I can increase the quota limit, but I would like to evaluate how much I would need to increase it by.
So any direction on which operations count against this quota would be helpful. Does anybody know what the ExtractBytesPerDay quota means?
Your quota seems to be related to exporting data from BigQuery, i.e. extract jobs that write data to a destination such as Cloud Storage.
You should know that there is a newer API, the BigQuery Storage API, which lets you read table data directly and can help mitigate your quota issue.
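For example, to pull query results into Python without an extract job, you can read them over the Storage API; a minimal sketch, assuming the google-cloud-bigquery and google-cloud-bigquery-storage packages (plus pandas and pyarrow) are installed, with placeholder names:
from google.cloud import bigquery
from google.cloud import bigquery_storage

bq_client = bigquery.Client(project="my-project")  # placeholder
bqstorage_client = bigquery_storage.BigQueryReadClient()

rows = bq_client.query("SELECT * FROM `my-project.my_dataset.Table_A`").result()

# Download the result set via the Storage Read API instead of exporting it.
df = rows.to_dataframe(bqstorage_client=bqstorage_client)
print(len(df))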
If that doesn't help, you might want to contact Google Cloud Platform Support to get more insight into how to resolve the error.
Related
It's possible to set query size limits for the BigQuery API at the project and user level, see https://cloud.google.com/bigquery/quotas
As I understand it, this includes BQML. However, the costs of BQ and BQML queries differ significantly. If we set a query size limit of 1 TB per user per day, the user could consume that 1 TB with BQML, which would cost $250, whereas 1 TB of normal BQ queries would cost $5.
Is there a way to set a user query size limit specifically for BQML?
Unfortunately, for on-demand users there is no way to set a query size limit specific to BQML CREATE MODEL statements.
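The closest per-query guard I can point to is the maximum_bytes_billed cap on individual jobs, but note it applies to any query type, not just BQML CREATE MODEL; a minimal sketch with placeholder names:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder

# Fail any single job whose billed bytes would exceed ~1 TB (no charge is incurred).
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**12)

client.query("SELECT 1", job_config=job_config).result()  # placeholder query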
We have transactional data in Cloud Storage encrypted with a customer-managed HSM key. When a user submits a job, a Dataproc cluster spins up and runs some analytics on the data stored in Cloud Storage.
The data is partitioned on a per-day basis and stored in Parquet format (a single file per day). Everything works fine, but behind the scenes there is an extremely large number of calls to the Decrypt operation.
We have already increased the quota to 250K operations per minute, but we keep hitting this threshold. We are currently forced to reduce the number of parallel jobs we can run, which severely impacts our SLA. And because we exhaust the KMS request quota, other products such as BigQuery, whose data is also encrypted, fail to perform their operations, resulting in request failures.
I am getting an exception with the message: "Project 'XXX' exceeded limit for metric cloudkms.googleapis.com/hsm_symmetric_requests."
When I looked at the logs, I could see an enormous number of requests like:
request: {
#type: "type.googleapis.com/google.cloud.kms.v1.DecryptRequest"
name: <KeyPath>
}
from the Cloud Storage service account "service-XXXX@gs-project-accounts.iam.gserviceaccount.com".
I can understand the storage service account asking for decryption of the data, but 250K requests per minute (our current limit) for a single job that operates on at most 300 to 400 files seems far too high to me.
Has anyone faced this kind of challenge? How can it be handled, given that we plan to run a large number of parallel jobs, which will require an even higher "hsm_symmetric_requests" quota? It would also be great if someone could explain this behavior.
Where/how can I easily see how many BigQuery analysis queries have been run per month? And how about overall storage usage and how it changes over time (monthly)?
I've had a quick look at "Monitoring > Dashboards > BigQuery". Is that the best place to explore? It only seems to go back to early October - was that when it was released, or does it only display the last X weeks of data? Trying Metrics Explorer for Queries Count (metric: bigquery.googleapis.com/job/num_in_flight) gave me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08? Odd, as I expect to see a few hundred queries run per week.
Context: it would be good to have a high-level summary of BigQuery usage as the months progress, to give the wider organisation and management an idea of the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Set up the logs export (this uses the Legacy Logs Viewer; a programmatic sketch follows these steps)
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Once the sink is enabled, every query executed will write data access logs into the table "cloudaudit_googleapis_com_data_access_YYYYMMDD" under the BigQuery dataset you selected in your sink.
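If you prefer to create the sink from code rather than the Logs Viewer, here is a rough sketch using the google-cloud-logging Python client (the sink name, filter, and dataset path are placeholders to adapt):
import google.cloud.logging

client = google.cloud.logging.Client(project="my-project")  # placeholder

# Route BigQuery audit log entries into the monitoring dataset.
sink = client.sink(
    "bigquery-usage-sink",  # placeholder sink name
    filter_='protoPayload.serviceName="bigquery.googleapis.com"',  # example filter
    destination="bigquery.googleapis.com/projects/my-project/datasets/training_big_query",
)
sink.create()
print("Created sink:", sink.name)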
Here is a sample query to get billed bytes per user:
#standardSQL
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.metadataJson AS metadataJson,
CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
"$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes,
FROM
`myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
principalEmail,
SUM(totalBilledBytes) AS billed_bytes
FROM
data
WHERE
JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
NOTES:
You can only track usage starting from the date when you set up the logs export
A new "cloudaudit_googleapis_com_data_access_YYYYMMDD" table is created each day to hold that day's logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what it provides for BigQuery by default, the only alternative is to create your own customized charts and dashboards that satisfy your needs. You can achieve that using the Monitoring Query Language (MQL); with MQL you can do what you described in your question. Here are links with more detailed information:
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language
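If you would rather pull numbers from a script than build charts, the same BigQuery metrics can also be read through the Cloud Monitoring API client. The rough sketch below sums scanned bytes billed per day for the project; the project ID is a placeholder and the metric name is my assumption, so swap in whichever BigQuery metric you actually need:
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 30 * 24 * 3600},  # last ~30 days
    end_time={"seconds": now},
)

# Roll the raw samples up into daily sums.
aggregation = monitoring_v3.Aggregation(
    alignment_period={"seconds": 24 * 3600},
    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
)

series = client.list_time_series(
    request={
        "name": project_name,
        # Assumed metric; see the BigQuery section of the GCP metrics list.
        "filter": 'metric.type = "bigquery.googleapis.com/query/scanned_bytes_billed"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    for point in ts.points:
        print(point.interval.end_time, point.value)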
I want to create a dashboard/chart in Google Cloud Monitoring where I can see the total number of rows of my BigQuery table at all times.
With resource type "bigquery_dataset" and metric "uploaded_row_count" I only see the number of new rows per second with aligner "rate".
If I choose "sum" as aligner it only shows the number of new rows added for the chosen alignment period.
I'm probably missing something but how do I see the total number of rows of a table?
PubSub subscriptions have this option with metric "num_undelivered_messages" and also Dataflow jobs with "element_count".
Thanks in advance.
There's an ongoing feature request to expose BigQuery table attributes (such as row count) in GCP's Cloud Monitoring metrics, but there's no ETA for when this feature will be rolled out. Please star and comment on it if you want the feature implemented in the future.
Cloud Monitoring only charts and monitors the (numeric) metric data that your Google Cloud project collects; in this case, the system metrics generated for BigQuery. Looking at the documentation, only the metric for uploaded rows is available, which has the behaviour you're seeing in the chart. The total number of rows, however, is currently not available.
Therefore, as of this writing, what you want is unfortunately not possible due to Cloud Monitoring's limitations for BigQuery; there are only workarounds you can try.
For other readers who are OK with @Mikhail Berlyant's comment, here is a thread on querying table metadata, including row counts.
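One common metadata query for row counts (likely what that thread describes) reads the dataset's __TABLES__ view; a small sketch with the Python client, using placeholder project/dataset names:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder

# __TABLES__ exposes per-table metadata, including row_count and size_bytes.
sql = """
SELECT table_id, row_count, size_bytes
FROM `my-project.my_dataset.__TABLES__`
ORDER BY row_count DESC
"""

for row in client.query(sql).result():
    print(row.table_id, row.row_count, row.size_bytes)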
The biggest chunk of my BigQuery billing comes from query consumption. I am trying to optimize this by understanding which datasets/tables consume the most.
I am therefore looking for a way to track my BigQuery usage, ideally something closer to real time (so that I don't have to wait a day before I get the final results). Ideally I would see, for instance, how much each table/dataset consumed in the last hour.
So far I have managed to find the Monitoring dashboard, but it only displays the queries in flight per project and the stored bytes per table, which is not what I am after.
What other solutions are there to retrieve this kind of information?
Using Stackdriver logs, you could create a sink with a Pub/Sub topic as the target for real-time analysis, filtering only the BigQuery logs, like this:
resource.type="bigquery_resource" AND
proto_payload.method_name="jobservice.jobcompleted" AND
proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes:*
(see example queries here: https://cloud.google.com/logging/docs/view/query-library?hl=en_US#bigquery-filters)
You could create the sink on a specific project, a folder or even an organization. This will retrieve all the queries done in BigQuery in that specific project, folder or organization.
The field proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes will give you the number of bytes processed by the query.
Based on on-demand BigQuery pricing (as of now, $5/TB for most regions, but check your own region), you could easily estimate the billing in real time. You could create a Dataflow job that aggregates the results in BigQuery, or simply consume the destination Pub/Sub topic with whatever job you want to perform the pricing calculation:
jobPriceInUSD = totalBilledBytes / 1_000_000_000_000 * pricePerTB
because 1 TB = 1_000_000_000_000 B. As I said before, pricePerTB depends on the region (see https://cloud.google.com/bigquery/pricing#on_demand_pricing for the exact price). For example, as of the time of writing:
$5/TB for us-east1
$6/TB for asia-northeast1
$9/TB for southamerica-east1
Also, as of now, the first 1 TB each month is free.
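As an illustration, a minimal Pub/Sub subscriber applying that formula to each exported log entry could look like the sketch below; the project, subscription, and price are placeholders, and the field path mirrors the log filter above in its JSON form, so verify it against an actual message:
import json
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"         # placeholder
SUBSCRIPTION_ID = "bq-usage-sub"  # placeholder, attached to the sink's topic
PRICE_PER_TB_USD = 5              # adjust to your region

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    entry = json.loads(message.data)
    # Same path as the log filter above, in the JSON (camelCase) form of the entry.
    stats = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]["jobStatistics"]
    billed_bytes = int(stats.get("totalBilledBytes", 0))
    price_usd = billed_bytes / 1_000_000_000_000 * PRICE_PER_TB_USD
    print(f"billed_bytes={billed_bytes} est_price=${price_usd:.4f}")
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result()  # block and process messages as they arrive
except KeyboardInterrupt:
    streaming_pull.cancel()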
It might be easier to use the INFORMATION_SCHEMA.JOBS_BY_* views because you don't have to set up the Stackdriver logging export and can use them right away.
Example taken & modified from How to monitor query costs in Google BigQuery
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
ROUND(SUM(total_bytes_processed) / gb_divisor,2) as bytes_processed_in_gb,
ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor,4) as cost_in_dollar,
user_email
FROM (
(SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
UNION ALL
(SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
GROUP BY
user_email
Some caveats:
you need to UNION ALL all of the projects that you use explicitly
JOBS_BY_USER did not work for me on my private account (presumably because my login email is @googlemail and BigQuery stores my email as @gmail)
the WHERE condition needs to be adjusted for your billing period (instead of the last 30 days)
doesn't provide the "bytes billed" information, so we need to determine those based on the cache usage
doesn't include the "if less than 10MB use 10MB" condition
data is only retained for the past 180 days
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5; reflects only US costs - other regions might have different costs - see https://cloud.google.com/bigquery/pricing#on_demand_pricing
you can only query one region at a time
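If you also need the per-table attribution asked about in the question, the JOBS views expose a referenced_tables column you can UNNEST. A rough sketch follows (region and project are placeholders, JOBS_BY_PROJECT needs permission to list all jobs in the project, and a job's bytes are counted once per referenced table, so the split is approximate):
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder

# Approximate per-table consumption over the last 30 days.
sql = """
SELECT
  ref.project_id,
  ref.dataset_id,
  ref.table_id,
  SUM(total_bytes_processed) AS bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE DATE(creation_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3
ORDER BY bytes_processed DESC
"""

for row in client.query(sql).result():
    print(row.project_id, row.dataset_id, row.table_id, row.bytes_processed)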