We have BigQuery projects with various datasets, and for each dataset we want to monitor usage:
the number of queries per dataset, the queries fired against each dataset, and the number of users accessing each dataset.
Is there any way we can monitor BigQuery usage like this?
You can see some metrics here:
https://console.cloud.google.com/monitoring/dashboards/resourceList/bigquery_dataset?project=**[YOUR_PROJECTID_GOES_HERE]**
Some more info here as well: https://cloud.google.com/bigquery/docs/monitoring
You can also enable BigQuery audit logs and query the exported audit tables to get some insights: https://cloud.google.com/bigquery/docs/reference/auditlogs.
For fine-grained monitoring of users, queries and similar details, the audit logs are probably your only option.
Most likely the best choice here is to simply query the job metadata directly in aggregate, through the relevant INFORMATION_SCHEMA views.
See https://cloud.google.com/bigquery/docs/information-schema-jobs for details about the job views, which includes some simple query examples at the end.
The jobs views provide a list of referenced_tables, and you can identify the enclosing datasets from them. You'll likely need to consider how you report on queries that reference multiple datasets, particularly if you are reporting on metrics like bytes scanned or resources utilized.
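As a minimal sketch (assuming your jobs run in the US multi-region and a 30-day window is acceptable), counting queries and distinct users per referenced dataset could look like the following; note that a query touching tables in several datasets is counted once per dataset:
#standardSQL
-- Sketch only: queries and distinct users per referenced dataset, last 30 days.
-- Adjust the region qualifier to match where your jobs run.
SELECT
  ref.dataset_id,
  COUNT(DISTINCT job_id) AS query_count,
  COUNT(DISTINCT user_email) AS user_count
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY ref.dataset_id
ORDER BY query_count DESC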
Related
Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is using Athena and querying the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, in this article there do not seem to be any indications or suggestions regarding what constitutes a "large amount of metadata", or exactly which errors could occur when the metadata is large and one needs to query it.
My question is: how do I deal with the ever-growing size of the Data Catalog's metadata so that I never encounter errors when using Athena to query it?
Is there a best practice for this? Or is there a better way to get the same metadata that querying the catalog through Athena provides, without making a great many API calls (using boto3, Hive DDL, etc.)?
I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema is built at query execution time; there doesn't seem to be any caching.
If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me that you should be fine as long as you have fewer than roughly 10,000 tables, which should be the case for most people.
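To make this concrete, here is a minimal sketch of the kind of Athena query involved (assuming the default catalog); keep in mind that every schema it touches triggers separate Glue API calls at execution time:
-- Sketch only: count tables per Glue database via Athena's information_schema.
SELECT table_schema, COUNT(*) AS table_count
FROM information_schema.tables
WHERE table_schema <> 'information_schema'
GROUP BY table_schema
ORDER BY table_count DESC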
I am working on a pipeline that takes data and does some partitioning on it. I am trying to load some data into a BigQuery table on GCP, but I got "Too many partitions produced by query, allowed 4000, query produces at least 10000 partitions". I understand that it's a limitation of BigQuery, and I have found multiple proposed solutions, such as clustering the data or partitioning by week instead of day. The problem is that I have no visibility into the data itself, so I cannot do this. If there are any other ideas, please help.
Also, for the sake of investigation and analysis, how can I know how many BigQuery jobs were submitted? Is there a way to get the number of BigQuery jobs submitted by a specific Dataflow job?
Thanks
You can view the jobs created by a particular Dataflow job by navigating to the Google Cloud Console and clicking through to the Dataflow Job UI. Here is the relevant documentation with screenshots.
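If you would rather count the jobs with SQL, a rough sketch against the INFORMATION_SCHEMA jobs view is below; the service-account filter is an assumption (the account name is hypothetical), based on the Dataflow worker service account typically being what submits the BigQuery jobs on the pipeline's behalf:
#standardSQL
-- Sketch only: BigQuery jobs submitted in the last day by the Dataflow
-- worker service account (placeholder account name below).
SELECT job_type, state, COUNT(*) AS job_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND user_email = 'my-dataflow-worker@my-project.iam.gserviceaccount.com'
GROUP BY job_type, state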
Where/how can I easily see how many BigQuery analysis queries have been run per month? How about storage usage overall and changes over time (monthly)?
I've had a quick look at "Monitoring > Dashboards > BigQuery". Is that the best place to explore? It only seems to go back to early October - was that when it was released, or does it only display the last X weeks of data? Trying Metrics Explorer for Queries Count (metric: bigquery.googleapis.com/job/num_in_flight) was giving me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08? Odd, as I expect to see a few hundred queries run per week.
Context: it would be good to have a high-level summary of BigQuery usage, as the months progress, to give the wider organisation and management an idea of the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Set up the logs export (this uses the Legacy Logs Viewer):
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Once the sink is enabled, every query executed will store its data usage logs in the table "cloudaudit_googleapis_com_data_access_YYYYMMDD" under the BigQuery dataset you selected in your sink.
Here is a sample query to get bytes billed per user:
#standardSQL
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.metadataJson AS metadataJson,
    CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes
  FROM
    `myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
  principalEmail,
  SUM(totalBilledBytes) AS billed_bytes
FROM
  data
WHERE
  JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
NOTES:
You can only track the usage starting at the date when you set up the logs export
Table "cloudaudit_googleapis_com_data_access_YYYYMMDD" is created daily to track all logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what it provides for BigQuery by default, the only other alternative is to create your own customized charts and dashboards that satisfy your needs. You can achieve that using Monitoring Query Language (MQL); with MQL you can achieve what you described in your question. Here are the links for more detailed information.
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language
Is there a way to find unused objects (tables, views, etc.) within BigQuery datasets, or objects that are less frequently accessed (like we can run audits in Oracle to find out the same)?
Just like you can run audits in Oracle, you can enable StackDriver logging on BigQuery and run audits from StackDriver.
If you'd also like to use BigQuery syntax to query StackDriver logging, you can export StackDriver logging to BigQuery.
You can export the Stackdriver logs back into BigQuery and run the audit queries against those tables, or
create Stackdriver Monitoring on the same data using custom metrics.
Both of these incur costs.
However, BigQuery automatically lowers the storage cost (long-term storage pricing) for data in tables or partitions that have not been modified in the last 90 days.
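If you export the data_access audit logs to BigQuery (as in the sink setup above), a minimal sketch for spotting rarely read tables could look like this; the project and dataset names are placeholders, and the tableDataRead event assumes the newer BigQueryAuditMetadata log format:
#standardSQL
-- Sketch only: last read time per table, from exported data_access audit logs.
SELECT
  protopayload_auditlog.resourceName AS table_resource,
  MAX(timestamp) AS last_read
FROM
  `my_project.my_audit_dataset.cloudaudit_googleapis_com_data_access_*`
WHERE
  JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
GROUP BY table_resource
ORDER BY last_read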
You can also write a script to find unused BigQuery tables through the log files. The script works by first building the list of tables that have been queried in the last N days (for your chosen number of days),
and then comparing that list against the full list of tables; anything missing from the first list is a candidate for being unused. A sketch of this comparison is shown below.
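A rough sketch of that comparison, assuming the region-us INFORMATION_SCHEMA views, a 90-day window, and a hypothetical dataset called my_dataset (the JOBS views only retain about 180 days of history):
#standardSQL
-- Sketch only: tables in my_dataset not referenced by any job in the last 90 days.
WITH recently_used AS (
  SELECT DISTINCT ref.table_id
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
    UNNEST(referenced_tables) AS ref
  WHERE
    creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    AND ref.dataset_id = 'my_dataset'
)
SELECT t.table_name
FROM `my_project.my_dataset`.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN recently_used AS u
  ON t.table_name = u.table_id
WHERE u.table_id IS NULL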
My team is using BigQuery for our product development. Another bill of Rs 5159 got generated for a single day's transactions.
I checked the transaction details and saw:
BigQuery Analysis: 15.912 Tebibytes [Currency conversion: USD to INR using rate 69.155]
Is it possible to somehow find out more details about the transactions, like the table names, the queries that were executed, and the exact time of execution?
BigQuery automatically sends audit logs to Stackdriver Logging, which provides the ability to do aggregated analysis on the log data. See the BigQuery schema for exported logs for details.
As a quick example: query cost breakdown by identity.
This query shows estimated query costs by user identity. It estimates costs based on the list price for on-demand queries in the US. This pricing may not be accurate for other locations or for customers leveraging flat-rate billing.
#standardSQL
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
  FROM
    `MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_data_access_YYYYMMDD`
)
SELECT
  principalEmail,
  FORMAT('%9.2f', 5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POWER(2, 40))) AS Estimated_USD_Cost
FROM
  data
WHERE
  jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
As of last year, BigQuery provides INFORMATION_SCHEMA tables that also give access to job information via the JOBS_BY_* views. The INFORMATION_SCHEMA.JOBS_BY_USER and INFORMATION_SCHEMA.JOBS_BY_PROJECT views even include the exact query alongside the processed bytes. It might not be 100% accurate (because bytes processed != bytes billed), but it should allow you to gain a good overview of your costs, which queries triggered them, and who the initiator was.
Example
SELECT
  creation_time,
  job_id,
  project_id,
  user_email,
  total_bytes_processed,
  query
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
The most "efficient" way to keep an eye on the cost is using the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view as it automatically includes all projects of the organization. You need to be Organization Owner or Organization Administrator to use that view, though.
From there you can figure out which jobs were the most expensive (i.e. get their job IDs) and then drill down via JOBS_BY_PROJECT to get the exact query.
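A rough sketch of that first step (assuming the region-us view and the organization-level permissions mentioned above) - the ten most expensive query jobs across the organization in the last 7 days:
-- Sketch only: most expensive query jobs across the organization, last 7 days.
SELECT
  project_id,
  job_id,
  user_email,
  total_bytes_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 10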
See https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/ for a more comprehensive explanation.
You need to export Billing Data to BigQuery.
Tools for monitoring, analyzing and optimizing cost have become an important part of managing development. Billing export to BigQuery enables you to export your daily usage and cost estimates automatically throughout the day to a BigQuery dataset you specify.
You can also export the data to a CSV or JSON file; however, regular file export captures a smaller dataset than export to BigQuery. For more information about regular file export and the data it captures, see Export Billing Data to a File.
After you enable BigQuery export, it might take a few hours to start seeing your data. Billing data automatically exports your data to BigQuery in regular intervals, but the frequency of updates in BigQuery varies depending on the services you're using. Note that BigQuery loads are ACID compliant, so if you query the BigQuery billing export dataset while data is being loaded into it, you will not encounter partially loaded data.
Follow the step-by-step guide: How to enable billing export to BigQuery
https://cloud.google.com/billing/docs/how-to/export-data-bigquery
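Once the export is flowing, a minimal sketch against the billing export table (the table name below is a placeholder for your own gcp_billing_export_v1_* table) that sums BigQuery cost per month might look like:
#standardSQL
-- Sketch only: monthly BigQuery cost from the standard billing export table.
SELECT
  FORMAT_TIMESTAMP('%Y-%m', usage_start_time) AS month,
  SUM(cost) AS total_cost
FROM
  `my_project.my_billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE
  service.description = 'BigQuery'
GROUP BY month
ORDER BY month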