How to find unused / unaccessed objects in BigQuery - google-cloud-platform

Is there a way to find unused objects (tables, views, etc.) within datasets in BigQuery, or objects that are accessed less frequently (similar to the audits we can run in Oracle to find the same)?

Just like you can run audits in Oracle, you can enable Stackdriver logging on BigQuery and run audits from Stackdriver.
If you'd also like to use BigQuery syntax to query Stackdriver logging, you can export the Stackdriver logs to BigQuery.

You can export the Stackdriver logs into BigQuery and run the audit queries against the resulting tables.
You can also create Stackdriver Monitoring on the same data using custom metrics.
Both of these incur costs.
However, BigQuery automatically lowers the storage cost of tables or partitions that have not been modified in the last 90 days.

You can also write a script that finds unused BigQuery tables from the log files. The script builds the list of tables that have been queried in the last N days, then compares that list against the full list of tables to identify the ones that were never accessed.
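As a rough illustration of that comparison, here is a minimal sketch that uses the INFORMATION_SCHEMA.JOBS_BY_PROJECT view (discussed further down this page) instead of parsed log files; the dataset name my_dataset, the region-us qualifier and the 90-day window are assumptions you would adjust:
WITH referenced AS (
  -- Every table referenced by a query job in the last 90 days
  SELECT DISTINCT rt.project_id, rt.dataset_id, rt.table_id
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
       UNNEST(referenced_tables) AS rt
  WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
-- Tables in the dataset that never appear in that list
SELECT t.table_catalog AS project_id, t.table_schema AS dataset_id, t.table_name
FROM `my_dataset`.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN referenced AS r
  ON r.project_id = t.table_catalog
  AND r.dataset_id = t.table_schema
  AND r.table_id = t.table_name
WHERE r.table_id IS NULL
Note that the JOBS views only retain roughly the last 180 days of job history, so the lookback window cannot reach further back than that.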

Related

BigQuery and Data Studio usage (cost)

I wanted to view the BigQuery costs in my project. I am exporting logs to a table using the following filter:
resource.type = "bigquery_resource"
protoPayload.methodName = "jobservice.jobcompleted"
However, when I view the data, the information about refreshing a table in Data Studio is not reflected there. That data only appears with the filter:
protoPayload.serviceName = "bigquerybiengine.googleapis.com"
However, those entries contain no usage figures, only information about which data range was accessed. How can I see the data consumption caused by refreshing reports in Data Studio?
To analyze Data Studio report and query costs you can use Cloud Audit Logs for BigQuery: export the event data to BigQuery and analyze it there.
Create a sink in Cloud Logging; this will stream all BigQuery query_job_completed log events from the Cloud Audit Logging service into your BigQuery table.
When the BigQuery event data is flowing into your dataset, you can create a view and query it. You will get totalBilledBytes per query, which can be used to calculate the cost of each query.
You can refer to this documentation for further information.
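A minimal sketch of such a view, assuming the sink writes into a placeholder dataset my_project.audit_logs and the newer metadataJson audit log format (the same format used by the examples further down this page):
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
  CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS total_billed_bytes
FROM
  `my_project.audit_logs.cloudaudit_googleapis_com_data_access_*`
WHERE
  JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
ORDER BY total_billed_bytes DESC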

Monitor BigQuery Performances

We have BigQuery instances with various datasets, and for each dataset we want to monitor the usage:
the number of queries per dataset, the queries fired against each dataset, and the number of users accessing the dataset.
Is there any way in which we can monitor BigQuery usage?
You can see some metrics here:
https://console.cloud.google.com/monitoring/dashboards/resourceList/bigquery_dataset?project=**[YOUR_PROJECTID_GOES_HERE]**
Some more info here as well: https://cloud.google.com/bigquery/docs/monitoring
You can also enable BigQuery audit logs, and query the audit tables to get some insights https://cloud.google.com/bigquery/docs/reference/auditlogs.
For fine-grained monitoring of users, queries and similar details, you will probably only be able to do so using the audit logs.
Most likely the best choice here is to simply query the job metadata directly in aggregate, through the relevant INFORMATION_SCHEMA views.
See https://cloud.google.com/bigquery/docs/information-schema-jobs for details about the job views, which includes some simple query examples at the end.
The jobs views do provide a list of referenced_tables, so you can identify the containing datasets from them. You'll likely need to consider how you report on queries that reference multiple datasets, particularly if you are reporting on metrics like bytes scanned or resources utilized.
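A minimal sketch of that kind of aggregation, assuming the region-us qualifier and a 30-day window; each query is counted once per dataset it references:
WITH jobs_per_dataset AS (
  -- One row per (job, referenced dataset); a query touching several tables
  -- in the same dataset is counted once for that dataset
  SELECT DISTINCT job_id, user_email, rt.project_id, rt.dataset_id
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
       UNNEST(referenced_tables) AS rt
  WHERE job_type = 'QUERY'
    AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
)
SELECT
  project_id,
  dataset_id,
  COUNT(*) AS query_count,
  COUNT(DISTINCT user_email) AS user_count
FROM jobs_per_dataset
GROUP BY project_id, dataset_id
ORDER BY query_count DESC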

Speed up BigQuery query job to import from Cloud SQL

I am performing a query to generate a new BigQuery table of size ~1 TB (a few billion rows), as part of migrating a Cloud SQL table to BigQuery using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up as I may need to perform this migration again.
I see that the PostgreSQL egress is 20Mb/sec, consistent with a job that would take half a day. Would it help if I consider something more distributed with Dataflow? Or simpler, extend my Python code using the BigQuery client to generate multiple queries, which can run async by BigQuery?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
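For reference, the "multiple smaller queries" idea mentioned above could look roughly like the sketch below, with each job copying one id range; the connection name, destination table, source table and id boundaries are placeholders, and the destination table is assumed to already exist:
-- One chunk of the migration; submit several of these with different id ranges
-- as separate jobs (e.g. from the Python client) so BigQuery can run them concurrently
INSERT INTO `my_project.my_dataset.migrated_table`
SELECT *
FROM EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT * FROM source_table WHERE id >= 0 AND id < 100000000')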
I think it is more suitable to use a dump export.
Running a single query over such a large table is inefficient.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use that file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it will not fail.
Refer here for more details about exporting Cloud SQL data to a CSV dump.

BigQuery summary

Where/how can I easily see how many BigQuery analysis queries have been run per month. How about storage usage overall/changes-over-time(monthly)?
I've had a quick look at "Monitoring > Dashboards > Bigquery". Is that the best place to explore? It only seems to go back to early October - was that when it was released or does it only display the last X weeks of data? Trying metrics explorer for Queries Count (Metric:bigquery.googleapis.com/job/num_in_flight) was giving me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08? Odd as I expect to see a few hundred queries run per week.
Context: it would be good to have a high-level summary of BigQuery usage as the months progress, to give the wider organisation and management an idea of the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Setup logs export (this is using the Legacy Logs Viewer)
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Once the sink is enabled, every executed query will write data access logs to the table "cloudaudit_googleapis_com_data_access_YYYYMMDD" in the BigQuery dataset you selected for your sink.
Here is a sample query to get the bytes billed per user:
#standardSQL
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.metadataJson AS metadataJson,
    CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes
  FROM
    `myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
  principalEmail,
  SUM(totalBilledBytes) AS billed_bytes
FROM
  data
WHERE
  JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
NOTES:
You can only track the usage starting at the date when you set up the logs export
Table "cloudaudit_googleapis_com_data_access_YYYYMMDD" is created daily to track all logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what it provides for BigQuery by default, the only other alternative is to create your own customized charts and dashboards that satisfy your needs. You can achieve that using Monitoring Query Language (MQL), which covers the kind of summary you described in your question. Here are the links for more detailed information.
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language

How to retrieve BigQuery billing details from the GCP console or UI?

My team is using BigQuery for our product development. A bill of Rs 5159 got generated for a single day's transactions.
I checked the transaction details and found:
BigQuery Analysis: 15.912 Tebibytes [Currency conversion: USD to INR using rate 69.155]
Is it possible to somehow find out more details about the transactions, like the table names, the queries that were executed and the exact time of execution?
BigQuery automatically sends audit logs to Stackdriver Logging and provides the ability to do aggregated analysis on the log data. See the BigQuery schema for exported logs for details.
As a quick example: query cost breakdown by identity.
This query shows estimated query costs by user identity. It estimates costs based on the list price for on-demand queries in the US. This pricing may not be accurate for other locations or for customers leveraging flat-rate billing.
#standardSQL
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
  FROM
    `MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_data_access_YYYYMMDD`
)
SELECT
  principalEmail,
  FORMAT('%9.2f', 5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POWER(2, 40))) AS Estimated_USD_Cost
FROM
  data
WHERE
  jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
As of last year, BigQuery provides INFORMATION_SCHEMA views that also give access to job information via the JOBS_BY_* views. The INFORMATION_SCHEMA.JOBS_BY_USER and INFORMATION_SCHEMA.JOBS_BY_PROJECT views even include the exact query alongside the processed bytes. It might not be 100% accurate (because bytes processed != bytes billed), but it should allow you to gain a good overview of your costs, which queries triggered them and who the initiator was.
Example
SELECT
creation_time,
job_id,
project_id,
user_email,
total_bytes_processed,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
The most "efficient" way to keep an eye on the cost is using the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view as it automatically includes all projects of the organization. You need to be Organization Owner or Organization Administrator to use that view, though.
From there you can figure out which jobs were the most expensive (= get their job id) and from there drill down via JOBS_BY_PROJECT to get the exact query.
See https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/ for a more comprehensive explanation.
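A minimal sketch of that drill-down; the job id is a placeholder and the region qualifier is an assumption:
SELECT
  creation_time,
  user_email,
  total_bytes_processed,
  query
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_id = 'bquxjob_1234_example'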
You need to Export Billing Data to BigQuery.
Tools for monitoring, analyzing and optimizing cost have become an important part of managing development. Billing export to BigQuery enables you to export your daily usage and cost estimates automatically throughout the day to a BigQuery dataset you specify.
You can also export the data to a CSV or JSON file. However, if you use regular file export, you should be aware that regular file export captures a smaller dataset than export to BigQuery. For more information about regular file export and the data it captures, see Export Billing Data to a File.
After you enable BigQuery export, it might take a few hours to start seeing your data. Billing data automatically exports your data to BigQuery in regular intervals, but the frequency of updates in BigQuery varies depending on the services you're using. Note that BigQuery loads are ACID compliant, so if you query the BigQuery billing export dataset while data is being loaded into it, you will not encounter partially loaded data.
Follow the step by step guide: How to enable billing export to BigQuery
https://cloud.google.com/billing/docs/how-to/export-data-bigquery
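Once the export is running, the BigQuery charges can be broken down directly from the export table; a minimal sketch, where the dataset name and the billing-account suffix in the table name are placeholders:
SELECT
  DATE(usage_start_time) AS usage_date,
  sku.description AS sku_description,
  SUM(cost) AS total_cost
FROM
  `my_project.billing_export.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE
  service.description = 'BigQuery'
GROUP BY usage_date, sku_description
ORDER BY usage_date, total_cost DESC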