BigQuery and Data Studio usage (cost) - google-cloud-platform

I wanted to view the BigQuery costs in my project. I am downloading logs to the table according to the following function:
resource.type = "bigquery_resource"
protoPayload.methodName = "jobservice.jobcompleted"
However, when I view the data, the information about refreshing the table in Data Studio does not reflect here. This data appears during the filter:
protoPayload.serviceName = "bigquerybiengine.googleapis.com"
However, at this point there is no use, only information about the access to the data range. How can I read the data consumption data when refreshing reports in Data Studio?

To analyze the cost of Data Studio report and query costs you can use Cloud Audit Log for Bigquery, making use of event data and export it to BigQuery to analyze it.
Create a sink from cloud logging, this will output all BigQuery query_job_completed log events from Cloud Audit Logging service into your Bigquery table.
When you have the BigQuery event data flowing into your dataset, you can create a view and query it. You will get totalBilledBytes per query which can be used to calculate the cost of queries.
You can refer to this documentation for further information.

Related

Does GCP Data Loss Prevention support publishing its results to Data Catalog for External Big Query Tables

I was trying to auto tag InfoTypes like PhoneNumber, EmailId on the data in GCS Bucket and Big Query External tables using Data Loss Prevention Tool in GCP so that i can have those tags at Data Catalog and subsequently in Dataplex. Now the problems are that
If i select any sources other than Big Query table (GCS, Data Store etc.), the option to publish GCP DLP inspection results to Data Catalog is disabled.
If i select Big Query table, Data Catalog publish option is enabled but when i try to run the inspection job, its errors out saying , "External tables are not supported for inspection". Surprisingly it supports only internal big query tables.
The question is that, is my understanding of GCP DLP - Data Catalog integration works only for Internal Big Query tables correct? Am doing something wrong here, GCP documentation doesn't mention these things either!
Also while configuring the Inspection Job from the DLP UI Console, i had to provide Big Query tableid mandatorily, is there a way i can run DLP inspection job against a BQ Dataset or a bunch of tables?
Regarding Data Loss Prevention Services in Google Cloud, your understanding is correct, data cannot be ex-filtrated by copying to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.
Now, about how to run a DLP Inspection job against a BQ bunch of tables, there are 2 ways to do it:
Programmatically fetch the Big Query tables, query the table and call DLP Streaming Content API. It operates in real time, but it is expensive. Here I share the concept in a Java example code:
url =
String.format(
"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
projectId);
DataSource ds = new com.simba.googlebigquery.jdbc42.DataSource();
ds.setURL(url);
conn = ds.getConnection();
DatabaseMetaData databaseMetadata = conn.getMetaData();
ResultSet tablesResultSet =
databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
while (tablesResultSet.next()) {
// Query your Table Data and call DLP Streaming API
}
Here is a tutorial for this method.
Programmatically fetch the Big Query tables, and then trigger one Inspect Job for each table. It is the cheapest method, but you need to consider that it's a batch operation, so it doesn’t execute in real time. Here is the concept in a Python example:
client = bigquery.Client()
datasets = list(client.list_datasets(project=project_id))
if datasets:
for dataset in datasets:
tables = client.list_tables(dataset.dataset_id)
for table in tables:
# Create Inspect Job for table.table_id
Use this thread for more reference on running a DLP Inspection job against a BQ bunch of tables.

BigQuery summary

Where/how can I easily see how many BigQuery analysis queries have been run per month. How about storage usage overall/changes-over-time(monthly)?
I've had a quick look at "Monitoring > Dashboards > Bigquery". Is that the best place to explore? It only seems to go back to early October - was that when it was released or does it only display the last X weeks of data? Trying metrics explorer for Queries Count (Metric:bigquery.googleapis.com/job/num_in_flight) was giving me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08? Odd as I expect to see a few hundred queries run per week.
Context: It would be good to have a high level summary of BigQuery, as the the months progress, to give an idea to the wider organisation and management on the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Setup logs export (this is using the Legacy Logs Viewer)
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Create sink
Once Logs is enabled, all queries to be executed will store data usage logs in table "cloudaudit_googleapis_com_data_access_YYYYMMDD" under the BigQuery dataset you selected in your sink.
Created cloudaudit_googleapis_com_* tables
Here is a sample query to get bytes used per user
#standardSQL
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.metadataJson AS metadataJson,
CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
"$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes,
FROM
`myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
principalEmail,
SUM(totalBilledBytes) AS billed_bytes
FROM
data
WHERE
JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
Query results
NOTES:
You can only track the usage starting at the date when you set up the logs export
Table "cloudaudit_googleapis_com_data_access_YYYYMMDD" is created daily to track all logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what they provide for BigQuery by default, the only other alternative is to create your own customized carts and dashboards that satisfy your need. You can achieve that using Monitoring Query Language. Using MQL you can achieve the stuff you described in you question. Here are the links for more detailed information.
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language

How to trigger auto update Google BigQuery Dataset every time CSV upload on Google Cloud Storage

I am trying to automate the entire data loading, that means whenever I upload a file to Google Cloud storage, it automatically triggers the data to be uploaded into the BigQuery dataset. I know that there is a daily set timing update available, but I want something where it only triggers whenever the CSV file is re-uploaded.
You have 2 possibilities:
Either you react on event. I mean you can plug a function on Google Cloud Storage events. In the event message you have the file stored in GCS and you can do what you want with it, for exemple run a load job from Google Cloud Storage.
Or, do nothing! Let the file in GCS and create a BigQuery federated table to read into GCS
With this 2 solutions, your data are accessible by BigQuery. Your Datastudio graph can query BigQuery, the data are here. However.
The load job is more efficient, you can partition and clusterize your data for optimize the speed and the cost. However, you duplicate your data (from GCS) and you have to code and to run your function. Anyway, cost is very low and function very simple. For Big Data it's my recommended solution
The federated table are very useful when the quantity of data is low and for occasional access or for prototyping. You can't clusterize and partition your data and the speed is lower than data loaded into BigQuery (because the CSV parsing is performing on the fly).
So, Big Data is a wide area: do you need to transform the data before the load? can you transform them after the log? How can you link query the ones after the others? ....
Don't hesitate if you have other questions on this!

How to find unused / unaccessed objects in BigQuery

Is there a way to find unused objects ( Tables, Views etc ) within datasets in BigQuery or objects less frequently accessed ( like we can run audits in Oracle to find out the same )
Just like you can run audits in Oracle, you can enable StackDriver logging on BigQuery and run audits from StackDriver.
If you'd also like to use BigQuery syntax to query StackDriver logging, you can export StackDriver logging to BigQuery.
You can take stackdriver loggings into BigQuery again and run the Audit queries against that table
Create Stackdriver Monitoring on the same using custom metrics.
Both of these incur costs.
However, BigQuery automatically lowers cost on the data stored in tables or in partitions that have not been modified in the last 90 days.
we can write a script to find the unused bq table through the logs file. This script works in a manner to find a bq tables list on which query has been done for the last (Number of days).
with obtained bq table list will compare with a final table list which contains a total list of tables.

How to retrive Bigquery billing details from GCP console or UI?

My team is using bigquery for our product development. Other bill of Rs 5159 got generated for one days transaction.
I checked the transaction details and
BigQuery Analysis: 15.912 Tebibytes [Currency conversion: USD to INR using rate 69.155]
Is is possible to somehow find out more details about the transactions like table name, queries that were executed and exact time of execution?
BigQuery automatically sends audit logs to Stackdriver Logging and provide the ability to do aggregated analysis on logs data. You can see BigQuery schema for exported logs for details
As quick example: Query cost breakdown by identity
This query shows estimated query costs by user identity. It estimates costs based on the list price for on-demand queries in the US. This pricing may not be accurate for other locations or for customers leveraging flat-rate billing.
#standardSQL
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
FROM
`MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_data_access_YYYYMMDD`
)
SELECT
principalEmail,
FORMAT('%9.2f',5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes)/POWER(2, 40))) AS Estimated_USD_Cost
FROM
data
WHERE
jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
As of last year BigQuery provides INFORMATION_SCHEMA tables that also give access to job information via JOBS_BY_* views. The INFORMATION_SCHEMA.JOBS_BY_USER and INFORMATION_SCHEMA.JOBS_BY_PROJECT views even include the exact query alongside the processed bytes. It might not be 100% accuracte (because bytes processed != bytes billed) but it should allow you to gain a good overview over your costs, which queries triggered them an who the initiator was.
Example
SELECT
creation_time,
job_id,
project_id,
user_email,
total_bytes_processed,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
The most "efficient" way to keep an eye on the cost is using the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view as it automatically includes all projects of the organization. You need to be Organization Owner or Organization Administrator to use that view, though.
From there you can figure out which jobs were the most expensive (= get their job id) and form there drill down via JOBS_BY_PROJECT to get the exact query.
See https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/ for a more comprehensive explanation.
You need to Export Billing Data to BigQuery
Tools for monitoring, analyzing and optimizing cost have become an important part of managing development. Billing export to BigQuery enables you to export your daily usage and cost estimates automatically throughout the day to
export data to a CSV,JSON file
However, if you use regular file export, you should be aware that regular file export captures a smaller dataset than export to BigQuery. For more information about regular file export and the data it captures, see Export Billing Data to a File.
to a BigQuery dataset you specify.
After you enable BigQuery export, it might take a few hours to start seeing your data. Billing data automatically exports your data to BigQuery in regular intervals, but the frequency of updates in BigQuery varies depending on the services you're using. Note that BigQuery loads are ACID compliant, so if you query the BigQuery billing export dataset while data is being loaded into it, you will not encounter partially loaded data.
Follow the step by step guide: How to enable billing export to BigQuery
https://cloud.google.com/billing/docs/how-to/export-data-bigquery