I want to set up a weekly Google Play transfer, but it cannot be saved.
At first I set up a daily Play transfer job, and it worked. Then I tried to change the transfer frequency to weekly (every Monday 7:30) and got an error:
"This transfer config could not be saved. Please try again.
Invalid schedule [every mon 7:30]. Schedule has to be consistent with CustomScheduleGranularity [daily: true ].
I think this document shows that the transfer frequency can be changed:
https://cloud.google.com/bigquery-transfer/docs/play-transfer
Can Google Play transfer be set to weekly?
By default, the transfer is created as daily. From the same docs:
Daily, at the time the transfer is first created (default)
Try creating a brand-new weekly transfer. If that works, I would suspect a web UI bug. Here are two other options for changing your existing transfer:
BigQuery command-line tool: bq update --transfer_config
Only a very limited set of options is available, and schedule is not among the fields that can be updated.
BigQuery Data Transfer API: transferConfigs.patch. Most transfer options are updatable. An easy way to try it is with the API Explorer. See the details on the transferConfig object. The schedule field needs to be defined:
Data transfer schedule. If the data source does not support a custom
schedule, this should be empty. If it is empty, the default value for
the data source will be used. The specified times are in UTC. Examples
of valid format: 1st,3rd monday of month 15:30, every wed,fri of
jan,jun 13:15, and first sunday of quarter 00:00. See more explanation
about the format here:
https://cloud.google.com/appengine/docs/flexible/python/scheduling-jobs-with-cron-yaml#the_schedule_format
NOTE: the granularity should be at least 8 hours, or less frequent.
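A minimal sketch of that patch request in Python (the project, location, and config IDs are placeholders; schedule times are in UTC):

```python
import json

# Placeholder identifiers -- substitute your own project, location, and
# transfer config ID (visible in the config's resource name).
project_id = "my-project"
location = "us"
config_id = "1234567890"

# transferConfigs.patch is an HTTP PATCH; updateMask lists the fields to change.
url = (
    f"https://bigquerydatatransfer.googleapis.com/v1/projects/{project_id}"
    f"/locations/{location}/transferConfigs/{config_id}?updateMask=schedule"
)
body = {"schedule": "every monday 07:30"}  # UTC, cron-yaml schedule format

print(url)
print(json.dumps(body))
```

Send the body with an authenticated PATCH request (for example via the API Explorer, or curl with an OAuth token). If the data source only supports daily granularity, expect the same CustomScheduleGranularity error.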
Related
Where/how can I easily see how many BigQuery analysis queries have been run per month? And what about storage usage, both overall and changes over time (monthly)?
I've had a quick look at "Monitoring > Dashboards > Bigquery". Is that the best place to explore? It only seems to go back to early October - was that when it was released or does it only display the last X weeks of data? Trying metrics explorer for Queries Count (Metric:bigquery.googleapis.com/job/num_in_flight) was giving me a weird unlabelled y-axis, e.g. a scale of 0 to 0.08? Odd as I expect to see a few hundred queries run per week.
Context: it would be good to have a high-level summary of BigQuery as the months progress, to give the wider organisation and management an idea of the scale of usage.
You can track your bytes billed by exporting BigQuery usage logs.
Setup logs export (this is using the Legacy Logs Viewer)
Open Logging -> Logs Viewer
Click Create Sink
Enter "Sink Name"
For "Sink service" choose "BigQuery dataset"
Select your BigQuery dataset to monitor
Create sink
Once logging is enabled, every executed query will store its data-usage log in the table "cloudaudit_googleapis_com_data_access_YYYYMMDD" under the BigQuery dataset you selected in your sink.
Created cloudaudit_googleapis_com_* tables
Here is a sample query to get the bytes billed per user:
#standardSQL
WITH data as
(
SELECT
protopayload_auditlog.authenticationInfo.principalEmail as principalEmail,
protopayload_auditlog.metadataJson AS metadataJson,
CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
"$.jobChange.job.jobStats.queryStats.totalBilledBytes") AS INT64) AS totalBilledBytes
FROM
`myproject_id.training_big_query.cloudaudit_googleapis_com_data_access_*`
)
SELECT
principalEmail,
SUM(totalBilledBytes) AS billed_bytes
FROM
data
WHERE
JSON_EXTRACT_SCALAR(metadataJson, "$.jobChange.job.jobConfig.type") = "QUERY"
GROUP BY principalEmail
ORDER BY billed_bytes DESC
Query results
NOTES:
You can only track usage starting from the date you set up the logs export
A new table "cloudaudit_googleapis_com_data_access_YYYYMMDD" is created daily to hold that day's logs
I think Cloud Monitoring is the only place to create and view metrics. If you are not happy with what it provides for BigQuery by default, the only alternative is to create your own customized charts and dashboards that satisfy your needs. You can achieve that using Monitoring Query Language (MQL). Using MQL you can achieve what you described in your question. Here are links with more detailed information:
Introduction to BigQuery monitoring
Introduction to Monitoring Query Language
I want to retrieve data from BigQuery that arrives every hour, do some processing, and write the newly calculated variables to a new BigQuery table. The thing is that I've never worked with GCP before, and now I have to for my job.
I already have Python code to process the data, but it only works with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click Schedule query and then Create new scheduled query, as you can see in the image below
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
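The same scheduled query can also be created programmatically through the Data Transfer API. Below is a sketch of the transferConfigs.create request body; the dataset, table template, and SQL are illustrative placeholders, not values from the question:

```python
import json

# Sketch of a scheduled-query transfer config. All names below are
# illustrative placeholders.
transfer_config = {
    "display_name": "hourly_processing",
    "data_source_id": "scheduled_query",   # fixed ID for scheduled queries
    "destination_dataset_id": "my_dataset",
    "schedule": "every 2 hours",           # same syntax as the Custom option
    "params": {
        "query": "SELECT CURRENT_TIMESTAMP() AS run_at",
        "destination_table_name_template": "processed_{run_date}",
        "write_disposition": "WRITE_APPEND",
    },
}

print(json.dumps(transfer_config, indent=2))
```

With the google-cloud-bigquery-datatransfer client library, you would pass an equivalent TransferConfig object to create_transfer_config.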
According to Google's recommendation, when your data is in BigQuery and you want to transform it and store it back in BigQuery, it is always quicker and cheaper to do this inside BigQuery, if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, use SQL directly, you can create a User-Defined Function (UDF) in BigQuery in JavaScript.
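For example, a minimal JavaScript UDF sketch in BigQuery (the function name and logic are illustrative):

```sql
-- Illustrative temp UDF: strip the query string from a URL in JavaScript.
CREATE TEMP FUNCTION cleanUrl(url STRING)
RETURNS STRING
LANGUAGE js AS """
  return url.split('?')[0];
""";

SELECT cleanUrl('https://example.com/page?utm_source=x') AS page;
```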
EDIT
If you have no information about when the data is loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if that data arrives through Pub/Sub. If not, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.
The biggest chunk of my BigQuery billing comes from query consumption. I am trying to optimize this by understanding which datasets/tables consume the most.
I am therefore looking for a way to track my BigQuery usage, ideally something closer to real time (so that I don't have to wait a day for the final results). Best would be, for instance, how much each table/dataset consumed in the last hour.
So far I managed to find the Dashboard Monitoring but this only allows to display the queries in flight per project and the stored bytes per table, which is not what I am after.
What other solutions are there to retrieve this kind of information?
Using Stackdriver logs, you can create a sink with a Pub/Sub topic as target for real-time analysis, filtering only BigQuery logs like this:
resource.type="bigquery_resource" AND
proto_payload.method_name="jobservice.jobcompleted" AND
proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes:*
(see example queries here : https://cloud.google.com/logging/docs/view/query-library?hl=en_US#bigquery-filters)
You could create the sink on a specific project, a folder or even an organization. This will retrieve all the queries done in BigQuery in that specific project, folder or organization.
The field proto_payload.service_data.job_completed_event.job.job_statistics.total_billed_bytes will give you the number of bytes processed by the query.
Based on on-demand BigQuery pricing (as of now, $5/TB for most regions, but check for your own region), you can easily estimate the billing in real time. You could create a Dataflow job that aggregates the results in BigQuery, or simply consume the destination Pub/Sub topic with any job you want to make the pricing calculation:
jobPriceInUSD = totalBilledBytes / 1_000_000_000_000 * pricePerTB
because 1 TB = 1_000_000_000_000 B. As I said before, pricePerTB depends on the region (see https://cloud.google.com/bigquery/pricing#on_demand_pricing for the exact price). For example, as of the time of writing:
$5/TB for us-east1
$6/TB for asia-northeast1
$9/TB for southamerica-east1
Also, for each month, as of now, the 1st TB is free.
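That per-job estimate can be sketched in Python (the $5/TB default is illustrative; check the pricing page for your region, and note the monthly free TB is ignored here):

```python
def job_price_usd(total_billed_bytes: int, price_per_tb: float = 5.0) -> float:
    """Estimate the on-demand cost of a single query job.

    BigQuery bills per decimal terabyte: 1 TB = 10**12 bytes.
    The first TB per month is free; that is ignored in this sketch.
    """
    return total_billed_bytes / 1_000_000_000_000 * price_per_tb

# A job that billed 250 GB in a $5/TB region:
print(job_price_usd(250 * 10**9))  # → 1.25
```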
It might be easier to use the INFORMATION_SCHEMA.JOBS_BY_* views, because you don't have to set up Stackdriver Logging and can use them right away.
Example taken & modified from How to monitor query costs in Google BigQuery
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
ROUND(SUM(total_bytes_processed) / gb_divisor,2) as bytes_processed_in_gb,
ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor,4) as cost_in_dollar,
user_email
FROM (
(SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
UNION ALL
(SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
GROUP BY
user_email
Some caveats:
you need to UNION ALL all of the projects that you use explicitly
JOBS_BY_USER did not work for me on my private account (presumably because my login email is @googlemail.com while BigQuery stores it as @gmail.com)
the WHERE condition needs to be adjusted for your billing period (instead of the last 30 days)
doesn't provide the "bytes billed" information, so we need to determine those based on the cache usage
doesn't include the "if less than 10MB use 10MB" condition
data is only retained for the past 180 days
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5; reflects only US costs - other regions might have different costs - see https://cloud.google.com/bigquery/pricing#on_demand_pricing
you can only query one region at a time
We have a campaign management system. We create and run campaigns on various channels. When a user clicks or accesses any of the ads (as part of a campaign), the system generates a log. Our system is hosted on GCP, and using the 'Exports' feature the logs are exported to BigQuery.
In BigQuery, the log table is partitioned on the 'timestamp' field (the time when the log is generated). We understand that BigQuery stores dates in the UTC timezone, so partitions are also based on UTC time.
Using this log table, we need to generate reports per day, such as the number of impressions per day per campaign, and we need to show these reports in ETC time.
Because the BigQuery table is partitioned by UTC time, a query for an ETC day would potentially need to scan multiple partitions. Has anyone addressed this issue, or are there suggestions for optimising the storage and queries so that they take full advantage of BigQuery's partitioning?
We are planning to use GCP Data studio for Reports.
BigQuery should be smart enough to filter for the correct timezones when dealing with partitions.
For example:
SELECT MIN(datehour) time_start, MAX(datehour) time_end, ANY_VALUE(title) title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE DATE(datehour) = '2018-01-03'
5.0s elapsed, 4.56 GB processed
For this query we processed 4.56 GB, all within the 2018-01-03 partition. What if we want to adjust for a day in the US? Let's change the WHERE clause to:
WHERE DATE(datehour, "America/Los_Angeles") = '2018-01-03'
4.4s elapsed, 9.04 GB processed
Now this query is automatically scanning 2 partitions, as it needs to go across days. For me this is good enough, as BigQuery is able to automatically figure this out.
But what if you wanted to permanently optimize for one timezone? You could create a generated, shifted DATE column and use that one to PARTITION BY.
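For instance, the shifted date column could be materialized at load time and used as the partitioning column (the table and column names below are illustrative):

```sql
-- Illustrative: repartition the logs on a timezone-shifted date column.
CREATE TABLE mydataset.logs_local
PARTITION BY local_date AS
SELECT
  *,
  DATE(timestamp, "America/Los_Angeles") AS local_date
FROM mydataset.logs_utc;
```

Queries that filter on local_date then prune to a single partition.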
I've enabled the usage export setting for my Google Cloud Compute Engine project and set a bucket as the destination storage.
It has been almost 27 hours, but the report is not there in the bucket.
I've read this doc and followed exact same steps.
I've checked the status with gcloud using the following command:
gcloud compute project-info describe
It shows the correct bucket name in usageExportLocation.
Does the storage class of the bucket matter? I have a Coldline storage class bucket.
It has started working.
If you requested the usage export report on the 1st day of the month, the export covering the 2nd day's usage will arrive on the 3rd day. So you do not get a given day's report on that same day; you have to wait one more day.
But of course the report for the requested day will be included in the total usage report for the current month, as reports are provided in the following two formats:
Daily Report
Current Month Report
As stated below in the help center article about 'Setting up usage export':
" When you first enable the usage export feature, the first report will
be sent the following day, detailing the previous day's usage.
Afterwards, you will receive reports in 24 hour intervals."
This is an expected behavior, when you first enable the usage export feature.