We need a way to automatically create a Pub/Sub trigger on new compute images (preferably triggered on a specific image family). Alternatively, we know that Pub/Sub on GCS buckets, but we have not found a way to automate transferring images to a GCS bucket.
For some background: we are automating image baking through packer and we need this piece to trigger a terraform creation. We know that a cron job can be created to simply poll images when they are created, but we are wondering if there is already support for such a trigger in GCP.
You can have a Stackdriver Logging export sink that publishes to Pub/Sub and is triggered by a specific filter (docs). For example:
resource.type="gce_image"
jsonPayload.event_subtype="compute.images.insert"
jsonPayload.event_type="GCE_OPERATION_DONE"
To trigger it only for a specific image family, you can use the other filter below. Note that protoPayload.request.family is only present when the API request is received, not when it is actually fulfilled, so you may need to add some delay in your processing function.
resource.type="gce_image"
protoPayload.request."#type"="type.googleapis.com/compute.images.insert"
protoPayload.request.family="FAMILY-NAME"
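As a rough sketch (not part of the original answer), the subscriber could be a Pub/Sub-triggered Cloud Function in Python that re-checks the family itself; this assumes the second filter above, where the sink publishes each matching LogEntry as JSON in the Pub/Sub message data:
import base64
import json

TARGET_FAMILY = "FAMILY-NAME"  # assumption: in practice, pass this in as an environment variable

def on_image_insert(event, context):
    # A logging sink publishes the matching LogEntry as JSON in the message data.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    request = entry.get("protoPayload", {}).get("request", {})
    if request.get("family") == TARGET_FAMILY:
        # Kick off the Terraform run here (e.g. call your CI system's API).
        print("New image in family {}: {}".format(TARGET_FAMILY, request.get("name")))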
Another solution would be to create a Cloud Function with --trigger-topic={your pub sub topic} and then filter only the images you want to act on, based on environment variables set on the Cloud Function.
Pseudo code:
1. Create a Pub/Sub topic for images being inserted in GCR (Container Registry):
gcloud pubsub topics create projects/<project_id>/topics/gcr
This topic will now receive messages for all images being inserted/modified/deleted in the registry.
2. Create a Cloud Function with the following function signature:
// contents of index.js
// Use the Storage class from the Google Cloud Node.js client to work with buckets
// https://www.npmjs.com/package/@google-cloud/storage
const { Storage } = require('@google-cloud/storage');

function moveToStorageBucket(pubSubEvents, context, callback) {
  /* This is how the Pub/Sub message arrives from GCR:
     {"data":{"@type":"... .v1.PubsubMessage", "attribute":null, "data": "<base64 encoded>"},
      "context":{...other details}}
     The base64-encoded data has this format:
     {"action":"INSERT","digest":"<image name>","tag":"<tag name>"}
  */
  const data = JSON.parse(Buffer.from(pubSubEvents.data, 'base64').toString());
  // Get the image name from the environment variable passed at deploy time
  const IMAGE_NAME = process.env.IMAGE_NAME;
  if (data.digest.indexOf(IMAGE_NAME) !== -1) {
    // your action here...
  }
  callback();
}

module.exports.moveToStorageBucket = moveToStorageBucket;
3. Deploy the Cloud Function:
gcloud functions deploy <function_name> --region <region> --runtime=nodejs8 --trigger-topic=<topic created> --entry-point=moveToStorageBucket --set-env-vars=^--^IMAGE_NAME=<your image name>
Hope that helps
So I have a very simple Python script that writes a txt file to my Google Storage bucket.
I just want to set this job to run each hour, i.e. not based on a trigger. It seems that when using the SDK, the function needs a --trigger flag, but I only want it to be "triggered" by the scheduler.
Is that possible?
You can create a Cloud Function with Pub/Sub trigger and then create a Cloud Scheduler job targeting the topic which triggers the function.
I did it by following these steps:
Create a Cloud Function with Pub/Sub trigger
Select your topic or create a new one
This is the default code I am using:
exports.helloPubSub = (event, context) => {
  const message = event.data
    ? Buffer.from(event.data, 'base64').toString()
    : 'Hello, World';
  console.log(message);
};
Create a Cloud Scheduler job targeting the same Pub/Sub topic
Check that it is working.
I tried it with the frequency * * * * * (every minute) and it works for me; I can see the logs from the Cloud Function.
Currently, a Cloud Function needs a trigger in order to execute; once an execution finishes, the only way to run it again is through its trigger.
You can also follow the same steps I described on this page, where you can find some images for further help.
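Since the script in the question is Python, a minimal sketch of an equivalent Pub/Sub-triggered function in Python (the bucket and object names here are illustrative) could look like this:
from google.cloud import storage

def write_txt(event, context):
    # Runs each time Cloud Scheduler publishes to the topic this function subscribes to.
    client = storage.Client()
    bucket = client.bucket("my-example-bucket")  # illustrative bucket name
    blob = bucket.blob("hourly-output.txt")      # illustrative object name
    blob.upload_from_string("written by the scheduled function\n")
The function would then be deployed with --trigger-topic pointing at the same topic the Cloud Scheduler job publishes to.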
I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn be triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file
by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish for
the job to be executed with. For instance, setting it to 0 1 * * 0
would run the job once a week at 1 AM every Sunday morning. The
crontab tool is pretty useful when it comes to experimenting
with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud
Function's URL (it can be found by selecting the Cloud Function and
navigating to the Trigger tab), and as HTTP method choose GET.
Once created, and by pressing the RUN NOW button, you can test how the export
behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets and buckets for each
execution, while essentially employing the same Cloud Function, you can use the HTTP POST method
instead and configure a Body containing those parameters as data, which
would be passed on to the Cloud Function. That would imply
some small changes in its code, as sketched below.
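As a rough sketch of those changes (the parameter names here are illustrative, not part of the original answer), the function could read the values from the request body and fall back to defaults:
def hello_world(request):
    # Parameters sent by Cloud Scheduler as a JSON body in the POST request.
    # The field names are illustrative; they only need to match the Body
    # configured in the Scheduler job.
    params = request.get_json(silent=True) or {}
    project_name = params.get("project", "YOUR_PROJECT_ID")
    bucket_name = params.get("bucket", "YOUR_BUCKET")
    dataset_name = params.get("dataset", "YOUR_DATASET")
    table_name = params.get("table", "YOUR_TABLE")
    # ...the rest of the export code stays the same as above.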
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
OPTIONS (
uri = 'gs://bucket/folder/*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ';')
AS (
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
);
This could just as well be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
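If you prefer to create that scheduled query programmatically rather than in the console, a sketch using the BigQuery Data Transfer Service Python client might look like this (the display name, schedule and project are illustrative, and it assumes the google-cloud-bigquery-datatransfer package is installed):
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Schedule the EXPORT DATA statement above as a scheduled query.
transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily-export-to-gcs",  # illustrative name
    data_source_id="scheduled_query",
    params={"query": "EXPORT DATA OPTIONS (...) AS (SELECT ...)"},  # your EXPORT DATA statement
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),  # illustrative project ID
    transfer_config=transfer_config,
)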
Hopefully this is useful for other peeps visiting this question if not for OP :)
There is an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But, when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. That way, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification, as sketched below. The notification also carries a lot of information about the scheduled query, which is useful if you want to perform more checks or to generalize the function.
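As a minimal sketch of that check (assuming the notification payload is the scheduled query's TransferRun resource serialized as JSON, with state and errorStatus fields; verify this against your own messages):
import base64
import json

def on_scheduled_query_done(event, context):
    # Assumption: the scheduled-query notification carries the TransferRun
    # resource as JSON in the Pub/Sub message data.
    run = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if run.get("state") != "SUCCEEDED" or run.get("errorStatus"):
        print("Scheduled query did not succeed, skipping export: {}".format(run.get("errorStatus")))
        return
    # ...start the extract job to Cloud Storage here (same code as in Maxim's answer).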
One more point about the SFTP transfer: I open sourced a project for querying BigQuery, building a CSV file and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.
I need to get the data from a third-party API and ingest it into Google BigQuery. I also need to automate this process through Google services so it runs periodically.
I am trying to use Cloud Functions, but they need a trigger. I have also read about App Engine, but I believe it is not a good fit for a single function that just makes pull requests to the API.
Another doubt is: do I need to load the data into Cloud Storage first, or can I load it straight into BigQuery? Should I use Dataflow, and does it need any particular configuration?
from google.cloud import storage
import requests

def upload_blob(bucket_name, request_url, destination_blob_name):
    """
    Fetches the API response and uploads it to the bucket.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    # Fetch the data from the third-party API and write it to the blob
    request_json = requests.get(request_url['url'])
    blob.upload_from_string(request_json.text, content_type='application/json')
    print('File {} uploaded to {}.'.format(destination_blob_name, bucket_name))

def func_data(request_url):
    BUCKET_NAME = 'dataprep-staging'
    BLOB_NAME = 'any_name'
    BLOB_STR = '{"blob": "some json"}'  # placeholder, currently unused
    upload_blob(BUCKET_NAME, request_url, BLOB_NAME)
    return 'Success!'
I expect advice about the architecture (Google services) I should use to create this pipeline. For example: use Cloud Functions to get the data from the API, then schedule a job using service 'X' to load the data into storage, and finally pull the data from storage.
You can use a Cloud Function. Create an HTTP-triggered function and call it periodically with Cloud Scheduler.
By the way, you can also call an HTTP endpoint of App Engine or Cloud Run.
About storage, the answer is no. If the API result is not too large for the function's allowed memory, you can write it to the /tmp directory and load the data into BigQuery from that file, as sketched below. You can size your function up to 2 GB of memory if needed.
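A minimal sketch of that approach, assuming the API returns a JSON array and using illustrative names for the URL, dataset and table:
import json
import requests
from google.cloud import bigquery

def ingest_api(request):
    # Illustrative values; replace with your own API URL and table.
    api_url = "https://api.example.com/data"
    table_id = "my_project.my_dataset.my_table"

    rows = requests.get(api_url).json()  # assumption: the API returns a JSON array of records

    # /tmp is the writable directory in Cloud Functions; write newline-delimited JSON there.
    tmp_path = "/tmp/data.json"
    with open(tmp_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    with open(tmp_path, "rb") as f:
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    return "Loaded {} rows into {}".format(len(rows), table_id)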
I need to automate a process to extract data from Google BigQuery and export it to a CSV file on an external server outside of GCP.
While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything within GCP to avoid possible problems.
To extract the table to a CSV file in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions.
If you want to fully automate this process, I would do the following:
Create a Cloud Function to handle the export:
This is the most lightweight solution, as Cloud Functions are serverless and provide the flexibility to implement code with the Client Libraries. See the quickstart; I recommend using the console to create the function to start with.
In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function's URL is called, it will run the code inside it.
An example of Cloud Function code in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
requirements.txt
google-cloud-bigquery
Note that the job will run asynchronously in the background; you will receive a response with the job ID, which you can use to check the state of the export job in Cloud Shell by running:
bq show -j <job_id>
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.
As a target, choose HTTP, in the URL put the Cloud Function HTTP URL (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and as HTTP method choose GET.
Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
Synchronize your external server and the bucket:
Up until now you have only scheduled exports to run every 24 hours. Now, to synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports, let's say to the my_exports folder, you can run this on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically on your server, you could create a standard cron job in the crontab of your external server, set to run each day as well, just a few hours later than the BigQuery export, to ensure that the export has completed.
Extra:
I have hard-coded most of the variables in the Cloud Function to be always the same. However, you can send parameters to the function if you do a POST request instead of a GET request, sending the parameters as data in the body.
You would have to change the Cloud Scheduler job to send a POST request to the Cloud Function's HTTP URL, and in the same place you can set the body to send the parameters for the table, dataset and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets.
Referring to item: Watching for new files matching a filepattern in Apache Beam
Can you use this for simple use cases? My use case is: a user uploads data to Cloud Storage -> Pipeline (process CSV to JSON) -> BigQuery. I know Cloud Storage is a bounded collection, so it represents a batch Dataflow job.
What I would like to do is keep the pipeline running in streaming mode, so that as soon as a file is uploaded to Cloud Storage it is processed through the pipeline. Is this possible with watchForNewFiles?
I wrote my code as follows:
p.apply(TextIO.read().from("<bucketname>")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
None of the contents are being forwarded to BigQuery, but the pipeline shows that it is streaming.
You may use Google Cloud Storage triggers here:
https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
These triggers use Cloud Functions, similar to Cloud Pub/Sub, and fire on objects when they are created, deleted, archived, or have their metadata changed.
These events are sent using Pub/Sub notifications from Cloud Storage, but pay attention not to set too many functions on the same bucket, as there are some notification limits.
Also, at the end of the document there is a link to a sample implementation.
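As a rough sketch of what such a storage-triggered function can look like in Python (the pipeline kickoff itself is only hinted at, and the names are illustrative):
def on_file_uploaded(event, context):
    # Background Cloud Function fired on the object finalize event;
    # `event` contains the metadata of the uploaded object.
    bucket = event["bucket"]
    name = event["name"]
    if name.endswith(".csv"):
        # Kick off the CSV-to-JSON processing / Dataflow job here.
        print("New CSV uploaded: gs://{}/{}".format(bucket, name))
Deploying the function with the object finalize event on the relevant bucket, as in the linked sample, makes it run on each new upload.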