How to know if a zero-byte video was uploaded to a GCP bucket? - google-cloud-platform

As the title says, I need to fetch the size of the video/object I just uploaded to the bucket.
Every few seconds, an object of the form {id}/video1.mp4 is uploaded to my bucket.
I want to make use of Google Cloud Storage triggers to alert me when a 0-byte video is added. Can someone please suggest how to access the size of the added object?

Farhan,
Assuming you know the basics of Cloud Functions, you can create a Cloud Function trigger that runs a script every time you create/finalize an object in a selected bucket.
The link you posted contains the tutorial and the following Python script.
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed.
    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    print('Metageneration: {}'.format(data['metageneration']))
    print('Created: {}'.format(data['timeCreated']))
    print('Updated: {}'.format(data['updated']))
In this example, we see that data has multiple items such as name, timeCreated, etc.
What this example doesn't show, however, is that data has another item: the size, listed as data['size'].
So now we have a Cloud Function that gets the file name and file size of whatever is uploaded, as soon as it's uploaded. All we have to do now is add an if statement that does "something" when the file size is 0. Note that data['size'] arrives as a string, so cast it to an integer before comparing. In Python it will look something like this (this is just the gist of it):
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed.
    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    print('File: {}'.format(data['name']))
    print('Size: {}'.format(data['size']))
    # data['size'] is delivered as a string, so cast it before comparing
    size = int(data['size'])
    if size == 0:
        print("its 0!")
    else:
        print("its not 0!")
Hope this helps!

Related

Sending message to HTTP Google Cloud Function

I would like to send a message to an HTTP triggered Google Cloud Function. Specifically I want to tell the function when a file version has changed so that the function loads the new version of the file in memory.
I thought about updating an environment variable as a way of sending that message but it is not so straightforward to run an update-env-vars since this needs to be done in the context of the function's project.
I also thought of using a database, which sounds like too much for a single variable, or a simple text file in storage holding the current version, which sounds like too little. Any other ideas?
According to the conversation in the comments section, I believe the best way to achieve what you are looking for is a GCS notification that triggers Pub/Sub.
gsutil notification create -t TOPIC_NAME -f json gs://BUCKET_NAME
Pub/Sub gets notified based on event types, and which events matter will depend on what you consider a new version of the file (a metadata change? a new blob being created?).
Basically, you can pass the -e flag in the command above, which indicates the event type:
OBJECT_FINALIZE: Sent when a new object (or a new generation of an existing object) is successfully created in the bucket. This includes copying or rewriting an existing object. A failed upload does not trigger this event.
OBJECT_METADATA_UPDATE: Sent when the metadata of an existing object changes.
That means any file upload or metadata change in GCS will trigger Pub/Sub, which in turn triggers your Cloud Function. Here is an example function that handles the message from Pub/Sub:
def hello_pubsub(event, context):
    import base64
    print("""This Function was triggered by messageId {} published at {} to {}
    """.format(context.event_id, context.timestamp, context.resource["name"]))
    if 'data' in event:
        name = base64.b64decode(event['data']).decode('utf-8')
    else:
        name = 'World'
    print('Hello {}!'.format(name))
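For this particular use case, the object name and generation are more useful than the raw payload. GCS notifications attach them as message attributes, so a handler along the following lines could detect a new version of a tracked file. This is a sketch under my own assumptions (the watched object path is a placeholder), not code from the original answer:
import base64
import json

WATCHED_OBJECT = 'config/file-version.json'  # hypothetical object to track

def on_gcs_notification(event, context):
    """Reacts when a new generation of the watched object is finalized (sketch)."""
    attrs = event.get('attributes', {})
    if attrs.get('eventType') != 'OBJECT_FINALIZE':
        return
    if attrs.get('objectId') != WATCHED_OBJECT:
        return
    # with -f json, the message data holds the full object resource as JSON
    resource = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    print('New version of {} (generation {})'.format(
        resource.get('name'), resource.get('generation')))
    # ...notify or reload the new version here...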
Documents for reference:
https://cloud.google.com/storage/docs/pubsub-notifications
https://cloud.google.com/functions/docs/calling/pubsub#functions_calling_pubsub-python

Is there a way to extract the cost of a Google Cloud Function execution?

When we deploy and call a Google Cloud Function with a Pub/Sub trigger, for instance, we can receive the data and the context in Python as follows:
def hello_pubsub(event, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
        event (dict): The dictionary with data specific to this type of
            event. The `data` field contains the PubsubMessage message. The
            `attributes` field will contain custom attributes if there are any.
        context (google.cloud.functions.Context): The Cloud Functions event
            metadata. The `event_id` field contains the Pub/Sub message ID. The
            `timestamp` field contains the publish time.
    """
    import base64
    print("""This Function was triggered by messageId {} published at {}
    """.format(context.event_id, context.timestamp))
    if 'data' in event:
        name = base64.b64decode(event['data']).decode('utf-8')
    else:
        name = 'World'
    print('Hello {}!'.format(name))
Is there a possibility to use the context.event_id or context to determine the total cost at the end of the execution?
The billing for Cloud Functions is tied to the time spent for execution and the machine type you are using. This can be seen in their documentation.
You would be better off checking the Stackdriver logs for the time the function took to execute, and using that as a basis for a billing approximation. I say approximation because even with the log timestamps, there may be a bit of discrepancy between your results and Google's billing at the end of the month.
Additionally, you would need to have an estimation of how many times you expect the function to be called in order to have a better appreciation of the total expenses to be expected for the month.
Hope you find this useful.
Nowadays you can view this in Metrics Explorer within the Monitoring service on the Google Cloud console, which gives a pretty accurate estimate. You can use Metrics Explorer to construct a query that returns the data you are looking for.
Programmatically, you can get the same information using the Monitoring API: construct and test the correct API call there, passing the query in the request body.
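As a rough illustration of the programmatic route, a minimal sketch with the google-cloud-monitoring client could look like the following. It assumes the Cloud Functions metric cloudfunctions.googleapis.com/function/execution_times and a placeholder project ID; treat it as a starting point, not the exact query from the console:
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder: replace with your project ID

def fetch_execution_times(hours=1):
    """Prints the mean execution time per function over the last `hours`."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - hours * 3600}}
    )
    series = client.list_time_series(
        request={
            "name": "projects/{}".format(PROJECT_ID),
            "filter": 'metric.type = "cloudfunctions.googleapis.com/function/execution_times"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for ts in series:
        fn = ts.resource.labels.get("function_name", "unknown")
        for point in ts.points:
            # execution_times is a distribution metric reported in nanoseconds
            print("{}: mean {:.1f} ms".format(fn, point.value.distribution_value.mean / 1e6))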

Google Cloud Pub/Sub Trigger on Google Images

We need a way to automatically create a Pub/Sub trigger on new Compute Engine images (preferably triggered on a specific image family). Alternatively, we know Pub/Sub can be set up on GCS buckets, but we have not found a way to automate transferring images to a GCS bucket.
For some background: we are automating image baking through packer and we need this piece to trigger a terraform creation. We know that a cron job can be created to simply poll images when they are created, but we are wondering if there is already support for such a trigger in GCP.
You can have a Stackdriver Logging export sink that publishes to Pub/Sub and is triggered by a specific filter (docs). For example:
resource.type="gce_image"
jsonPayload.event_subtype="compute.images.insert"
jsonPayload.event_type="GCE_OPERATION_DONE"
To trigger it only for a specific family, you can use the other filter below, but note that protoPayload.request.family is only present when the API request is received, not when it is actually fulfilled (so you may need to add some delay in your processing function):
resource.type="gce_image"
protoPayload.request."#type"="type.googleapis.com/compute.images.insert"
protoPayload.request.family="FAMILY-NAME"
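Once the sink publishes to the topic, the consuming Cloud Function receives each exported LogEntry as JSON in the Pub/Sub message data. Here is a minimal sketch of such a consumer; the family value and function name are placeholders of mine, mirroring the filter above rather than anything from the original answer:
import base64
import json

IMAGE_FAMILY = "my-image-family"  # placeholder for the family you care about

def on_image_inserted(event, context):
    """Triggered by the log-export topic when a GCE image insert is logged (sketch)."""
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    request = entry.get("protoPayload", {}).get("request", {})
    if request.get("family") != IMAGE_FAMILY:
        return  # not an image from the family we track
    print("New image in family {}: {}".format(IMAGE_FAMILY, request.get("name")))
    # ...kick off the terraform run here...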
Another solution would be to create a Cloud Function with --trigger-topic={your pub sub topic} and then filter only the images that you want to act on, based on some environment variables on the Cloud Function.
Pseudo code:
1. Create a Pub/Sub topic for images being inserted in GCR:
gcloud pubsub topics create projects/<project_id>/topics/gcr
This will now publish messages for all images being inserted/modified/deleted in the repo.
2. Create a Cloud Function with the following signature:
// contents of index.js
// use the Storage class from the google-cloud Node.js API to work with storage
// https://www.npmjs.com/package/@google-cloud/storage
const Storage = require('@google-cloud/storage').Storage;

function moveToStorageBucket(pubSubEvents, context, callback) {
  /* this is how the pubsub message comes from GCR:
     {"data": {"@type": "... .v1.PubsubMessage", "attributes": null, "data": "<base64 encoded>"},
      "context": {..other details}}
     the base64-encoded data is in this format:
     {"action": "INSERT", "digest": "<image name>", "tag": "<tag name>"}
  */
  const data = JSON.parse(Buffer.from(pubSubEvents.data, 'base64').toString());
  // get the image name from the environment variable passed at deploy time
  const IMAGE_NAME = process.env.IMAGE_NAME;
  if (data.digest.indexOf(IMAGE_NAME) !== -1) {
    // your action here...
  }
  callback(); // signal completion of the background function
}
module.exports.moveToStorageBucket = moveToStorageBucket;
3. Deploy the Cloud Function:
gcloud functions deploy <function_name> --region <region> --runtime=nodejs8 --trigger-topic=<topic created> --entry-point=moveToStorageBucket --set-env-vars=^--^IMAGE_NAME=<your image name>
Hope that helps

How to make requests to third-party APIs and load the results periodically into Google BigQuery? What Google services should I use?

I need to get data from a third-party API and ingest it into Google BigQuery. Perhaps I need to automate this process through Google services to run it periodically.
I am trying to use Cloud Functions, but it needs a trigger. I have also read about App Engine, but I believe it is not suitable for a single function that just pulls data.
Another doubt: do I need to load the data into Cloud Storage, or can I load it straight into BigQuery? Should I use Dataflow and do any configuration there?
import requests
from google.cloud import storage

def upload_blob(bucket_name, request_url, destination_blob_name):
    """
    Uploads a file to the bucket.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    request_json = requests.get(request_url['url'])
    # upload the API response body as the blob's contents
    blob.upload_from_string(request_json.text, content_type='application/json')
    print('File {} uploaded to {}.'.format(
        destination_blob_name,
        bucket_name))

def func_data(request_url):
    BUCKET_NAME = 'dataprep-staging'
    BLOB_NAME = 'any_name'
    BLOB_STR = '{"blob": "some json"}'  # unused here
    upload_blob(BUCKET_NAME, request_url, BLOB_NAME)
    return 'Success!'
I expect advice about the architecture (Google services) that I should use for creating this pipeline. For example: use Cloud Functions to get the data from the API, then schedule a job using service 'X' to put the data into storage, and finally pull the data from storage.
You can use Cloud Functions. Create an HTTP-triggered function and call it periodically with Cloud Scheduler.
By the way, you can also call an HTTP endpoint of App Engine or Cloud Run.
About storage, the answer is no. If the API result is not too large for the function's allowed memory, you can write it to the /tmp directory and load the data into BigQuery from that file. You can size your function up to 2 GB of memory if needed.
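As a rough illustration of that pattern, here is a sketch of an HTTP-triggered function that pulls JSON from an API and loads it straight into BigQuery, skipping Cloud Storage entirely. The API URL and table ID are placeholders of mine, so adapt them to your case:
import requests
from google.cloud import bigquery

API_URL = "https://example.com/api/data"        # placeholder third-party API
TABLE_ID = "my-project.my_dataset.api_results"  # placeholder destination table

def ingest_api_data(request):
    """HTTP-triggered function: fetches rows from the API and loads them into BigQuery."""
    rows = requests.get(API_URL, timeout=60).json()  # expects a JSON array of records
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_json(rows, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to finish
    return 'Loaded {} rows into {}'.format(len(rows), TABLE_ID)
Cloud Scheduler can then hit the function's HTTP URL on whatever schedule you need.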

How to use watchForNewFiles in Dataflow with a GCS source bucket?

Referring to item: Watching for new files matching a filepattern in Apache Beam
Can you use this for simple use cases? My use case is: a user uploads data to Cloud Storage -> pipeline (process CSV to JSON) -> BigQuery. I know Cloud Storage is a bounded collection, so it represents batch Dataflow.
What I would like to do is keep the pipeline running in streaming mode so that, as soon as a file is uploaded to Cloud Storage, it is processed through the pipeline. Is this possible with watchForNewFiles?
I wrote my code as follows:
p.apply(TextIO.read().from("<bucketname>")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
None of the contents are being forwarded to BigQuery, but the pipeline shows that it is streaming.
You may use Google Cloud Storage triggers here:
https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
These triggers use Cloud Functions, similar to Cloud Pub/Sub, and fire on objects when they are created, deleted, archived, or have their metadata changed.
These events are sent using Pub/Sub notifications from Cloud Storage, but pay attention not to set many functions on the same bucket, as there are notification limits.
Also, at the end of the document there is a link to a sample implementation.
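For the CSV-to-BigQuery part specifically, if the transformation is simple enough, a storage-triggered function can replace the streaming pipeline altogether. A minimal sketch, assuming the uploaded files are plain CSV and a placeholder table ID of mine:
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.uploads"  # placeholder destination table

def load_csv_to_bigquery(data, context):
    """Triggered on object finalize: loads the new CSV straight into BigQuery (sketch)."""
    if not data["name"].endswith(".csv"):
        return  # ignore non-CSV uploads
    uri = "gs://{}/{}".format(data["bucket"], data["name"])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to complete
    print("Loaded {} into {}".format(uri, TABLE_ID))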