Transfer files from GCS bucket to WebDAV server - google-cloud-platform

I am new to GCP and was wondering whether what I am trying to achieve is possible.
I have a Dataflow job which creates a CSV file on a daily basis and stores it in a GCS bucket. This file is overwritten every day.
What I want to do is, when the file is created or overwritten, automatically transfer it to a WebDAV server. I need to schedule this process on a daily basis.
Is this possible to set up within GCS?
Any advice is appreciated.
I have been looking at cloud file transfers and Data Transfer, but they don't seem to be the right fit.

You can use Cloud Functions to trigger a transfer of the file whenever it is created or overwritten in the GCS bucket. According to the Cloud Storage triggers documentation:
In Cloud Functions, a Cloud Storage trigger enables a function to be called in response to changes in Cloud Storage. When you specify a Cloud Storage trigger for a function, you choose an event type and specify a Cloud Storage bucket. Your function will be called whenever a change occurs on an object (file) within the specified bucket.
object.finalize - Triggered when a new object is created, or an existing object is overwritten and a new generation of that object is created.
Check this Cloud Storage function tutorial for an example of writing, deploying, and calling a function with a Cloud Storage trigger.
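As a rough sketch of that approach (the function name, environment variables, and WebDAV path layout below are assumptions, not from the question), a 1st-gen background Cloud Function triggered by google.storage.object.finalize could download the finalized CSV and PUT it to the WebDAV server:

```python
import os

import requests
from google.cloud import storage

# Assumed configuration, supplied as environment variables at deploy time.
WEBDAV_URL = os.environ["WEBDAV_URL"]            # e.g. https://webdav.example.com/exports
WEBDAV_USER = os.environ["WEBDAV_USER"]
WEBDAV_PASSWORD = os.environ["WEBDAV_PASSWORD"]


def transfer_to_webdav(event, context):
    """Background function for the google.storage.object.finalize event."""
    bucket_name = event["bucket"]
    object_name = event["name"]

    # The finalize event only fires once the object is fully written,
    # so the download below always sees the complete daily CSV.
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    data = blob.download_as_bytes()

    # WebDAV is plain HTTP: an authenticated PUT creates or overwrites the file.
    response = requests.put(
        f"{WEBDAV_URL}/{object_name}",
        data=data,
        auth=(WEBDAV_USER, WEBDAV_PASSWORD),
        timeout=120,
    )
    response.raise_for_status()
```

Deployed with a finalize trigger on the bucket (for example, gcloud functions deploy transfer_to_webdav --runtime python310 --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize), this runs every time the daily file is created or overwritten, so no separate schedule is needed.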

Related

Get error/notification when cloud function fails to upload file into BQ

I created a Cloud Function that works like this: whenever a new file is uploaded into the Google Storage bucket, the function gets triggered and loads that file into BigQuery.
The function is running smoothly, but I have a question. Another person uploads CSVs into the bucket, and I assured her the data will be pushed into BigQuery straight away. How can I get informed whenever a load fails? (She may upload hundreds of files per day.)
If I open the bucket, is there any way to tell which files were loaded successfully and which failed?

Advice on filtering GCP Cloud Storage event notifications, based on an object's prefix, when triggering a GCP Cloud Function?

Currently, I am moving services from AWS to GCP. Previously, I relied on an AWS S3 bucket and the inbuilt service's logic to configure event notifications to get triggered when an object with a particular prefix was inserted into my bucket. This specific event notification, which contained the prefix, would then be fed forward to trigger a lambda function.
However, now I have to leverage GCP Cloud Storage buckets and trigger a Cloud Function. My observations so far have been that I can't specify a prefix/suffix on my Cloud Storage bucket directly. Instead, I have to specify a Cloud Storage bucket to monitor during the creation of my Cloud Function. My concern with this approach is that I can't limit the bucket's object events to the three of interest to me ('_MANIFEST', '_PROCESSING' and '_PROCESSED') but rather have to pick a global event notification type such as 'OBJECT_FINALIZE'.
There are two viable approaches I can see to this problem:
Have all the 'OBJECT_FINALIZE' event notifications trigger the Cloud Function and filter out any additional objects (those which don't contain the prefix). The issue with this approach is the unnecessary activation of the Cloud Function and the additional log entries it generates, which are of no inherent value.
Use the audit logs generated by the Cloud Storage bucket and create rules to generate events based on the watched trigger files, i.e. '_MANIFEST', '_PROCESSING' and '_PROCESSED'. My concern with this approach is that I don't know how easy it will be to forward all the information about the bucket I'm interested in if I'm generating the event from a logging rule; I am primarily interested in the information that gets forwarded by an event notification. Also, I have verified that the objects being added to my Cloud Storage bucket are not public, and I have enabled the relevant audit logging.
However, I tried to filter the audit logs in the GCP 'Monitoring' service (after adding a _MANIFEST object to the bucket, of course) but the logs are not appearing within the Logs Explorer.
Any advice on how I should approach filtering the event notification of interest in GCP, when triggering my Cloud Function, would be greatly appreciated.
To achieve this, you can sink the Cloud Storage notifications into Pub/Sub.
Then, you can create a Pub/Sub push subscription to your Cloud Function (it's no longer a background function triggered by a Cloud Storage event, but an HTTP function triggered by an HTTP request).
The main advantage of doing this is that you can specify a filter on the Pub/Sub push subscription, so your Cloud Function (or any other HTTP endpoint) is activated only when the pattern matches.
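As an illustrative sketch (the project, topic, subscription, endpoint, and prefix below are placeholders), the push subscription can be created with a filter on the notification's objectId attribute, which carries the object name in Cloud Storage notifications:

```python
from google.cloud import pubsub_v1

# Placeholder names -- replace with your own project, topic, and endpoint.
project_id = "my-project"
topic_id = "gcs-notifications"          # topic the bucket notification publishes to
subscription_id = "manifest-objects-only"
push_endpoint = "https://example-region-my-project.cloudfunctions.net/process-manifest"

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "push_config": {"push_endpoint": push_endpoint},
            # Only messages whose objectId attribute starts with the watched
            # prefix are delivered to the endpoint; everything else is dropped.
            "filter": 'hasPrefix(attributes.objectId, "_MANIFEST")',
        }
    )
```

Note that a Pub/Sub filter can only be set when the subscription is created, not added afterwards, and the bucket itself still needs a notification configuration pointing at the topic (for example via gsutil notification create).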

Automatically start load to BigQuery when file is uploaded to bucket/cloud storage

We have a script exporting csv-files from another database and uploading them to a bucket on GCP cloud storage. Now I know there's the possibility to schedule loads into BigQuery using BigQuery Data Transfer Service but I am a bit surprised that there doesn't seem to be a solution which triggers automatically when a file-upload is finished.
Did I miss something?
You might need to handle that event (google.storage.object.finalize) by your own means.
For example, that event can trigger a Cloud Function (see Google Cloud Storage Triggers), which can do various things, from triggering a load job to implementing complex data processing (cleaning, validation, merging, etc.) while the data from the file is being loaded into the BigQuery table.
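As a minimal sketch of that pattern (the destination table and CSV options below are assumptions), a finalize-triggered function can hand the new object straight to a BigQuery load job:

```python
from google.cloud import bigquery

# Assumed destination table -- replace with your own project.dataset.table.
BQ_TABLE = "my-project.my_dataset.my_table"


def load_csv_to_bigquery(event, context):
    """Background function for the google.storage.object.finalize event."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,          # assumes the CSVs have a header row
        autodetect=True,              # or pass an explicit schema instead
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = bigquery.Client().load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()  # block until the load finishes so failures surface as function errors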

Moving objects from one GCS bucket to another Bucket using Terraform

I'd like to use Terraform to move multiple GCS bucket objects from one bucket to another bucket in a different location.
I read through Terraform documentation but I couldn't find anything substantial.
The Terraform Google Cloud Storage provider only handles creation of objects. What you can do as a workaround is use Terraform with the Storage Transfer Service, which schedules a job that transfers multiple objects into a GCS bucket, coming either from AWS S3 or from another GCS bucket.
Since this is a GCS to GCS transfer, you can take note of:
Under the transfer_spec block, specify gcs_data_source (rather than an S3 or HTTP source) to indicate that it is a GCS-to-GCS transfer.
The schedule block specifies when the transfer will start. If you intend to execute it just once, set schedule_end_date to the same date as the start date.
The Storage Transfer Service also offers a guided setup through the Google Cloud Console, should you want to try it out:
https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console#configure
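If you prefer to drive the same job from code rather than Terraform, here is a rough sketch using the google-cloud-storage-transfer client (the project, bucket names, and one-off date below are placeholders); the transfer_spec and schedule fields mirror the Terraform blocks mentioned above:

```python
from google.cloud import storage_transfer

# Placeholder values -- replace with your own project and buckets.
project_id = "my-project"
source_bucket = "source-bucket"
sink_bucket = "destination-bucket"
run_date = {"year": 2024, "month": 1, "day": 1}

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": project_id,
    "status": storage_transfer.TransferJob.Status.ENABLED,
    # GCS-to-GCS: both the source and the sink are Cloud Storage buckets.
    "transfer_spec": {
        "gcs_data_source": {"bucket_name": source_bucket},
        "gcs_data_sink": {"bucket_name": sink_bucket},
    },
    # Start and end on the same day so the job runs exactly once.
    "schedule": {
        "schedule_start_date": run_date,
        "schedule_end_date": run_date,
    },
}

job = client.create_transfer_job({"transfer_job": transfer_job})
print(f"Created transfer job: {job.name}")
```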

GCP-Storage: do files appear before upload is complete

I want to transfer files into a VM whenever a new file is added to storage; the problem is that I want the transfer to be done only when the upload is complete.
So my question is: do files appear even while the upload is still going on? In other words, if I build a program that looks for new files every second, would it start transferring files from GCS to the VM even if the upload is incomplete, or would the transfer start only once the upload is complete, and not while it is uploading?
Google Cloud Storage uploads are strongly consistent for object uploads. This means that the object is not visible until the object is 100% uploaded and any Cloud Storage housekeeping (such as replication) is complete. You cannot see nor access an object until the upload has completed and your software/tool receives a success response.
Google Cloud Storage Consistency
Do files appear even when the upload is still going on? Which means, if I build a program that looks for new files every second, would it transfer the files from GCS to the VM even if the upload is incomplete, or would the transfer start whenever the upload is complete and not while it is uploading?
No, your program will not see new objects until the new objects are 100% available. In Google Cloud Storage there are no partial uploads.
Files do not appear in the Cloud Storage UI until the file has been completely uploaded to the specified bucket by the user.
I've attached how Google Cloud Platform manages consistency in Google Cloud Storage buckets here.
You could use gsutil to list all the files in one of your Cloud Storage buckets at any moment, as stated here.
As for the application you are trying to develop, I highly suggest you use Google Cloud Functions in conjunction with triggers.
In this case, you could use the google.storage.object.finalize trigger in order to execute your function every time a new object is uploaded to one of your buckets. You can see examples of this application here.
The Cloud Function will ensure that your file is completely uploaded to the bucket before attempting to transfer the object to your GCE instance.
Therefore, after the upload completes, the only thing left will be to execute gcloud compute scp to copy the files to your Google Compute Engine instance via scp, as stated here.
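For completeness, here is what the polling approach from the question looks like in practice (the bucket name, local directory, and one-second interval are assumptions). It is safe only because of the strong consistency described above: listings never return partially uploaded objects. The finalize-triggered route above avoids polling entirely.

```python
import os
import time

from google.cloud import storage

# Assumed values -- replace with your bucket and a local directory on the VM.
BUCKET_NAME = "my-export-bucket"
LOCAL_DIR = "/data/incoming"

client = storage.Client()
seen = set()

while True:
    # list_blobs only returns objects whose upload has fully completed,
    # so there is no risk of copying a half-uploaded file.
    for blob in client.list_blobs(BUCKET_NAME):
        if blob.name in seen:
            continue
        destination = os.path.join(LOCAL_DIR, blob.name.replace("/", "_"))
        blob.download_to_filename(destination)
        seen.add(blob.name)
    time.sleep(1)
```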