I want to transfer files into a VM whenever a new file is added to storage, but I want the transfer to happen only once the upload is complete.
So my question is: do files appear even while the upload is still in progress? In other words, if I build a program that looks for new files every second, would it transfer files from GCS to the VM even if the upload is incomplete, or would the transfer only start once the upload is complete?
Google Cloud Storage uploads are strongly consistent. This means that an object is not visible until it is 100% uploaded and any Cloud Storage housekeeping (such as replication) is complete. You cannot see or access an object until the upload has completed and your software/tool receives a success response.
Google Cloud Storage Consistency
Do files appear even while the upload is still in progress? In other
words, if I build a program that looks for new files every second,
would it transfer files from GCS to the VM even if the upload is
incomplete, or would the transfer only start once the upload is
complete?
No, your program will not see new objects until they are 100% available; there are no partial uploads in Google Cloud Storage.
Files do not appear in the Cloud Storage UI until they have been completely uploaded to the specified bucket.
You can read about how Google Cloud Platform manages consistency in Cloud Storage buckets here.
You can use gsutil to list all the files in one of your Cloud Storage buckets at any moment, as stated here.
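If you do want to poll, the same listing can be done from code. Here is a minimal sketch, assuming the `google-cloud-storage` client library and a hypothetical bucket name; because uploads are strongly consistent, any name the listing returns is a fully uploaded object.

```python
def new_object_names(previous, current):
    """Return names present in `current` but not in `previous` (pure helper)."""
    return sorted(set(current) - set(previous))

def poll_bucket(bucket_name, seen):
    """List the bucket and return names not seen before.

    `seen` is a set carried across polls. The import is deferred so the
    sketch loads without the third-party google-cloud-storage library.
    """
    from google.cloud import storage  # requires google-cloud-storage
    client = storage.Client()
    names = [blob.name for blob in client.list_blobs(bucket_name)]
    fresh = new_object_names(seen, names)
    seen.update(names)
    return fresh
```

Polling every second works, but it wastes API calls compared to the event-driven approach described below.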
As for the application you are trying to develop, I highly suggest using Google Cloud Functions in conjunction with triggers.
In this case, you could use the google.storage.object.finalize trigger to execute your function every time a new object is uploaded to one of your buckets. You can see examples of this here.
The Cloud Function will ensure that the object has been fully uploaded to the bucket before attempting to transfer it to your GCE instance.
Therefore, after the upload completes, the only thing left is to execute gcloud compute scp to copy the files to your Google Compute Engine instance via scp, as stated here.
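A rough sketch of such a function, assuming a hypothetical instance name and zone, and assuming the function runs somewhere the gcloud CLI is installed (it is not available in the default Cloud Functions runtime, so in practice you might shell out from a VM or use an SSH library instead):

```python
import subprocess

# Hypothetical instance name and zone; adjust for your project.
INSTANCE = "my-vm"
ZONE = "us-central1-a"

def build_scp_command(name, instance=INSTANCE, zone=ZONE):
    """Assemble the gcloud compute scp invocation for a finalized object."""
    return [
        "gcloud", "compute", "scp",
        f"/tmp/{name}",            # local copy downloaded from the bucket
        f"{instance}:~/{name}",    # destination on the VM
        "--zone", zone,
    ]

def transfer_object(event, context):
    """Entry point for the google.storage.object.finalize trigger.

    The import is deferred so the sketch loads without the third-party
    google-cloud-storage library.
    """
    from google.cloud import storage
    bucket, name = event["bucket"], event["name"]
    storage.Client().bucket(bucket).blob(name).download_to_filename(f"/tmp/{name}")
    subprocess.run(build_scp_command(name), check=True)
```

Because the finalize event only fires once the object is fully written, the function never sees a half-uploaded file.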
Related
I am new to GCP and was wondering whether what I am trying to achieve is possible.
I have a Dataflow job which creates a CSV file on a daily basis and stores it in a GCS bucket. This file is overwritten every day.
What I want to do is, when the file is created or overwritten, automatically transfer it to a WebDAV server. I need to schedule this process on a daily basis.
Is this possible to set up within GCS?
Any advice is appreciated.
I have been looking at cloud file transfers and Data Transfer, but neither is the right fit.
You can use Cloud Functions to trigger a transfer of the file whenever it is created or overwritten in the GCS bucket. According to the Cloud Storage Triggers documentation:
In Cloud Functions, a Cloud Storage trigger enables a function to be
called in response to changes in Cloud Storage. When you specify a
Cloud Storage trigger for a function, you choose an event type and
specify a Cloud Storage bucket. Your function will be called whenever
a change occurs on an object (file) within the specified bucket.
object.finalize - Triggered when a new object is created, or an existing object is overwritten and a new generation of that object is
created.
Check this Cloud Storage function tutorial for an example of writing, deploying, and calling a function with a Cloud Storage trigger.
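A minimal sketch of such a function is below. The WebDAV base URL and credentials are hypothetical placeholders; WebDAV creates or overwrites a file with a plain HTTP PUT, so no scheduling is needed, as the trigger fires on every daily overwrite.

```python
# Hypothetical WebDAV endpoint; replace with your server.
WEBDAV_BASE = "https://webdav.example.com/reports"

def webdav_url(base, object_name):
    """Build the WebDAV destination URL for an uploaded object (pure helper)."""
    return f"{base.rstrip('/')}/{object_name}"

def forward_to_webdav(event, context):
    """Entry point for the google.storage.object.finalize trigger.

    Imports are deferred so the sketch loads without the third-party
    requests and google-cloud-storage libraries.
    """
    import requests
    from google.cloud import storage
    bucket, name = event["bucket"], event["name"]
    data = storage.Client().bucket(bucket).blob(name).download_as_bytes()
    # WebDAV uses HTTP PUT to create or overwrite a file at the target path.
    resp = requests.put(webdav_url(WEBDAV_BASE, name), data=data,
                        auth=("user", "password"))  # placeholder credentials
    resp.raise_for_status()
```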
Can someone point me to step-by-step instructions for how to track how many times the files in a Google Cloud Storage bucket were accessed or downloaded? Yes, I know I can create a sink in GCP Logging to export logs to BigQuery. But it is not clear to me what the inclusion filter should be to only export GCS access logs, nor is it clear to me how I would query the log entries.
It shouldn't be hard to track how many times a GCS file is read or downloaded, but I have not been able to find a step-by-step tutorial that shows how to do it.
With Data Access audit logs enabled for Cloud Storage, you can filter on the storage.objects.get API call.
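As a sketch of what that inclusion filter could look like (assuming Data Access audit logs are enabled for Cloud Storage; the bucket name is a placeholder):

```python
def download_filter(bucket_name):
    """Build a Cloud Logging filter matching object-read audit entries."""
    return (
        'resource.type="gcs_bucket" '
        f'resource.labels.bucket_name="{bucket_name}" '
        'protoPayload.methodName="storage.objects.get"'
    )
```

The same string can be used as the inclusion filter of a logging sink exporting to BigQuery, where counting downloads is then a GROUP BY over the object name in the exported entries.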
I was using the GCP Data Transfer Service to transfer a set of files in an S3 bucket over to a GCP storage bucket. The files were in Glacier, so I first restored them and then tried copying them. The transfer job runs without any errors but ignores the Glacier-restored files. Is this the expected behavior? If it is, it seems like a huge oversight not to mention it in the documentation; I could easily imagine a scenario where you think you've mirrored a bucket when you really haven't.
I was inspecting the infrastructure pieces I have on my Google Cloud project to remove any loose ends...
Then I noticed that Cloud Storage has 5 buckets, even though I only created 2 of them.
These 5 buckets are:
1 - bucket I created
2 - bucket I created
3 - PROJECT.backups
4 - gcf-sources-CODE-us-central1
5 - us.artifacts.PROJECT.appspot.com
I understand that the backups bucket comes from Firebase Realtime Database backups and the sources bucket comes from the Firebase Cloud Functions code. BUT where does the artifacts bucket come from? This bucket alone is TWICE the size of all the other buckets together.
Its contents are just binary files named like "sha256:HASH", some of which are larger than 200 MB.
I deleted this bucket and it was re-created [without my interaction] again the next day.
Does anyone know what might be using it? How can I track it down? What is it for?
The us.artifacts.<project id>.appspot.com bucket is created and used by Cloud Build to store container images generated by the Cloud Build service. One of the processes that generates objects in this bucket is Cloud Functions, and you can tell because the first time you create a function, GCP asks you to enable the Cloud Build API, and this bucket appears in the Cloud Storage section. App Engine also stores objects in this bucket each time you deploy a new version of an app.
As mentioned in the documentation, in the case of App Engine, once the deployment has completed, the images in the us.artifacts.<project id>.appspot.com bucket are no longer needed, so it is safe to delete them. However, if you are only using Cloud Functions, it is not recommended to delete the objects in this bucket. Although you are not experiencing issues now, you might in the future, so instead of deleting all of the objects manually, you can use Object Lifecycle Management to delete the objects in this bucket after a certain period of time, for instance, every 7 days. You can do this by navigating to the Lifecycle tab of the us.artifacts.<project id>.appspot.com bucket and adding a new lifecycle rule which deletes objects older than X days.
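The same lifecycle rule can be applied from the command line instead of the console. A sketch, assuming a 7-day age condition as above; the generated JSON would be applied with `gsutil lifecycle set rule.json gs://us.artifacts.<project id>.appspot.com`:

```python
import json

# Lifecycle configuration deleting objects older than 7 days.
LIFECYCLE_RULE = {
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 7}}
    ]
}

def write_rule(path="rule.json"):
    """Write the lifecycle configuration to a file for gsutil to consume."""
    with open(path, "w") as f:
        json.dump(LIFECYCLE_RULE, f, indent=2)
```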
This is your Docker registry. Each time you push an image (either via docker push or by using the Cloud Build service), GCP stores the image layers in these buckets.
Is it possible to get per file statistics (or at least download count) for files in google cloud storage?
I want to find the number of downloads for a js plugin file to get an idea of how frequently these are used (in client pages).
Yes, it is possible, but it has to be enabled.
The official recommendation is to create another bucket for the logs generated by the main bucket that you want to track.
gsutil mb gs://<some-unique-prefix>-example-logs-bucket
then grant Cloud Storage the roles/storage.legacyBucketWriter role on the bucket:
gsutil iam ch group:cloud-storage-analytics@google.com:legacyBucketWriter gs://<some-unique-prefix>-example-logs-bucket
and finally enable logging for your main bucket:
gsutil logging set on -b gs://<some-unique-prefix>-example-logs-bucket gs://<main-bucket>
Generate some activity on your main bucket, then wait up to an hour (usage logs are generated hourly and storage logs daily). You will then be able to browse these events in the logs bucket created in step 1:
(Screenshot of the generated log objects: https://imgur.com/a/fncnxwM)
More info can be found at https://cloud.google.com/storage/docs/access-logs
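Once the usage log CSVs land in the logs bucket, counting downloads per object is a matter of filtering on the GET method. A sketch, assuming the documented usage-log field names (including "cs_method" and "cs_object"):

```python
import csv
import io
from collections import Counter

def count_downloads(csv_text):
    """Count GET requests per object in a GCS usage-log CSV.

    Assumes the documented usage-log fields, including "cs_method"
    and "cs_object"; rows without an object name (e.g. bucket
    listings) are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("cs_method") == "GET" and row.get("cs_object"):
            counts[row["cs_object"]] += 1
    return counts
```

Feeding this the hourly usage-log files (downloaded with gsutil cp from the logs bucket) gives a per-file download count without BigQuery.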
In most cases, using Cloud Audit Logs is now recommended instead of using legacyBucketWriter.
Logging to a separate Cloud Storage bucket with legacyBucketWriter produces CSV files, which you would then have to load into BigQuery yourself to make them actionable, and this happens far from real time. Cloud Audit Logs are easier to set up and work with by comparison, and log entries are delivered almost instantly.