Is it possible to get per-file statistics (or at least a download count) for files in Google Cloud Storage?
I want to find the number of downloads for a JS plugin file to get an idea of how frequently it is used in client pages.
Yes, it is possible, but access logging has to be enabled first.
The official recommendation is to create a separate bucket for the logs generated by the main bucket that you want to track:
gsutil mb gs://<some-unique-prefix>-example-logs-bucket
then grant the cloud-storage-analytics@google.com group the roles/storage.legacyBucketWriter role on the logs bucket:
gsutil iam ch group:cloud-storage-analytics@google.com:legacyBucketWriter gs://<some-unique-prefix>-example-logs-bucket
and finally enable logging on your main bucket:
gsutil logging set on -b gs://<some-unique-prefix>-example-logs-bucket gs://<main-bucket>
Generate some activity on your main bucket, then wait a while, since usage logs are generated hourly (and storage logs daily). You will then be able to browse these logs in the logs bucket created in the first step.
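Once reports start appearing, you can list and download them with gsutil. A rough sketch (bucket names follow the placeholders above; usage log objects are normally named <main-bucket>_usage_<timestamp>_<id>_v0):

gsutil ls gs://<some-unique-prefix>-example-logs-bucket
gsutil cp gs://<some-unique-prefix>-example-logs-bucket/<main-bucket>_usage_* .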
More info can be found at https://cloud.google.com/storage/docs/access-logs
In most cases, using Cloud Audit Logs is now recommended instead of legacyBucketWriter.
Logging to a separate Cloud Storage bucket with legacyBucketWriter produces CSV files, which you then have to load into BigQuery yourself to make them actionable, and that happens far from real time. Cloud Audit Logs are easier to set up and work with by comparison, and log entries are delivered almost instantly.
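For example, once Data Access audit logs are enabled for Cloud Storage, recent downloads can be pulled straight from Cloud Logging. A hedged sketch (the bucket name is a placeholder) that counts storage.objects.get calls per object over the last week:

gcloud logging read \
  'resource.type="gcs_bucket" AND resource.labels.bucket_name="my-bucket" AND protoPayload.methodName="storage.objects.get"' \
  --freshness=7d \
  --format='value(protoPayload.resourceName)' | sort | uniq -c | sort -rn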
Related
Can someone point me to step-by-step instructions for how to track how many times the files in a Google Cloud Storage bucket were accessed or downloaded? Yes, I know I can create a sink in GCP Logging to export logs to BigQuery. But it is not clear to me what the inclusion filter should be to only export GCS access logs, nor is it clear to me how I would query the log entries.
It shouldn't be hard to track how many times a GCS file is read or downloaded, but I have not been able to find a step-by-step tutorial that shows how to do it.
With audit logs, you can filter on the storage.objects.get API call.
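For example (a hedged sketch, not verified against your project), a sink inclusion filter restricted to downloads from a single bucket could be:

resource.type="gcs_bucket"
resource.labels.bucket_name="my-bucket"
protoPayload.methodName="storage.objects.get"

Once the sink is exporting to BigQuery, a per-object download count might look like the following, where the dataset name is a placeholder and the table/column names are the documented defaults for audit log exports:

bq query --use_legacy_sql=false '
SELECT protopayload_auditlog.resourceName AS object, COUNT(*) AS downloads
FROM `my_dataset.cloudaudit_googleapis_com_data_access`
WHERE protopayload_auditlog.methodName = "storage.objects.get"
GROUP BY object
ORDER BY downloads DESC'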
I want to transfer files into a VM whenever a new file is added to storage. The problem is that I want the transfer to happen only once the upload is complete.
So my question is: do files appear even while the upload is still in progress? In other words, if I build a program that looks for new files every second, would it transfer files from GCS to the VM even if the upload is incomplete, or would the transfer only start once the upload has finished?
Google Cloud Storage is strongly consistent for object uploads. This means that an object is not visible until it is 100% uploaded and any Cloud Storage housekeeping (such as replication) is complete. You cannot see or access an object until the upload has completed and your software/tool receives a success response.
Google Cloud Storage Consistency
Do files appear even when the upload is still going on? Which means, if I build a program that looks for new files every second, would it transfer the files from GCS to the VM even if the upload is incomplete, or would the transfer start whenever the upload is complete and not while it is uploading?
No, your program will not see new objects until they are 100% available. In Google Cloud Storage there are no partial uploads.
Files do not appear in the Cloud Storage UI until they have been completely uploaded to the specified bucket.
You can read how Google Cloud Platform manages consistency in Cloud Storage buckets here.
You could use gsutil to list all the files in one of your Cloud Storage buckets at any moment, as stated here.
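For instance (the bucket and object names below are placeholders), you could list what is currently visible, or poll with gsutil stat, which only succeeds once an object's upload has fully completed:

gsutil ls -r gs://my-upload-bucket
until gsutil -q stat gs://my-upload-bucket/incoming/data.csv; do sleep 1; done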
As for the application you are trying to develop, I highly suggest you use Google Cloud Functions in conjunction with triggers.
In this case, you could use the google.storage.object.finalize trigger to execute your function every time a new object is uploaded to one of your buckets. You can see examples of this application here.
The Cloud Function will ensure that your object has been fully uploaded to the bucket before attempting to transfer it to your GCE instance.
Then, once the upload has completed, the only thing left is to run gcloud compute scp to copy the files to your Google Compute Engine instance via scp, as stated here.
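As a rough sketch (the function name, bucket, instance and zone are placeholders, and the function body itself is omitted), the deployment and the copy step could look like:

# deploy a function that fires every time an object finishes uploading
gcloud functions deploy on_upload_complete --runtime python39 --trigger-event google.storage.object.finalize --trigger-resource my-upload-bucket
# copy a finished file to the VM over scp
gcloud compute scp /tmp/data.csv my-instance:/data/ --zone us-central1-a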
I have a requirement: we have a web application, and from that application we download the logs by clicking a Download button (manually).
After downloading, we upload the logs into S3 using the AWS CLI and then process the data.
Can we automate this?
Please help me automate this if possible.
Thanks in advance.
You can create a Lambda function, with an appropriate IAM role, that collects the logs and moves them into an S3 bucket, even adding a timestamp to the keys. You can also schedule it using CloudWatch Events if you want the log backup to run at a specific time.
You can also use Ansible or Jenkins for this task. In Jenkins you can create a job (there is even an S3 plugin available) and simply run it to copy the logs to your S3 bucket.
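If a full Lambda setup is more than you need, a minimal sketch of the same idea with the AWS CLI on a schedule could be the following (the paths, bucket name and time are placeholders; note that % must be escaped in crontab):

# crontab entry: copy the application log to S3 under a dated key every night at 01:00
0 1 * * * aws s3 cp /var/log/myapp/app.log "s3://my-log-bucket/logs/$(date +\%Y-\%m-\%d)/app.log"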
I have a CSV hosted on a server which updates daily. I'd like to set up a transfer to load this into Google Cloud Storage so that I can then query it using BigQuery.
I'm looking at the Storage Transfer Service and it doesn't seem to have what I need, e.g. it only seems to accept files from other Google Cloud Storage buckets or Amazon S3 buckets.
Thanks in advance
You can give the Storage Transfer Service the URL of a TSV file listing the objects to fetch, as explained here, and configure the transfer to run daily at the time of your choice.
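As a rough illustration (the URL is a placeholder), the TSV is just a URL list with a version header on its first line; see the linked docs for the additional size and MD5 columns:

TsvHttpData-1.0
https://example.com/exports/daily.csv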
Alternatively, if that still doesn't fit your needs, you can install gsutil on the remote machine, use the gsutil rsync command, and schedule it to run daily.
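For instance, a crontab entry on the remote machine might look like this (the directory, bucket name and time are placeholders, and gsutil is assumed to be installed and authenticated):

# sync the directory containing the exported CSV into the bucket every day at 06:00
0 6 * * * gsutil rsync -r /srv/exports gs://my-bucket/daily-csvs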
My use case is to process S3 access logs (with their 18 fields) periodically and push them to a table in RDS. I'm using AWS Data Pipeline for this task, running every day to process the previous day's logs.
I decided to split the task into two activities:
1. ShellCommandActivity: to process the S3 access logs and create a CSV file
2. HiveActivity: to read data from the CSV file and insert it into the RDS table.
My input S3 bucket has lots of log files, so the first activity fails with an out-of-memory error while staging. However, I don't want to stage all the logs; staging the previous day's logs is enough for me. I searched around the internet but didn't find a solution. How do I achieve this? Is my solution optimal? Does a better solution exist? Any suggestions will be helpful.
Thanks in advance.
You can define your S3 data node using timestamps. For example, you can say the directory path is
s3://yourbucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}
since your log files should have a timestamp in their names (or be organized into timestamped directories).
This will only stage the files matching that pattern.
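Since the goal here is specifically the previous day's logs, a hedged variant (the bucket and prefix are placeholders; minusDays is a standard Data Pipeline expression function) would be:

s3://yourbucket/access-logs/#{format(minusDays(@scheduledStartTime, 1), 'YYYY-MM-dd')}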
You may be recreating a solution that already exists in Logstash (or, more precisely, the ELK stack).
http://logstash.net/docs/1.4.2/inputs/s3
Logstash can consume S3 files.
Here is a thread on reading access logs from S3
https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A
We use Splunk (not free), which has the same capabilities through its AWS plugin.
May I ask why you are pushing the access logs to RDS?
ELK might be a great solution for you. You can build it on your own or use ELK-as-a-service from Logz.io (I work for Logz.io).
It lets you easily point at an S3 bucket, have all your logs read regularly from the bucket and ingested by ELK, and view them in preconfigured dashboards.