Accessing and logging the amount of S3 POST/GETs by filename - amazon-web-services

I have several JSON files in an S3 bucket. I need to do a monthly count of the amount of put/gets each file receives in a month.
Can these be extracted via CSV or even accessed via an API? I have looked at Cloudwatch and there doesn't appear to be an option for this, or within the billing dashboard.
If this feature doesn't exist, are there any workarounds such as a Lamba function with a counter?

Enable bucket logs under -
s3 > bucket > properties > server access logging > configure target
bucket/prefix
Use Athena to query this data using simple SQL statements. Read more about Athena HERE

You can set up access logging for S3 buckets.
https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html
Then you get the ability to export these logs. After that you can do anything with the logs. E.g. a bash script that can count how many requests each file gets.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html

Related

How to make my log sync bucket persistent in GCP

I am storing GCP's cloud logging in a log bucket, but there is a limit to the storage period.
I would like to store the log bucket permanently in another bucket as a backup, is there a good way to do this?
You can keep the logs up to 10 years in a custom log bucket.
If you need more, you can export the logs to Cloud Storage and archive them.
If you need query capability beyond 10 years, you can export the logs to BigQuery

google cloud storage per file statistics/downloads

Is it possible to get per file statistics (or at least download count) for files in google cloud storage?
I want to find the number of downloads for a js plugin file to get an idea of how frequently these are used (in client pages).
Yes, it is possible, but it has to be enabled.
The official recommendation is to create another bucket for the logs generated by the main bucket that you want to trace.
gsutil mb gs://<some-unique-prefix>-example-logs-bucket
then assign Cloud Storage the roles/storage.legacyBucketWriter role for the bucket:
gsutil iam ch group:cloud-storage-analytics#google.com:legacyBucketWriter gs://<some-unique-prefix>-example-logs-bucket
and finally enable the logging for your main bucket:
gsutil logging set on -b gs://example-logs-bucket gs://<main-bucket>
Generate some activity on your main bucket, then wait for one hour at most, hence the reports are not generated hourly and daily. You will be able to browse these events on the logs-bucket created at step 1:
https://imgur.com/a/fncnxwM (imgur is down at the moment..I will fix this image later)
More info, can be found at https://cloud.google.com/storage/docs/access-logs
In most cases, using Cloud Audit Logs is now recommended instead of using legacyBucketWriter.
Logging to a separate Cloud Storage bucket with legacyBucketWriter produces csv files, which you would have to then load into BigQuery yourself to make them actionable, and this would be done far from in real time. Cloud Audit Logs are easier to set up and work with by comparison, and logs are delivered almost instantly.

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your usecase.
You have a few options:
Tag each S3 Object (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The Tag could then be updated by a Lambda Function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a Scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs on, write a custom function to query the last access times from Object Level CloudTrail Logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, can you probably delete them. But you would be the only person who would know whether they are necessary.
There is recent AWS blog post which I found very interesting and cost optimized approach to solve this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

Amazon AWS Athena -- Removing Temp Files

Anyone know how to get rid of all the temporary files that get created in the S3 buckets when using Athena to query?
Is there some setting or option to disable these -- or criteria to filter how to remove them?
I'm using JDBC connection via linux to select from my S3 bucket.
Amazon Athena creates files in Amazon S3 with the output of all Athena queries. This is beneficial, because the output can then be used in a subsequent process. Also, it could avoid the need to re-run queries which is useful because Athena is charged based on data scanned for each query.
If you do not wish to keep these output files, or if you wish to remove them after a period of time, the easiest method is to configure Object Lifecycle Management on the Amazon S3 bucket. Simply create an expiration policy that deletes the files after a certain number of days. The files will then be deleted each night (or thereabouts).

process s3 access logs using AWS datapipeline

My usecase is to process S3 access logs(having those 18 fields) periodically and push to table in RDS. I'm using AWS data pipeline for this task to run everyday to process previous day's logs.
I decided to split the task into two activities
1. Shell Command Activity : To process s3 access logs and create a csv file
2. Hive Activity : To read data from csv file and insert to RDS table.
My input s3 bucket has lots of log files hence first activity fails due to out of memory error while staging. However i don't want to stage all the logs, staging the previous day's log is enough for me. I searched around internet but didn't get any solution. How do i achieve this ? Is my solution the optimal one ? Does any solution better than this exist ? Any suggestions will be helpful
Thanks in Advance
You can define your S3 data node use timestamps. For e.g. you can say the directory path is
s3://yourbucket/ #{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}
Since your log files should have a timestamp in the name (or they could be organized by timestamped directories).
This will only stage the files matching that pattern.
You may be recreating a solution that is already done by Logstash (or more precisely the ELK stack).
http://logstash.net/docs/1.4.2/inputs/s3
Logstash can consume S3 files.
Here is a thread on reading access logs from S3
https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A
We use Splunk (not-free) that has the same capabilities through its AWS plugin.
May I ask why are you pushing the access logs to RDS?
ELK might be a great solution for you. You can build it on your own or use ELK-as-a-service from Logz.io (I work for Logz.io).
It enables you to easily define an S3 bucket, get all your logs read regularly from the bucket and ingested by ELK and view them in preconfigured dashboards.