Firehose to S3 vs Direct PUT to S3

I have a requirement to PUT about 20 records per second to an S3 bucket.
That comes to roughly 20 * 60 * 60 * 24 * 30 = 51,840,000 PUTs per month.
I do not need any transformations, but I would certainly want the uploaded objects to be GZIP-compressed and partitioned by year/month/day/hour.
Option 1 - Just do PutObject calls on S3
PUT request cost comes to roughly $260 a month (51,840,000 x $0.005 per 1,000 requests ≈ $259)
I would have to handle the GZIP compression and partitioning on the client side
Option 2 - Introduce a Firehose and wire it to S3
Let's say I buffer and flush only once every 10 minutes; that is about 6 * 24 * 30 = 4,320 PUTs a month, so the S3 request cost drops to roughly $0.02. With each record about 20 KB, Firehose ingests roughly 1,000 GB a month, and at $0.029 per GB that comes to about $30. So the total is about $30. Costs for data transfer, storage, etc. are the same in both approaches, I believe.
Firehose provides GZIP compression, partitioning, and buffering for me out of the box (see the sketch below).
It appears that Option 2 is the best fit for my use case. Am I missing something here?
Thanks for looking!
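For reference, a minimal boto3 sketch of what the Option 2 setup could look like; the stream name, role ARN, bucket ARN, and prefix are placeholders, and the buffering hints simply mirror the 10-minute assumption above:
import boto3
firehose = boto3.client('firehose')
# All names and ARNs below are placeholders -- replace with your own.
firehose.create_delivery_stream(
    DeliveryStreamName='records-to-s3',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-target-bucket',
        # Firehose's default prefix already partitions objects by YYYY/MM/dd/HH (UTC);
        # it is spelled out here for clarity.
        'Prefix': 'data/!{timestamp:yyyy/MM/dd/HH}/',
        'ErrorOutputPrefix': 'errors/!{firehose:error-output-type}/',
        'CompressionFormat': 'GZIP',
        # Delivery happens when either buffering limit is reached first.
        'BufferingHints': {'SizeInMBs': 128, 'IntervalInSeconds': 600},
    },
)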

Related

How to increase your Quicksight SPICE data refresh frequency

Quicksight only supports 24 refreshes per 24 hours for a FULL REFRESH.
I want to refresh the data every 30 minutes.
Answer:
Scenario:
Let us say I want to fetch the data from the source (Jira) and push it to SPICE and render it in Quicksight Dashboards.
Requirement:
Push the data once every 30 minutes.
Quicksight supports the following:
Full refresh
Incremental refresh
Full refresh:
Process - Old data is replaced with new data.
Frequency - Once every hour
Refresh count - 24 / day
Incremental refresh:
Process - New data gets appended to the dataset.
Frequency - Once every 15 minutes
Refresh count - 96 / day
Issue:
We need to push the data once every 30 minutes.
It is going to be a FULL_REFRESH.
When it comes to a full refresh, Quicksight's schedule only supports hourly refreshes.
Solution:
We can leverage API support from AWS.
Package - Python Boto3
Class - QuickSight.Client
Method - create_ingestion
Process - You can manually refresh datasets by starting a new SPICE ingestion.
Refresh cycle: Each 24-hour period is measured starting 24 hours before the current date and time.
Limitations:
Enterprise edition accounts: 32 times in a 24-hour period.
Standard edition accounts: 8 times in a 24-hour period.
Sample code:
Python - Boto for AWS:
import boto3
client = boto3.client('quicksight')
# The IDs below are placeholders -- substitute your own values.
response = client.create_ingestion(
    DataSetId='your-dataset-id',
    IngestionId='jira_data_sample_refresh',
    AwsAccountId='123456789012',
    IngestionType='FULL_REFRESH'  # or 'INCREMENTAL_REFRESH'
)
awswrangler:
import awswrangler as wr
# Start a new SPICE ingestion (manual refresh) for the dataset by name.
wr.quicksight.create_ingestion(dataset_name="jira_db", ingestion_id="jira_data_sample_refresh")
CLI:
aws quicksight create-ingestion --data-set-id dataSetId --ingestion-id jira_data_sample_ingestion --aws-account-id AwsAccountId --region us-east-1
API:
PUT /accounts/AwsAccountId/data-sets/DataSetId/ingestions/IngestionId HTTP/1.1
Content-type: application/json
{
"IngestionType": "string"
}
Conclusion:
Using this approach we can achieve up to 56 full refreshes per day for our dataset (24 scheduled hourly refreshes plus 32 API-triggered ingestions on Enterprise edition). We can also go one step further, look at the peak hours of our source tool (Jira), and concentrate the API-triggered refreshes there; during those hours the effective refresh frequency can approach once every 10 minutes.
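If the API call is driven by a 30-minute schedule (cron, EventBridge, or similar; the scheduler itself is not shown here), each call just needs a unique ingestion ID. A minimal sketch, reusing the placeholder account and dataset IDs from above:
import datetime
import boto3
client = boto3.client('quicksight')
def refresh_spice_dataset():
    # Derive a unique ingestion ID from the current time.
    ingestion_id = 'jira-refresh-' + datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S')
    return client.create_ingestion(
        DataSetId='your-dataset-id',      # placeholder
        IngestionId=ingestion_id,
        AwsAccountId='123456789012',      # placeholder
        IngestionType='FULL_REFRESH'
    )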
Ref:
Quicksight
Quicksight Gallery
SPICE
Boto - Python
Boto - Create Ingestion
AWS Wrangler
CLI
API

Costs of enabling versioning in Amazon S3

I have a question about the costs of versioning in Amazon S3 that don't seem to be covered in the guide. There is a cost for every PUT/POST, but for versioned objects (especially when you keep older versions in alternative storage such as Glacier), does each PUT/POST cost 2x the PUT/POST price: one for the new version and one to move the old version to Glacier?
You can refer to the FAQ page: https://aws.amazon.com/s3/faqs/?nc1=h_ls
Q: How am I charged for using Versioning?
Normal Amazon S3 rates apply for every version of an object stored or requested. For example, let's look at the following scenario to illustrate storage costs when utilizing Versioning (let's assume the current month is 31 days long):
1) Day 1 of the month: You perform a PUT of 4 GB (4,294,967,296 bytes) on your bucket.
2) Day 16 of the month: You perform a PUT of 5 GB (5,368,709,120 bytes) within the same bucket using the same key as the original PUT on Day 1.
When analyzing the storage costs of the above operations, please note that the 4 GB object from Day 1 is not deleted from the bucket when the 5 GB object is written on Day 16. Instead, the 4 GB object is preserved as an older version and the 5 GB object becomes the most recently written version of the object within your bucket. At the end of the month:
Total Byte-Hour usage: [4,294,967,296 bytes x 31 days x (24 hours / day)] + [5,368,709,120 bytes x 16 days x (24 hours / day)] = 5,257,039,970,304 Byte-Hours.
Conversion to Total GB-Months: 5,257,039,970,304 Byte-Hours x (1 GB / 1,073,741,824 bytes) x (1 month / 744 hours) = 6.581 GB-Month
The fee is calculated based on the current rates for your region on the Amazon S3 Pricing page.
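To make the byte-hour arithmetic concrete, here is the same calculation as a small Python sketch (744 is simply 31 days x 24 hours):
GIB = 1_073_741_824          # bytes per GB (GiB), as used in the FAQ
HOURS_IN_MONTH = 31 * 24     # 744 hours in a 31-day month
first_put = 4 * GIB          # 4 GB object, stored for all 31 days
second_put = 5 * GIB         # 5 GB object, stored for the last 16 days
byte_hours = first_put * 31 * 24 + second_put * 16 * 24
gb_months = byte_hours / GIB / HOURS_IN_MONTH
print(byte_hours)            # 5257039970304
print(round(gb_months, 3))   # 6.581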

How much does it cost to use Amazon S3 as a video streaming backend?

We are developing a video streaming site.
For that we want to use Amazon S3 storage.
But I can't understand the pricing structure; the price calculator is also confusing.
Please tell me how much it will cost for the calculation below.
100 videos uploaded via the S3 API, each 500 MB in size (region: London)
=> So 100 x 500 MB = ~50 GB of storage used
Each video requested (GET) 1,000 times
=> 100 * 1,000 = 100,000 GET requests
Each video viewed 1,000 times
=> So (500 MB x 1,000) x 100 = ~50 TB of bandwidth used
Now please work out how much it will cost, step by step.
AWS has an official pricing tool (the AWS Pricing Calculator) that is helpful for estimating service costs.
The estimate for your S3 cost is: $448.69 / month.
You can see the full workings, and update the calculation, in the calculator.
100 videos uploaded at 500 MB each (about 50 GB stored):
50 GB * $0.023 per GB-month ≈ $1.15
Each video requested 1,000 times:
(100 * 1,000) / 1,000 * $0.0004 = $0.04
Each video viewed 1,000 times:
S3 doesn't charge for a video being viewed as such; the charges are for the GET request and for the data transferred out of S3 (not included in the figures above), I believe.
So your total for storage and requests is about $1.19 (a rough breakdown is sketched below).
Here’s a link to s3 pricing for storage & request:
https://aws.amazon.com/s3/pricing/
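To make the step-by-step arithmetic easier to follow, here is a rough Python sketch. The per-unit prices are assumptions based on published S3 rates (storage about $0.024/GB-month in London, GETs $0.0004 per 1,000, data transfer out about $0.09/GB for the first tier); check the pricing page above for current eu-west-2 numbers:
# Rough S3 cost sketch for the scenario above; prices are assumptions, verify on the pricing page.
STORAGE_PER_GB_MONTH = 0.024   # S3 Standard storage, approx. London rate
GET_PER_1000 = 0.0004          # GET request price per 1,000 requests
TRANSFER_OUT_PER_GB = 0.09     # data transfer out to the internet, first tier
videos = 100
size_gb = 0.5                  # 500 MB per video
views_per_video = 1000
storage_gb = videos * size_gb                     # ~50 GB stored
get_requests = videos * views_per_video           # 100,000 GETs
transfer_gb = storage_gb * views_per_video        # ~50,000 GB transferred out
storage_cost = storage_gb * STORAGE_PER_GB_MONTH
request_cost = get_requests / 1000 * GET_PER_1000
transfer_cost = transfer_gb * TRANSFER_OUT_PER_GB # dominates the bill
print(round(storage_cost, 2), round(request_cost, 2), round(transfer_cost, 2))
# roughly 1.2 0.04 4500.0 -- data transfer out is by far the biggest component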

AWS Lambda function cache: best way to cache data for 24 hours

I am trying to cache the data (for every user I have in a list) from an external URL request in an AWS Lambda function, for roughly 24 hours or less.
For instance:
cache.set(`response-${user}`, data, 24 * 60 * 60 * 100)
I am looking for the best way to save the response; I read that Lambda has limited storage for this.
Also, I read that I can use ElastiCache, but currently I am using memory-cache with Node.
Thank you,

S3 vs DynamoDB for GPS data

I have the following situation that I am trying to find the best solution for.
A device writes its GPS coordinates every second to a CSV file and uploads the file to S3 every x minutes before starting a new CSV.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a Lambda function to automatically save the CSV data to a DynamoDB record
Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However, both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (one record per second). So if I use 10-minute chunks, this results in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so this would require 24 units for a single write. Reads (4 kB/unit) are even worse, since a user may request timeframes greater than 10 minutes: a request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This will just be too expensive considering multiple users (see the quick calculation below).
When I read directly from S3, I run into the browser's limit on concurrent requests. The 6-hour timespan, for instance, would cover 36 files; this might still be acceptable, considering a connection limit of 6, but a request for 24 hours (= 144 files) would just take too long.
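For reference, the capacity arithmetic from the first drawback as a quick Python calculation (it mirrors the assumptions in the question: one 24 kB item per 10-minute chunk, 1 kB per write unit, 4 kB per read unit):
import math
BYTES_PER_RECORD = 40          # ~40 bytes of GPS data per second
RECORDS_PER_CHUNK = 10 * 60    # one record per second, 10-minute chunks
chunk_kb = BYTES_PER_RECORD * RECORDS_PER_CHUNK / 1000   # 24 kB per item
write_units = math.ceil(chunk_kb / 1)                    # 1 write unit covers 1 kB -> 24
chunks_per_6h = 6 * 6                                    # 36 chunks in 6 hours
read_units = math.ceil(chunks_per_6h * chunk_kb / 4)     # 1 read unit covers 4 kB -> 216
print(write_units, read_units)                           # 24 216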
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
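For illustration, a minimal boto3 sketch of the key-listing approach described above, assuming keys of the form deviceid_YYYY-MM-DDTHHMMSS (the bucket name is a placeholder):
import boto3
s3 = boto3.client('s3')
# List all keys for one device and one day by prefix, then narrow to 8am-2pm
# with a simple lexicographic filter on the timestamp portion of the key.
paginator = s3.get_paginator('list_objects_v2')
keys = []
for page in paginator.paginate(Bucket='my-gps-bucket', Prefix='deviceid_2016-11-11T'):
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])
keys_in_range = [k for k in keys
                 if 'deviceid_2016-11-11T080000' <= k <= 'deviceid_2016-11-11T140000']
print(keys_in_range)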
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift, which is like Postgres. You just pump a JSON-formatted or pipe-delimited record into an AWS Firehose endpoint and the rest is magic by the little AWS elves.
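For illustration, pumping a record into a Firehose endpoint with boto3 looks roughly like this; the stream name and field values are placeholders, and the delivery stream would need to be configured with a Redshift (or S3) destination separately:
import json
import boto3
firehose = boto3.client('firehose')
# One GPS sample as a newline-terminated JSON record (placeholder values).
record = {'device': 'deviceid', 'ts': '2016-11-11T08:00:00Z', 'lat': 52.52, 'lon': 13.40}
firehose.put_record(
    DeliveryStreamName='gps-delivery-stream',
    Record={'Data': (json.dumps(record) + '\n').encode('utf-8')}
)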