Doubts using Amazon S3 monthly calculator - amazon-web-services

I'm using Amazon S3 to store videos and some audio files (average size of 25 MB each). Users of my web and Android apps can access them with no problem so far, but I want to know how much I'll pay once I exceed the S3 free tier, so I checked the S3 monthly calculator.
I saw that there are 5 fields:
Storage: I put 3 GB because right now there are 130 files (videos and audio)
PUT/COPY/POST/LIST Requests: I put 15 because I'll manually upload around 10-15 files each month
GET/SELECT and Other Requests: I put 10000 because a projection tells me that users will watch or listen to those files around 10000 times monthly
Data Returned by S3 Select: I put 250 GB (10000 x 25 MB)
Data Scanned by S3 Select: I don't know what to put because I don't need Amazon to scan or analyze those files.
Am I using the calculator properly?
What do I need to put in "Data Scanned by S3 Select"?
Can I just put zero?

For audio and video, you can definitely specify 0 for S3 Select -- both data scanned and data returned.
S3 Select is an optional feature that only works with certain types of text files -- like CSV and JSON -- where you make specific requests for S3 to scan through the files and return matching values, rather than you downloading the entire file and filtering it yourself.
This would not be used with audio or video files.
Also, don't overlook "Data transfer out." In addition to the "get" requests, you're billed for bandwidth when files are downloaded, so this needs to show the total size of all the downloads. This line item is data downloaded from S3 via the Internet.
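To make the line items concrete, here is a rough Python sketch of the arithmetic behind the calculator fields, using the numbers from the question. The function names and parameters are hypothetical, and the per-unit rates are deliberately left as arguments: they have to come from the current S3 pricing page for your region, and free-tier allowances and tiered pricing are ignored here.

```python
def estimate_monthly_usage(file_count, avg_file_mb, downloads_per_month):
    """Return (storage_gb, transfer_out_gb), taking 1 GB = 1000 MB."""
    storage_gb = file_count * avg_file_mb / 1000
    # Every download sends the whole file, so transfer out is downloads x size.
    transfer_out_gb = downloads_per_month * avg_file_mb / 1000
    return storage_gb, transfer_out_gb


def estimate_monthly_cost(storage_gb, transfer_out_gb, put_requests, get_requests,
                          price_per_gb_stored, price_per_gb_out,
                          price_per_1000_put, price_per_1000_get):
    """Sum the four line items; all prices are caller-supplied assumptions."""
    return (storage_gb * price_per_gb_stored
            + transfer_out_gb * price_per_gb_out
            + put_requests / 1000 * price_per_1000_put
            + get_requests / 1000 * price_per_1000_get)


if __name__ == "__main__":
    storage, transfer = estimate_monthly_usage(130, 25, 10000)
    print(f"storage: {storage} GB, data transfer out: {transfer} GB")
```

Note that the 250 GB figure from the question belongs under data transfer out, not under S3 Select, which stays at zero for audio and video.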

Related

Find total data transferred for a video in AWS CloudFront

I am using CloudFront to distribute HLS video streams. The original video files that CloudFront uses are broken down into thousands of .ts files that are stored in S3 buckets. CloudFront Reports only seem to show total bytes transferred for the top 50 .ts files. Is it possible to find the total bytes transferred from CloudFront for an entire video? I am not interested in the amount of data transferred for only a selection of .ts files. I'd like to see the total bytes transferred for the whole video folder in which those .ts files are stored.
You can find statistics under CloudFront -> Usage Reports:
CloudFront Usage Reports - Data Transferred by Destination

Amazon Macie to read database data

I am doing a POC with Amazon Macie. From the documentation I understand that it identifies PII such as credit card numbers. I also ran a test where I put some valid credit card numbers in a CSV file, uploaded it to an S3 bucket, and Macie identified them.
I want to know whether Macie can also identify the same PII if it sits inside a database backup/dump file in an S3 bucket. I didn't find anything about this in the documentation.
So a couple of things are important here
Macie can only handle certain types of files and certain compression formats
If you specify S3 buckets that include files of a format that isn't supported in Macie, Macie doesn't classify them.
Compression formats
https://docs.aws.amazon.com/macie/latest/userguide/macie-compression-archive-formats.html
Encrypted Objects
Macie can only handle certain types of encrypted Amazon S3 objects
See the following link for more details:
https://docs.aws.amazon.com/macie/latest/userguide/macie-integration.html#macie-encrypted-objects
Macie Limits
Macie has a default limit on the amount of data that it can classify in an account. After this data limit is reached, Macie stops classifying the data. The default data classification limit is 3 TB. This can be increased if requested.
Macie's content classification engine processes up to the first 20 MB of an S3 object.
So, specifically: if your dump is compressed, but the file inside the archive is in a supported format, then yes, Macie can classify it. An important caveat is that it will only classify the first 20 MB of the file, which is a problem if the file is large.
Typically I use Lambda to split a large file into files just under 20 MB. You still need to think about how, given X files, you take a record from a file that has been classified as containing PII and map it back to something usable.
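As an illustration of the splitting step, here is a minimal Python sketch that cuts a text dump into parts on line boundaries so that no record is split in half. It works on local files only; the Lambda wiring and the S3 download/upload are left out. The function name, the part-naming scheme, and the 19,000,000-byte default (a conservative margin below 20 MB) are assumptions, not anything Macie prescribes.

```python
from pathlib import Path


def split_by_lines(src_path, out_dir, max_bytes=19_000_000):
    """Split src_path into numbered part files of at most max_bytes each.

    Parts are cut only at line ends. The part index is kept in the file
    name so a finding in a part can be traced back to the source file.
    """
    src_path = Path(src_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    written = 0
    try:
        with src_path.open("rb") as src:
            for line in src:
                if len(line) > max_bytes:
                    raise ValueError("a single line exceeds max_bytes")
                # Start a new part when the current one would overflow.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    name = f"{src_path.stem}.part{len(parts):05d}{src_path.suffix}"
                    part = out_dir / name
                    out = part.open("wb")
                    parts.append(part)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Keeping the source name and a sequential index in each part name is one simple way to approach the mapping-back problem mentioned above: part number plus line offset identifies the original record.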

Preferred internet connection to transport of 4 TB CSV files to Amazon (AWS)

I need to transfer a bunch of CSV files, around 4 TB in total, to AWS.
What internet connection from my ISP would be preferable for this transfer, or does the link not play any role? My link is a dedicated 70 Mbps upload/download. Is this enough, or do I need to increase my link speed?
Thanks.
4 TB = 4,194,304 MB (taking 1 TB = 1024 x 1024 MB)
70 Mbit/s ~= 8.75 MB/s (approximate because there will be network overhead)
Dividing gives 479,349 seconds, or 5.55 days
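The same arithmetic as a small Python sketch, so other link speeds can be tried. The function name and the two optional knobs are hypothetical: `efficiency` is a guess at the fraction of the link actually usable, and `compression_ratio` is original size divided by compressed size.

```python
def transfer_time_seconds(size_tb, link_mbit_per_s, efficiency=1.0, compression_ratio=1.0):
    """Estimate upload time, taking 1 TB = 1024 * 1024 MB and 8 bits per byte."""
    size_mb = size_tb * 1024 * 1024 / compression_ratio
    mb_per_s = link_mbit_per_s / 8 * efficiency
    return size_mb / mb_per_s


if __name__ == "__main__":
    seconds = transfer_time_seconds(4, 70)
    print(f"{seconds:,.0f} seconds, or {seconds / 86400:.2f} days")
```

With the defaults this reproduces the figures above; passing, say, `compression_ratio=4` shows how much a 4:1 compression would shorten the transfer on the same link.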
Increasing your link speed will certainly improve this, but you'll probably find that you get more improvement using compression (CSV implies text with a numeric bias, which compresses extremely well).
You don't say what you'll be uploading to, nor how you'll be using the results. If you're uploading to S3, I'd suggest using GZip (or another compression format) to compress the files before uploading, and then let the consumers decompress as needed. If you're uploading to EFS, I'd create an EC2 instance to receive the files and use rsync with the -z option (which will compress over the wire but leave the files uncompressed on the destination). Of course, you may still prefer pre-compressing the files, to save on long-term storage costs.
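For the S3 case, pre-compressing could look roughly like the Python sketch below, which uses only the standard library. The function name and the `.gz` naming are assumptions; the upload itself is not shown.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src_path, level=6):
    """Write src_path + '.gz' next to the source and return the new path."""
    src_path = Path(src_path)
    dst_path = src_path.with_name(src_path.name + ".gz")
    # Stream in chunks so multi-gigabyte CSV files are never loaded into memory.
    with src_path.open("rb") as src, gzip.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Compressing each file separately, rather than building one large archive, keeps the files individually downloadable and lets consumers decompress only what they need.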

s3 vs dynamoDB for gps data

I have the following situation that I am trying to find the best solution for.
A device writes its GPS coordinates every second to a CSV file and uploads the file every x minutes to S3 before starting a new CSV.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a Lambda function to automatically save the CSV data to a DynamoDB record
Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However, both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (one per second). So if I use 10-minute chunks, this results in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so this would require 24 units for a single write. Reads (4 kB/unit) are even worse, since a user may request timeframes greater than 10 minutes. A request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This will just be too expensive considering multiple users.
When I read directly from S3, I run into the browser's limit on the number of concurrent requests. The 6-hour timespan, for instance, would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (= 144 files) would just take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you cannot control the naming, you could use a Lambda function to rename the files.)
The number of requests is still an issue, but you could try putting a CloudFront distribution in front of the bucket and using HTTP/2, which allows the browser to request multiple files over the same connection.
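To illustrate the key-naming idea, here is a small Python sketch that picks the files overlapping a requested period from a list of object keys. It assumes keys of the form deviceid_2016-11-27T160732 (optionally ending in .csv), that the timestamp marks the start of each file, and that each file covers a fixed 10 minutes. The function name and parameters are hypothetical, and the key list itself would come from an S3 listing call, ideally narrowed with a prefix such as deviceid_2016-11-11.

```python
from datetime import datetime, timedelta

KEY_TIME_FORMAT = "%Y-%m-%dT%H%M%S"  # matches e.g. 2016-11-27T160732


def keys_for_period(keys, device_id, start, end, chunk=timedelta(minutes=10)):
    """Return the sorted keys of one device whose files overlap [start, end)."""
    prefix = device_id + "_"
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        stamp = key[len(prefix):]
        if stamp.endswith(".csv"):
            stamp = stamp[:-4]
        try:
            file_start = datetime.strptime(stamp, KEY_TIME_FORMAT)
        except ValueError:
            continue  # not a data file in the expected naming scheme
        # Keep the file if its time window [file_start, file_start + chunk)
        # intersects the requested period.
        if file_start < end and file_start + chunk > start:
            selected.append(key)
    return sorted(selected)
```

Because the timestamps are zero-padded and ordered from year down to second, sorting the keys as strings also sorts them chronologically.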
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift which is like Postgres. You just pump a JSON formatted or a | delimited record into an AWS Firehose end-point and the rest is magic by the little AWS elves.

s3zipper is limited to downloading only 1000 files from an S3 bucket

I am using s3zipper along with PHP to stream S3 files as a zip. However, there is one issue: we have more than 1000 files to download (roughly 2K to 10K, varying). When we send s3zipper a request for, say, 1500 files, we get only 1000 files in the zip.
As per the AWS docs, there is a 1000-key limitation, i.e.
S3 API version 2 implementation of the GET operation returns some or all (up to 1,000) of the objects in a bucket.
So if we want to get more than that, we have to use the marker parameter. But in s3zipper.go, the call aws_bucket.GetReader(file.S3Path) reads the file and adds it to the zip. I am not sure how I can use the marker in this case.
I am curious how we can overcome this limitation. I am a newbie to the Go language; any help in this regard will be highly appreciated.
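For reference, the marker mechanism mentioned above amounts to a loop like the following. This is a language-neutral sketch written in Python rather than Go, and it is not s3zipper's code: `list_page` is a hypothetical callable standing in for whatever listing call the S3 client provides, assumed to return one page of keys (sorted, at most 1000) that come after `marker`, plus a flag saying whether more remain.

```python
def list_all_keys(list_page):
    """Collect every key by following the marker from page to page.

    list_page(marker) must return (keys, is_truncated), where keys are the
    next keys strictly after marker in sorted order.
    """
    keys = []
    marker = ""  # an empty marker means "start from the beginning"
    while True:
        page, is_truncated = list_page(marker)
        keys.extend(page)
        if not is_truncated or not page:
            break
        # The last key of this page is where the next page resumes.
        marker = page[-1]
    return keys
```

The limit applies to listing keys, not to reading individual objects, so a call that fetches a single object by its path should not need a marker.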