Large Number of Requests While uploading a single file in AWS Glacier - amazon-web-services

I am new to AWS Glacier. I am trying to store some of my data in Amazon Glacier. I uploaded a single file (archive) to a vault, and when I checked the Billing Dashboard it showed 20 or more requests.
I don't understand what is happening. Why would uploading a single file generate that many requests? Does the number of requests change according to the size of the file?

Yes, the number of requests changes with the size of the file. The Glacier API supports multipart uploads (with or without parallelism), and it requires the uploading code to specify a part size constrained within certain boundaries... so usually, your uploaded file will be sent in multiple parts.
When you initiate a multipart upload, you specify the part size in number of bytes. The part size must be a megabyte (1024 KB) multiplied by a power of 2—for example, 1048576 (1 MB), 2097152 (2 MB), 4194304 (4 MB), 8388608 (8 MB), and so on. The minimum allowable part size is 1 MB, and the maximum is 4 GB.
https://docs.aws.amazon.com/amazonglacier/latest/dev/api-multipart-initiate-upload.html
A complete API cycle for one small object requires at least 3 requests -- initiate, (one) part, and complete. Larger objects will require multiple "part" uploads, depending on the part size. The SDK you are using may be choosing some default values here.
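As a rough illustration of that cycle, here is a sketch of the request arithmetic: one initiate call, one call per part, and one complete call. The 4 MB part size is an assumption for the example, not necessarily what your SDK chose.

```python
import math

# One InitiateMultipartUpload + one UploadMultipartPart per part
# + one CompleteMultipartUpload. Part size is an assumed example value.
def glacier_request_count(archive_size_bytes, part_size_bytes=4 * 1024 * 1024):
    parts = max(1, math.ceil(archive_size_bytes / part_size_bytes))
    return 1 + parts + 1  # initiate + parts + complete

# e.g. a 70 MB archive with 4 MB parts -> 18 part uploads -> 20 requests,
# in the same ballpark as the ~20 requests seen on the billing dashboard.
print(glacier_request_count(70 * 1024 * 1024))  # 20
```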

Related

Issue with reading millions of files from cloud storage using dataflow in Google cloud

Scenario: I am trying to read files and send the data to Pub/Sub
Millions of files are stored in a Cloud Storage folder (GCP)
I have created a Dataflow pipeline using the template "Text Files on Cloud Storage to Pub/Sub" to publish the file contents to a Pub/Sub topic
But the above template was not able to read millions of files, and it failed with the following error
java.lang.IllegalArgumentException: Total size of the BoundedSource objects generated by split() operation is larger than the allowable limit. When splitting gs://filelocation/data/*.json into bundles of 28401539859 bytes it generated 2397802 BoundedSource objects with total serialized size of 199603686 bytes which is larger than the limit 20971520.
System configuration:
Apache beam: 2.38 Java SDK
Machine: High performance n1-highmem-16
Any idea on how to solve this issue? Thanks in advance
According to this document (1) you can work around this by modifying your custom BoundedSource subclass so that the generated BoundedSource objects become smaller than the 20 MB limit.
(1) https://cloud.google.com/dataflow/docs/guides/common-errors#boundedsource-objects-splitintobundles
You can also use TextIO.readAll() to avoid these limitations.
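For intuition, a back-of-the-envelope check of the numbers in the error message shows why the split fails: even at under a hundred serialized bytes per source, millions of BoundedSource objects blow well past the 20 MiB cap.

```python
# Figures taken directly from the error message above.
num_sources = 2_397_802
total_serialized_bytes = 199_603_686
limit_bytes = 20_971_520  # the 20 MiB limit

avg_bytes_per_source = total_serialized_bytes / num_sources
print(round(avg_bytes_per_source))           # ~83 bytes per source
print(total_serialized_bytes > limit_bytes)  # True: roughly 9.5x over the limit
```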

AWS S3 files not uploaded properly if total files size exceed 50GB

For my web application I am uploading all the files in a selected directory. If the total size of the directory's files is less than 50 GB, all the files are uploaded correctly, but beyond that, some of the uploaded files' sizes do not match the actual file sizes (they are smaller than the actual files).
I am using AWS JavaScript SDK for this.
Any help/input appreciated.
Thanks!
If your single PUT operation exceeds the 5 GB size limit, you may observe such inconsistencies.
What AWS says:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes.
For PUT operations larger than 5 GB, consider using multipart upload.
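As a sketch of the multipart constraints involved (10,000 parts maximum, each between 5 MiB and 5 GiB, with the last part allowed to be smaller), this helper picks the smallest part size that keeps an object within the part-count limit:

```python
import math

MIN_PART = 5 * 1024 * 1024  # 5 MiB minimum part size
MAX_PARTS = 10_000          # maximum number of parts per upload

def min_part_size(object_size_bytes):
    # Smallest part size that keeps the upload within 10,000 parts.
    return max(MIN_PART, math.ceil(object_size_bytes / MAX_PARTS))

def part_count(object_size_bytes, part_size_bytes):
    return math.ceil(object_size_bytes / part_size_bytes)

# A 50 GB object fits comfortably with the minimum 5 MiB parts:
size = 50 * 10**9
print(min_part_size(size))               # 5242880 (5 MiB)
print(part_count(size, min_part_size(size)))  # 9537 parts
```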

Doubts using Amazon S3 monthly calculator

I'm using Amazon S3 to store videos and some audios (average size of 25 mb each) and users of my web and android app (so far) can access them with no problem but I want to know how much I'll pay later exceeding the free stage of S3 so I checked the S3 monthly calculator.
I saw that there are 5 fields:
Storage: I put 3 GB because right now there are 130 files (videos and audios)
PUT/COPY/POST/LIST Requests: I put 15 because I'll upload around 10-15 files manually each month
GET/SELECT and Other Requests: I put 10,000 because a projection tells me that users will watch/listen to those files around 10,000 times monthly
Data Returned by S3 Select: I put 250 GB (10,000 x 25 MB)
Data Scanned by S3 Select: I don't know what to put because I don't need Amazon to scan or analyze those files.
Am I using that calculator in a proper way?
What do I need to put in "Data Scanned by S3 Select"?
Can I put only zero?
For audio and video, you can definitely specify 0 for S3 Select -- both data scanned and data returned.
S3 Select is an optional feature that only works with certain types of text files -- like CSV and JSON -- where you make specific requests for S3 to scan through the files and return matching values, rather than you downloading the entire file and filtering it yourself.
This would not be used with audio or video files.
Also, don't overlook "Data transfer out." In addition to the "get" requests, you're billed for bandwidth when files are downloaded, so this needs to show the total size of all the downloads. This line item is data downloaded from S3 via the Internet.
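Using the asker's own projections (10,000 downloads per month at about 25 MB each), a quick sketch of the "Data Transfer Out" volume -- this is only the volume estimate, not a price:

```python
# Decimal units, as the S3 calculator uses GB = 10^9 bytes.
downloads_per_month = 10_000
avg_file_mb = 25

transfer_out_gb = downloads_per_month * avg_file_mb / 1000
print(transfer_out_gb)  # 250.0 GB of data transfer out per month
```

Note that this is the same 250 GB figure the asker mistakenly entered under "Data Returned by S3 Select"; it belongs in "Data Transfer Out" instead.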

What is the Maximum file size for using multipart upload in s3?

Does anyone have an idea what the maximum file size is for a multipart upload in S3?
When I tried to upload a 10 GB file, it got stuck. There is no error message in the log.
Thanks in advance
The maximum size of an object you can store in an S3 bucket is 5 TB, so the maximum size of a file uploaded via multipart upload is also 5 TB.
Using the multipart upload API, you can upload large objects, up to 5 TB.
The multipart upload API is designed to improve the upload experience for larger objects. You can upload objects in parts. These object parts can be uploaded independently, in any order, and in parallel. You can use a multipart upload for objects from 5 MB to 5 TB in size.
Official documentation- http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
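The arithmetic behind that ceiling, as a quick sketch: multipart uploads cap out at 10,000 parts of at most 5 GiB each, so 5 TiB fits, but a full-size object needs parts of at least roughly 524 MiB.

```python
import math

MAX_PARTS = 10_000
MAX_PART = 5 * 1024**3   # 5 GiB maximum part size
five_tib = 5 * 1024**4   # 5 TiB

print(MAX_PARTS * MAX_PART >= five_tib)            # True: 5 TiB fits
print(math.ceil(five_tib / MAX_PARTS) // 1024**2)  # minimum part size, in MiB
```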

s3 vs dynamoDB for gps data

I have the following situation that I try to find the best solution for.
A device writes its GPS coordinates every second to a csv file and uploads the file every x minutes to s3 before starting a new csv.
Later I want to be able to get the GPS data for a specific time period e.g 2016-11-11 8am until 2016-11-11 2pm
Here are two solutions that I am currently considering:
Use a lambda function to automatically save the csv data to a dynamoDB record
Only save the metadata (csv gps timestamp-start, timestamp-end, s3Filename) in dynamoDB and then request the files directly from s3.
However both solutions seem to have a major drawback:
The gps data uses about 40 bytes per record (second). So if I use 10min chunks this will result in a 24 kB file. dynamoDB charges write capacities by item size (1 write capacity unit = 1 kB). So this would require 24 units for a single write. Reads (4kB/unit) are even worse since a user may request timeframes greater than 10 min. So for a request covering e.g. 6 hours (=864kB) it would require a read capacity of 216. This will just be too expensive considering multiple users.
When I read directly from S3 I face the browser limiting the number of concurrent requests. The 6 hour timespan for instance would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (=144 files) would just take too long.
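The capacity-unit figures above can be checked with a quick sketch (decimal kB, as used in the question; DynamoDB bills writes per 1 kB of item size and strongly consistent reads per 4 kB):

```python
import math

record_bytes = 40                                # one GPS record per second
chunk_kb = record_bytes * 10 * 60 / 1000         # 10-minute chunk -> 24.0 kB
write_units = math.ceil(chunk_kb)                # 24 WCU per chunk written

six_hours_kb = record_bytes * 6 * 3600 / 1000    # 864.0 kB
read_units = math.ceil(six_hours_kb / 4)         # 216 RCU for a 6-hour query

print(write_units, read_units)  # 24 216
```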
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
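A sketch of how that key scheme makes range queries possible (the exact key format here is an assumption based on the example above): generate one prefix per hour in the requested window, then issue one prefixed listing call per prefix.

```python
from datetime import datetime, timedelta

def hourly_prefixes(device_id, start, end):
    # One key prefix per hour, e.g. "dev1_2016-11-11T08".
    prefixes = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        prefixes.append(f"{device_id}_{t:%Y-%m-%dT%H}")
        t += timedelta(hours=1)
    return prefixes

# Each prefix would become e.g. s3.list_objects_v2(Bucket=..., Prefix=prefix)
print(hourly_prefixes("dev1", datetime(2016, 11, 11, 8), datetime(2016, 11, 11, 14)))
```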
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift, which is Postgres-like. You just pump a JSON-formatted or a |-delimited record into an AWS Firehose endpoint and the rest is magic by the little AWS elves.