Maximum number of connections that can be held by S3 - amazon-web-services

I am learning about Amazon Web Services. I just want to know, roughly, the maximum number of simultaneous connections that Amazon S3 can hold without crashing.

Theoretically this is infinite. To achieve this, S3 uses a partitioning scheme that is explained here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Basically, S3 partitions your bucket across different servers based on the first few characters of the object key. If those characters are random, you scale indefinitely (S3 just takes more characters to partition on). If you prepend all keys with file_ or something similar, S3 cannot partition the objects well because all keys start with the same characters, and the limit is about 300 GET requests per second or 100 PUT/DELETE/POST requests per second.
See that page for an in-depth explanation.
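A minimal sketch of the key-naming idea described above, assuming you control the key names. The helper name hashed_key and the 4-character prefix length are hypothetical choices, not something the AWS page prescribes:

```python
import hashlib


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex digest so that keys no longer share leading characters."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{key}"


# Keys that would all start with "file_" now start with varied hex characters.
for i in range(3):
    print(hashed_key(f"file_{i}.txt"))
```

The digest is derived from the key itself, so the full key can be recomputed from the original name whenever the object needs to be read back.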

According to the AWS documentation, you can receive HTTP 503 Slow Down responses above 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix.
The limit was increased in July 2018.
More information:
https://aws.amazon.com/en/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
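A common way to cope with 503 Slow Down responses is to retry with exponential backoff. The sketch below is generic Python and not tied to any AWS SDK: the SlowDown exception is a hypothetical stand-in for however your client reports a 503, and the delay values are arbitrary assumptions.

```python
import random
import time


class SlowDown(Exception):
    """Hypothetical stand-in for an HTTP 503 Slow Down response."""


def with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call request(); on SlowDown, wait a random, exponentially growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except SlowDown:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller see the error
            # "Full jitter": pick a delay between 0 and the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))
```

Passing sleep as a parameter is only there so the waiting can be replaced in tests. Many SDKs already ship their own retry logic, so check that before adding another layer.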

Related

Can I put each s3 object into its own prefix to maximize throughput

S3's throughput limits are per-prefix, not per-object:
your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket
AWS Docs
It seems to follow that if I place each of my s3 objects in a prefix (by itself), I'll effectively be able to have the above throughput per-object.
However I'm a bit suspicious of this conclusion. If it is true, why would someone consider s3 write sharding which is a bit more complicated?
To clarify, what I'm considering is this: whenever I'm about to save an object in S3 (e.g. foo/bar/baz.txt), I add a folder so that the file has its own prefix (e.g. foo/bar/baz.txt/baz.txt). Now I can have 5,500 reads per second on the object baz.txt (without the extra prefix, those 5,500 reads per second would be shared across all objects in foo/bar/).

Max file count using big query data transfer job

I have about 54,000 files in my GCP bucket. When I try to schedule a BigQuery data transfer job to move the files from the GCP bucket to BigQuery, I get the following error:
Error code 9 : Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 267065994 B (0.00 TB) ; file count = 54824.
I thought the max file count was 10 million.
I think that the BigQuery transfer service lists all the files matching the wildcard and then uses that list to load them. So it is the same as providing the full list to bq load ..., and you therefore reach the 10,000 URI limit.
This is probably necessary because the BigQuery transfer service skips already loaded files, so it needs to look at them one by one to decide which to actually load.
I think that your only option is to schedule a job yourself and load the files directly into BigQuery, for example using Cloud Composer or by writing a small Cloud Run service that is invoked by Cloud Scheduler.
The error message Transfer Run limits exceeded, as mentioned before, is related to a known limit for load jobs in BigQuery. Unfortunately, this is a hard limit and cannot be changed. There is an open feature request to increase this limit, but for now there is no ETA for its implementation.
The main recommendation for this issue is to split the single operation into multiple processes that send data in requests that do not exceed this limit. This covers the main question: "Why do I see this error message and how do I avoid it?"
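As a rough illustration of the splitting step, here is a minimal sketch that cuts a list of file URIs into batches that each stay under the 10,000 file limit from the error message. The bucket name and file names are made up, and submitting each batch as its own load job is left out:

```python
from typing import Iterator, List

MAX_URIS_PER_JOB = 10_000  # "Max file count" quoted in the error message


def batches(uris: List[str], size: int = MAX_URIS_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive slices of uris, each with at most size entries."""
    for start in range(0, len(uris), size):
        yield uris[start:start + size]


# Hypothetical URIs matching the file count reported in the question.
uris = [f"gs://my-bucket/data/file_{i}.csv" for i in range(54_824)]
jobs = list(batches(uris))
print(len(jobs))  # 6 batches: five of 10,000 files and one of 4,824
```

Each batch can then be handed to a separate load job, so no single job sees more than 10,000 files.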
It is natural to ask next: "How can I automate these actions or perform them more easily?" I can think of involving more products:
Dataflow, which will help you process the data that will be added to BigQuery. This is where you can send multiple requests.
Pub/Sub, which will help you listen for events and automate when the processing starts.
Please take a look at this suggested implementation, where the aforementioned scenario is described in more detail.
Hope this is helpful! :)

S3 objects without prefix performance

I'm trying to find out whether storing objects with randomized keys and no "prefix" will give me the S3 maximum performance of 5,500 GET/sec per object, or whether, since I don't have a prefix, all those objects fall into a "no-prefix" category and share the 5,500 limit.
Example: The following objects are stored directly in a bucket
njfoia74G.obj
njfoia74G.obj
njfoia74G.obj
Will I get 5,500 GET/sec for each object, or do they share that?
The S3 documentation suggests that keys are not part of the prefix, so I am not sure how to calculate throughput for those objects.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-keys
Has anyone done a benchmark or have documentation that can answer this?
From Request Rate and Performance Guidelines - Amazon Simple Storage Service:
Your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket.
The root of a bucket is effectively an empty prefix, so all objects in the root would share the limit.
By the way very few systems would approach anywhere near these volumes. If you have millions of users (causing over 10 million requests per hour), then definitely implement some of the recommended techniques. But the vast majority of sites will never need to worry about it.

s3 vs dynamoDB for gps data

I have the following situation, for which I am trying to find the best solution.
A device writes its GPS coordinates every second to a CSV file and uploads the file every x minutes to S3 before starting a new CSV file.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a Lambda function to automatically save the CSV data to a DynamoDB record
Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (one record per second). So if I use 10-minute chunks, this results in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so this would require 24 units for a single write. Reads (4 kB/unit) are even worse, since a user may request timeframes greater than 10 minutes. A request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This will just be too expensive considering multiple users.
When I read directly from S3, I face the browser limiting the number of concurrent requests. The 6-hour timespan, for instance, would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (= 144 files) would just take too long.
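The numbers in the two drawbacks can be reproduced with a few lines of arithmetic. This sketch follows the question's own assumptions (40 bytes per record, 10-minute chunks, 1 kB taken as 1,000 bytes); actual DynamoDB billing may round sizes differently.

```python
BYTES_PER_RECORD = 40   # one GPS record per second, as stated in the question
CHUNK_MINUTES = 10
KB = 1000               # assumption: the question treats 1 kB as 1,000 bytes


def chunk_size_bytes() -> int:
    """Size of one uploaded CSV chunk."""
    return CHUNK_MINUTES * 60 * BYTES_PER_RECORD


def files_for(hours: int) -> int:
    """Number of chunks covering the requested timespan."""
    return hours * 60 // CHUNK_MINUTES


def write_units_per_chunk() -> int:
    """Write capacity units for one chunk at 1 kB per unit, rounded up."""
    return -(-chunk_size_bytes() // KB)


def read_units_for(hours: int) -> int:
    """Read capacity units for the timespan at 4 kB per unit, rounded up."""
    total = files_for(hours) * chunk_size_bytes()
    return -(-total // (4 * KB))


print(chunk_size_bytes(), write_units_per_chunk())  # 24000 bytes, 24 units
print(files_for(6), read_units_for(6))              # 36 files, 216 units
print(files_for(24))                                # 144 files
```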
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you cannot control the naming, you could use a Lambda function to rename the files.)
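To illustrate why a sortable timestamp in the key helps, here is a minimal sketch that selects keys for a time range using plain string comparison. It assumes the key format from the example above (deviceid_YYYY-MM-DDTHHMMSS) and works on a list of keys you have already obtained from the S3 listing API; the function names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List

KEY_TIME_FORMAT = "%Y-%m-%dT%H%M%S"  # e.g. 2016-11-27T160732


def key_for(device_id: str, ts: datetime) -> str:
    """Build a key such as deviceid_2016-11-27T160732."""
    return f"{device_id}_{ts.strftime(KEY_TIME_FORMAT)}"


def keys_in_range(keys: Iterable[str], device_id: str,
                  start: datetime, end: datetime) -> List[str]:
    """Return this device's keys whose timestamp lies in [start, end], sorted."""
    lo, hi = key_for(device_id, start), key_for(device_id, end)
    # Fixed-width ISO-style timestamps sort the same way as the times they encode.
    return sorted(k for k in keys
                  if k.startswith(device_id + "_") and lo <= k <= hi)
```

If each key carries the start time of its chunk, you would likely want to move start back by one chunk length so that a file that began just before the requested period is not missed.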
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift which is like Postgres. You just pump a JSON formatted or a | delimited record into an AWS Firehose end-point and the rest is magic by the little AWS elves.

Kinesis ProvisionedThroughputExceededException even after sufficient shards

We are facing a ProvisionedThroughputExceededException while writing data to a Kinesis stream.
Case 1:
We used a single m4.4xlarge (16 cores, 64 GB memory) instance to write data to the stream, passing 3k requests from JMeter. The EC2 instance gave us 1,100 requests per second, so we chose a 2-shard stream (i.e. 2,000 eps).
As a result, we were able to write data to the stream successfully without any loss.
Case 2:
For further testing, we created a cluster of 10 EC2 m4.4xlarge (16 cores, 64 GB memory) instances and an 11-shard stream (based on a simple calculation of 1,000 eps per shard, so 10 shards + 1 spare).
When we tested that EC2 cluster with different request volumes from JMeter, such as 3, 10, and 30 million, we received ProvisionedThroughputExceededException errors in our log file.
On the JMeter side, the EC2 cluster gives us 7,500 eps, and I believe that at 7,500 eps a stream with 11,000 eps capacity should not return such an error.
Could you help me understand the reason behind this issue?
It sounds like Kinesis is not hashing/distributing your data evenly across your shards - some are "hot" (getting the ProvisionedThroughputExceededException), while others are "cold".
To solve this, I recommend:
Use the ExplicitHashKey parameter in order to have control over which shards your data goes to. The PutRecords documentation has some basic info on this (but not as much as it should).
Also, make sure that your shards are evenly split across the hash space (appropriate starting/ending hash key).
The simplest pattern is just to have a single pre-defined ExplicitHashKey for each shard, and have your PutRecords logic just iterate through it for each record - perfectly even distribution. In any case, make sure your record hashing algorithm will distribute records evenly across the shards.
Another alternative/extension based on using ExplicitHashKey is to have a subset of your hashspace dedicated to "overflow" shard(s) - in your case, 1 specific ExplicitHashKey value mapped to one shard - when you start being throttled on your normal shards, send the records there for retry.
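The "one pre-defined ExplicitHashKey per shard" pattern could look roughly like the sketch below. It assumes the shards split the 128-bit hash space evenly, so the midpoint of each equal slice lands in a different shard. Check the real starting/ending hash keys of your shards before relying on this. The records are only built here; sending them with PutRecords is left out.

```python
from itertools import cycle
from typing import List

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers


def explicit_hash_keys(shard_count: int) -> List[str]:
    """One hash key per shard: the midpoint of each equal-width slice (assumes even shards)."""
    width = HASH_SPACE // shard_count
    return [str(i * width + width // 2) for i in range(shard_count)]


shard_keys = explicit_hash_keys(11)
next_key = cycle(shard_keys)  # round-robin over the per-shard keys

# Entries shaped like PutRecords request entries; PartitionKey is still required,
# but ExplicitHashKey decides which shard the record goes to.
records = [
    {"Data": f"event-{i}".encode("utf-8"),
     "PartitionKey": "unused",
     "ExplicitHashKey": next(next_key)}
    for i in range(22)
]
```

With 22 records and 11 keys, every key is used exactly twice, which is the even distribution described above.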
Check your producer side: are you sure you are inserting data into different shards? The "PartitionKey" value in the PutRecordRequest call may help you.
I think you need to pass different partition keys for your records to spread the data across different shards.
Even if you have created multiple shards, if all of your records use the same partition key then you are still writing to a single shard, because they all have the same hash value. Read more here: PartitionKey
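A small simulation of this effect: Kinesis maps a partition key to a 128-bit integer with MD5, and the shard whose hash key range contains that integer receives the record. The sketch assumes 11 shards with evenly split ranges; shard_index is a hypothetical helper, not an AWS API.

```python
import hashlib

HASH_SPACE = 2 ** 128


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit integer."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def shard_index(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose (assumed even) hash range contains the key."""
    width = HASH_SPACE // shard_count
    return min(hash_key(partition_key) // width, shard_count - 1)


same = {shard_index("constant-key", 11) for _ in range(1000)}
varied = {shard_index(f"device-{i}", 11) for i in range(1000)}
print(len(same), len(varied))  # one shard for the constant key, many for varied keys
```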