Kinesis ProvisionedThroughputExceededException even after sufficient shards - amazon-web-services

We are facing a ProvisionedThroughputExceededException while writing data to a Kinesis stream.
Case 1:
We used a single m4.4xlarge (16 core, 64 GB mem) instance to write data to the stream, pushing 3k requests from JMeter. The EC2 instance gave us about 1,100 requests per second, so we chose a 2-shard stream (i.e. 2,000 eps).
As a result we were able to write data to the stream successfully without any loss.
Case 2:
For further testing we created a cluster of 10 EC2 m4.4xlarge (16 core, 64 GB mem) instances and an 11-shard stream (based on the simple calculation of 1,000 eps per shard: 10 shards + 1 as a buffer).
When we tested that EC2 cluster with different request volumes from JMeter (3, 10, and 30 million requests), we received ProvisionedThroughputExceededException errors in our log file.
On the JMeter side the EC2 cluster gave us 7,500 eps, and I believe that at 7,500 eps a stream with 11,000 eps capacity should not return such an error.
Could you help me understand the reason behind this issue?

It sounds like Kinesis is not hashing/distributing your data evenly across your shards - some are "hot" (getting the ProvisionedThroughputExceededException), while others are "cold".
To solve this, I recommend the following:
Use the ExplicitHashKey parameter in order to have control over which shards your data goes to. The PutRecords documentation has some basic info on this (but not as much as it should).
Also, make sure that your shards are evenly split across the hash space (appropriate starting/ending hash key).
The simplest pattern is just to have a single pre-defined ExplicitHashKey for each shard and have your PutRecords logic iterate through them, one per record - perfectly even distribution. In any case, make sure your record hashing algorithm will distribute records evenly across the shards.
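A minimal sketch of that pattern with boto3 (the stream name and payloads are placeholders, not from the question): each shard's starting hash key is used as its ExplicitHashKey, and records are cycled across them.

```python
import itertools
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder stream name

# One ExplicitHashKey per shard: a shard's own starting hash key
# always falls inside that shard's hash range.
shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
hash_keys = [s["HashKeyRange"]["StartingHashKey"] for s in shards]
key_cycle = itertools.cycle(hash_keys)

def put_batch(records):
    """Send up to 500 records, spreading them evenly across all shards."""
    entries = [
        {
            "Data": json.dumps(r).encode("utf-8"),
            "PartitionKey": "ignored-when-explicit-key-set",  # still required by the API
            "ExplicitHashKey": next(key_cycle),
        }
        for r in records
    ]
    resp = kinesis.put_records(StreamName=STREAM, Records=entries)
    # Throttled records come back with an ErrorCode; return them for retry.
    return [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
```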
Another alternative/extension based on ExplicitHashKey is to dedicate a subset of your hash space to "overflow" shard(s) - in your case, one specific ExplicitHashKey value mapped to one shard. When you start being throttled on your normal shards, send the records there for retry.

Check your producer side: are you sure you are inserting data into different shards? The "PartitionKey" value in your PutRecordRequest call may help you.

I think you need to pass different "Partition Keys" for your records to spread the data across different shards.
Even if you have created multiple shards, if all of your records use the same partition key then you are still writing to a single shard, because they will all have the same hash value. Check out the PartitionKey documentation for more detail.
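To illustrate (a sketch with boto3; the stream name is a placeholder): a constant partition key pins every record to one shard, while a unique key per record lets the MD5 hash spread writes across shards.

```python
import uuid
import boto3

kinesis = boto3.client("kinesis")

# Bad: every record hashes to the same shard.
kinesis.put_record(StreamName="my-stream", Data=b"payload",
                   PartitionKey="constant-key")

# Better: a unique key per record distributes the load across all shards.
kinesis.put_record(StreamName="my-stream", Data=b"payload",
                   PartitionKey=str(uuid.uuid4()))
```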

Related

Difference in default partitioning by instance type

My understanding was that Spark will choose the 'default' number of partitions based solely on the size of the file or, if it is a union of many parquet files, the number of parts.
However, when reading in a set of large parquet files, I see that the default number of partitions for an EMR cluster with a single d2.2xlarge is ~1200. However, on a cluster of 2 r3.8xlarge I'm getting default partitions of ~4700.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
2X number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
Looks like it matches non EMR/AWS Spark as well
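To see which value your cluster is actually using, a quick PySpark check (a sketch; the S3 path is a placeholder and the printed figures depend entirely on your cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# On EMR, spark.default.parallelism defaults to 2x the CPU cores
# available to YARN containers unless overridden.
print(spark.sparkContext.defaultParallelism)
print(spark.conf.get("spark.default.parallelism", "not set"))

# Number of partitions actually produced when reading the parquet files.
df = spark.read.parquet("s3://my-bucket/path/")  # placeholder path
print(df.rdd.getNumPartitions())
```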
I think there was some transient issue, because I restarted the EMR cluster with the d2.2xlarge and it gave me the number of partitions I expected, which matched the r3.8xlarge and the number of files on S3.
If anyone knows why this kind of thing happens, though, I'll gladly mark yours as the answer.

AWS - Aurora replicas

Scenario:
I have two reader-aurora replicas.
I make many calls to my system (high load)
I see only one replica working at 99.30%, while the other one is not doing anything at all.
Why? Is it because this second replica is ONLY there to cover failures of the first one? Is it not possible to make both share the load?
In your RDS console, you should be able to look at each of the 3 instances
aurora-databasecluster-xxx.cluster-yyy.us-east-1.rds.amazonaws.com:3306
zz0.yyy.us-east-1.rds.amazonaws.com:3306
zz1.yyy.us-east-1.rds.amazonaws.com:3306
If you look at the cluster tab you will see two endpoints; the second one is the following:
aurora-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
Aurora also allows you to connect explicitly to a specific read replica. This would let you keep one set of read-only nodes for OLTP performance and another set for data analysis, with long-running queries that won't impact performance.
If you use the -ro endpoint, it should balance across all read-only nodes, or you can have your code take a list of read-only connection strings and do your own randomization. I would have expected the -ro endpoint to do better... but I am not yet familiar with their load-balancing technique (fewest connections, round robin, etc.).
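A rough sketch of the do-it-yourself option (the endpoints are the placeholders from above; pymysql is just one possible driver and the credentials are illustrative):

```python
import random
import pymysql

# Instance endpoints of the read replicas (placeholders from the answer above).
READER_ENDPOINTS = [
    "zz0.yyy.us-east-1.rds.amazonaws.com",
    "zz1.yyy.us-east-1.rds.amazonaws.com",
]

def reader_connection():
    """Pick a replica at random instead of relying on the -ro endpoint."""
    host = random.choice(READER_ENDPOINTS)
    return pymysql.connect(host=host, port=3306,
                           user="app_ro", password="...",  # illustrative credentials
                           database="mydb")
```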

s3 vs dynamoDB for gps data

I have the following situation that I am trying to find the best solution for.
A device writes its GPS coordinates to a CSV file every second and uploads the file to S3 every x minutes before starting a new CSV.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a Lambda function to automatically save the CSV data to a DynamoDB record.
Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However, both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (one record per second), so 10-minute chunks result in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so a single write would require 24 units. Reads (4 kB per unit) are even worse, since a user may request timeframes greater than 10 minutes: a request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This is just too expensive once there are multiple users.
When I read directly from S3 I run into the browser's limit on concurrent requests. The 6-hour timespan, for instance, would cover 36 files. That might still be acceptable given a connection limit of 6, but a request for 24 hours (= 144 files) would just take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
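A minimal sketch with boto3 (bucket name and key layout are assumptions based on the deviceid_ISO-timestamp naming above): listing by prefix narrows the result to one device and day, and the returned keys can then be filtered to the exact time window.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "gps-data"  # placeholder bucket

def keys_for_window(device_id, start_iso, end_iso):
    """Return keys whose timestamp falls inside [start_iso, end_iso].

    Assumes keys like "device42_2016-11-11T080000" and a window inside a
    single day; loop over dates for longer windows.
    """
    prefix = f"{device_id}_{start_iso[:10]}"  # e.g. "device42_2016-11-11"
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            ts = obj["Key"].split("_", 1)[1]
            if start_iso <= ts <= end_iso:  # lexical compare works for ISO timestamps
                keys.append(obj["Key"])
    return keys

# e.g. keys_for_window("device42", "2016-11-11T080000", "2016-11-11T140000")
```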
The number of requests is an issue, but you could try to put a CloudFront distribution in front of it and use HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift, which is like Postgres. You just pump a JSON-formatted or pipe-delimited record into an AWS Firehose endpoint and the rest is magic by the little AWS elves.
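For reference, sending a record to Firehose is a single call (a sketch; the delivery stream name and fields are placeholders, and the delivery to S3/Redshift is configured on the stream itself):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Illustrative GPS record; Firehose buffers and delivers it downstream.
record = {"device_id": "device42", "ts": "2016-11-11T08:00:01", "lat": 52.52, "lon": 13.40}
firehose.put_record(
    DeliveryStreamName="gps-delivery-stream",  # placeholder
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```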

Maximum no.of connections that can be held by s3

I am learning about Amazon Web Services. I just want to know, roughly, the maximum number of simultaneous connections Amazon S3 can hold without crashing...
Theoretically this is infinite. To achieve this, they use a partitioning scheme they explain here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Basically, they partition your buckets across different servers based on the first few characters of the object key. If those characters are random, you scale indefinitely (they just take more characters to partition on). If you prefix every key with file_ or something similar (so S3 cannot partition the objects properly, because they all start with the same characters), the limit is about 300 GET/sec or 100 PUT/DELETE/POST per second.
See that page for an in-depth explanation.
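A sketch of the key-naming trick that answer describes (bucket and key are illustrative); note that the July 2018 limit increase mentioned in the next answer makes this largely unnecessary today:

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def put_with_hash_prefix(bucket, key, body):
    """Prepend a short hash so object keys do not all share the same prefix."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:4]
    s3.put_object(Bucket=bucket, Key=f"{prefix}/{key}", Body=body)

put_with_hash_prefix("my-bucket", "file_000123.csv", b"...")
```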
According to the AWS documentation, you will receive HTTP 503 Slow Down above 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix.
These limits were increased in July 2018.
More information :
https://aws.amazon.com/en/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Consuming/producing data to particular shardID in amazon Kinesis

I need to put all the records into Kinesis from various servers and output the data into multiple S3 files. I have been trying with ShardID, but I am not able to make it work.
Could you please help?
Python/Java would be fine.
The ShardID is not that important.
If you have 20 MB/sec of input bandwidth at a rate of 20,000 requests per second, you should have at least 20 shards.
Your data will be spread across those shards, so it is just about capacity; the shards do not affect your input and output results. (They also affect parallelization via the hash/partition key, but that is another topic and I won't go into it here to avoid confusion.)
You should be concerned with the "put_record" or "put_records" methods on the producer (i.e. input) side, and with the records emitted (i.e. output) on the consumer side. You should not worry about which shard a record passed through; you just take the record on the consumer side and process it according to your business needs, as in the sketch below.
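A minimal consumer sketch with boto3 along those lines (the stream name and the process stub are placeholders): it reads whatever comes out of every shard without caring which shard carried the record.

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder stream name

def process(data):
    """Placeholder: in practice, buffer records and flush them to S3."""
    print(data)

# Read from every shard; which shard a record travelled through does not
# matter, the consumer simply processes whatever comes out.
shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
for shard in shards:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        process(record["Data"])
    # A real consumer keeps polling batch["NextShardIterator"]; the Kinesis
    # Client Library handles that loop (and checkpointing) for you.
```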
Using the Kinesis Client Library (https://github.com/awslabs/amazon-kinesis-client) is the best way to get this abstraction.
There is also a sample project on GitHub, Amazon Kinesis Connectors (https://github.com/awslabs/amazon-kinesis-connectors), that consumes data from a stream and uploads it to S3.