Consuming/producing data to a particular shard ID in Amazon Kinesis - amazon-web-services

I need to put all the records into Kinesis from various servers and then output the data into multiple S3 files. I have been trying to do this with the ShardID, but I have not been able to make it work.
Could you please help?
Python or Java would be fine.

The ShardID is not that important here.
If you have 20 MB/sec of input bandwidth or a rate of 20,000 requests/sec, you need at least 20 shards, since each shard accepts 1 MB/sec and 1,000 records/sec of writes.
Your data is spread across the shards, so they are purely about capacity; which shard a record lands on does not affect your input or output. (Shards also affect parallelization via the hash partition key, but that is a separate topic, so I won't go into it here to avoid confusion.)
You should be concerned with the "put_record" or "put_records" calls on the producer (i.e. input) side, and with the records emitted on the consumer (i.e. output) side. You should not worry about which shard a record passed through; you just take the record on the consumer side and process it according to your business needs.
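For the producer side, here is a minimal boto3 sketch (the region, stream name and payload fields are placeholder assumptions; the stream itself must already exist):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

def send_batch(records, stream_name="my-stream"):  # hypothetical stream name
    """Batch-put records; Kinesis picks the shard by hashing the PartitionKey."""
    entries = [
        {
            "Data": json.dumps(r).encode("utf-8"),
            # Any reasonably distributed key (a server or user id, say) spreads
            # records across shards; you never address a shard directly.
            "PartitionKey": str(r.get("server_id", "default")),
        }
        for r in records
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=entries)
    # put_records is not all-or-nothing: retry only the entries that failed.
    if response["FailedRecordCount"]:
        failed = [e for e, res in zip(entries, response["Records"]) if "ErrorCode" in res]
        kinesis.put_records(StreamName=stream_name, Records=failed)
```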
Using the Kinesis Client Library ( https://github.com/awslabs/amazon-kinesis-client ) is the best way to get this abstraction.
There is also a sample project on GitHub, Amazon Kinesis Connectors ( https://github.com/awslabs/amazon-kinesis-connectors ), that consumes data and uploads it to S3.
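If you don't want the full KCL, a bare-bones boto3 consumer loop looks roughly like the sketch below; it handles a single shard and skips checkpointing and resharding, which is exactly what the KCL adds on top (stream, shard and bucket names are placeholders):

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption
s3 = boto3.client("s3")

def drain_shard(stream_name, shard_id, bucket):
    """Read one shard from the beginning and dump each non-empty batch to its own S3 object."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while iterator:
        out = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        if out["Records"]:
            body = b"\n".join(r["Data"] for r in out["Records"])
            key = f"events/{shard_id}/{int(time.time())}.jsonl"  # hypothetical key layout
            s3.put_object(Bucket=bucket, Key=key, Body=body)
        iterator = out.get("NextShardIterator")
        time.sleep(1)  # GetRecords is limited to 5 calls/sec per shard
```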

Related

Best strategy to archive specific records from RDS to a cheaper storage in AWS

I have the following requirements:
Every record deleted from RDS needs to be archived somewhere cheaper on AWS.
Reduce storage cost
Not using Glacier
Context oriented (e.g. a file per table)
Re-import is not a requirement
I'm not an experienced AWS user, so I'm still a bit lost among the number of options it offers, and I'd like to know if you have other ideas to help me sort this out.
Initial thoughts:
The microservice that deletes the record might send it to a broker (RabbitMQ, for example) and another microservice (let's call it the archiver) will listen to it, write the records into a file, zip it and send it to S3. This approach has some technical challenges though: for creating big files to make sense, I need to wait for the queue to grow a bit, wrap it into a stream and zip it into S3. Transaction control is also weak, since writing the file and acking the messages are separate signals, i.e. I would only remove the messages from the broker after the file is created.
Add a "deleted (bool)" column to the "archivable" tables and run a separate job that fetches only those records and saves them into S3. Discarded: they don't want the new microservice to have access to the other services' databases.
Follow the same approach as in the first item, but instead of saving into S3, save into a cheaper database. SimpleDB?
Option 1, but instead of RabbitMQ, write it to a Kinesis Firehose delivery stream and point that at an S3 location - it doesn't get much cheaper or easier than that.
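A sketch of what the archiver call could look like with boto3 (the delivery stream name is a placeholder; the Firehose delivery stream itself is configured separately to buffer, optionally compress, and write to S3):

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # region is an assumption

def archive_deleted_row(table_name, row):
    """Push one deleted RDS row to a Firehose delivery stream that buffers into S3."""
    record = {"table": table_name, "row": row}
    firehose.put_record(
        DeliveryStreamName="rds-archive",  # hypothetical delivery stream name
        Record={"Data": (json.dumps(record, default=str) + "\n").encode("utf-8")},
    )
```

Grouping the output by table (one "file" per table) can then be handled on the Firehose side, for example with an S3 prefix or dynamic partitioning on the table field, rather than in your own code.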

Architectural advice for AWS firehose or similar when collecting a lot of events in real-time

I would like to ask for some advice about handling many application events on AWS. My application sends a lot of different events, in real time, about everything a user does. For collecting those events I'm using AWS Firehose (Kinesis): I have a few delivery streams to which I push different events. Some events contain data that I want to extract and store in other databases (DynamoDB) or in other S3 files before they are stored in S3/Redshift; for that case I'm using a Lambda assigned to the specific stream.
My problem is that the business keeps adding new events they need to collect or process, and for every new event or group of events I need to create a separate delivery stream + S3/Redshift/Elasticsearch destination + Lambda for extracting the data. Also, the events on S3 are stored in a single flat format, and it is not possible to group them in the stream filename, e.g. by the application's userId or even by event name. An ideal S3 layout for these events would look like events/{user_id}/{date}/{event-name}{timestamp}.json.
Maybe I'm using Firehose wrongly, or thinking about it wrongly for my case, and there are other, better AWS services that would give me more control. Maybe a simple SQS + Lambdas listening on S3 would be a better solution here?
Thanks for any advice.
EDIT 12th Nov 2020
This was supposed to be a comment for #Lina, but it was too long for a comment, so I updated my question with the solution I picked.
I resolved my issue by feel, so it may not be a pattern worth repeating, but: I wrote a Node.js routing application that I connected to Firehose, plus a few microservices that the routing app sends data to. So now I have a single Firehose pipe carrying 10 different event types. When an event comes in, the routing application decides which microservice should run with which data, based on the event type (the raw Firehose event is still stored on S3 automatically). This gives me the flexibility I need: I can extract specific data from an event and do whatever I need with it by calling any of the other microservices in the system, while still keeping the raw event in S3 in case I ever need to replay the history of events.
Some events are not passed to any service and are just stored as raw S3 files, e.g. application logs; I can do many things with those files on the S3 PUT/CREATE event.
I hope that it will help someone with a similar problem.
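The original router was written in Node.js; a minimal Python sketch of the same idea, with hypothetical event names and handlers standing in for the real microservice calls, might look like this:

```python
import json

# Hypothetical mapping from event type to the downstream handler that processes it.
ROUTES = {
    "user_signed_up": lambda event: print("-> accounts service", event),
    "order_placed":   lambda event: print("-> billing service", event),
}

def route(raw_record: bytes):
    """Dispatch a single Firehose/Kinesis record based on its event type.

    Unknown event types simply get no extra processing; the raw record is
    already stored on S3 by Firehose, so nothing is lost.
    """
    event = json.loads(raw_record)
    handler = ROUTES.get(event.get("type"))
    if handler:
        handler(event)
```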

Spark Streaming with S3 vs Kinesis

I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
Have Spark Streaming watch an S3 prefix and pick up new objects as they come in.
Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application.
While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points:
The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files.
I posted this to the Spark mailing list and got a good answer from Steve Loughran:
There's a slightly more optimised streaming source for cloud streams here:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll leave it to you to work out how many there will be in your application, and how much it will cost. And of course, the more LIST calls there are, the longer things take, the bigger your window needs to be.
“Full” filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
Objects written to S3 aren't visible until the upload completes, in an atomic operation. You can write in place and not worry.
The timestamp on S3 artifacts comes from the PUT time. On multipart uploads of many MB/many GB, that's when the first POST to initiate the MPU is kicked off. So if the upload starts in time window t1 and completes in window t2, the object won't be visible until t2, but the timestamp will be that of t1. Bear that in mind.
The Lambda callback probably does have better scalability and resilience; I haven't tried it myself.
Since the number of objects in my scenario is going to be much larger than 5000 and will keep growing quickly, S3-to-Spark doesn't seem feasible. I did consider moving or renaming processed objects from within Spark Streaming, but the Spark Streaming application code only receives DStreams, with no information about which S3 object the data came from. So I'm going to go with the Lambda and Kinesis option.
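A rough sketch of the Lambda half of that option, assuming the DMS output objects are small enough to forward whole (the bucket, stream name and size handling are assumptions):

```python
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def handler(event, context):
    """Triggered by S3 ObjectCreated events; forwards each new DMS object to Kinesis."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        kinesis.put_record(
            StreamName="dms-changes",  # hypothetical stream name
            Data=body,                 # must stay under the 1 MB Kinesis record limit
            PartitionKey=key,
        )
```

If the objects can exceed the 1 MB record limit, an alternative is to put only the (bucket, key) reference on the stream and have the Spark application fetch the object itself.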

S3 vs DynamoDB for GPS data

I have the following situation that I try to find the best solution for.
A device writes its GPS coordinates to a CSV file every second and uploads the file to S3 every x minutes before starting a new CSV.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a Lambda function to automatically save the CSV data into DynamoDB records.
Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (i.e. per second), so 10-minute chunks result in 24 kB files. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so a single write would need 24 units. Reads (4 kB per unit) are even worse, since a user may request timeframes longer than 10 minutes: a request covering e.g. 6 hours (= 864 kB) would need 216 read capacity units. That is just too expensive once there are multiple users.
When I read directly from S3 I run into the browser's limit on concurrent requests. The 6-hour timespan, for instance, would cover 36 files. That might still be acceptable given a connection limit of 6, but a request for 24 hours (= 144 files) would simply take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
The number of requests is an issue, but you could try putting a CloudFront distribution in front of the bucket and using HTTP/2, which allows the browser to request multiple files over the same connection.
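For the key-naming approach above, here is a sketch of pulling one time window's worth of keys straight from S3 with boto3 (the bucket name and exact key format are assumptions, following the deviceid_ISO-timestamp suggestion):

```python
import boto3

s3 = boto3.client("s3")

def keys_for_hour(bucket, device_id, hour_prefix):
    """List the GPS CSV objects for one device and hour, e.g. hour_prefix='2016-11-11T08'."""
    paginator = s3.get_paginator("list_objects_v2")
    prefix = f"{device_id}_{hour_prefix}"
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Covering 8am-2pm means iterating over the hour prefixes 2016-11-11T08 .. 2016-11-11T13
# and downloading each returned key.
```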
Have you considered using AWS Kinesis Firehose? Your data will be periodically shovelled into Redshift, which is like Postgres. You just pump a JSON-formatted or |-delimited record into a Firehose endpoint and the rest is magic by the little AWS elves.

Kinesis ProvisionedThroughputExceededException even after sufficient shards

We have been facing a ProvisionedThroughputExceededException while writing data to a Kinesis stream.
Case 1:
We used a single m4.4xlarge (16 cores, 64 GB memory) instance to write data to the stream, sending 3k requests from JMeter. The EC2 instance gave us 1,100 requests per second, so we chose a 2-shard stream (i.e. 2,000 events per second).
As a result we were able to write data to the stream successfully without any loss.
Case 2:
For further testing we created a cluster of 10 m4.4xlarge (16 cores, 64 GB memory) EC2 instances and an 11-shard stream (based on the simple calculation of 1,000 eps per shard, so 10 shards + 1 spare).
When we tested that EC2 cluster with different request volumes from JMeter (3, 10 and 30 million requests), we received ProvisionedThroughputExceededException errors in our log file.
On the JMeter side the EC2 cluster gave us 7,500 eps, and I believe that with 7,500 eps a stream with 11,000 eps capacity should not return such an error.
Could you help me understand the reason behind this issue?
It sounds like Kinesis is not hashing/distributing your data evenly across your shards - some are "hot" (getting the ProvisionedThroughputExceededException), while others are "cold".
To solve this, I recommend
Use the ExplicitHashKey parameter in order to have control over which shards your data goes to. The PutRecords documentation has some basic info on this (but not as much as it should).
Also, make sure that your shards are evenly split across the hash space (appropriate starting/ending hash key).
The simplest pattern is just to have a single pre-defined ExplicitHashKey per shard and have your PutRecords logic iterate through them, one per record, which gives a perfectly even distribution (see the sketch below). In any case, make sure your record hashing algorithm distributes records evenly across the shards.
Another alternative/extension of using ExplicitHashKey is to dedicate a subset of your hash space to "overflow" shard(s) - in your case, one specific ExplicitHashKey value mapped to one shard - and when you start being throttled on your normal shards, send the records there for retry.
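A minimal boto3 sketch of the even-split idea: derive one ExplicitHashKey per open shard from the shard's hash-key range and cycle through them (the stream name is a placeholder, and this assumes a single DescribeStream call returns all shards):

```python
import itertools
import json
import boto3

kinesis = boto3.client("kinesis")

def shard_hash_keys(stream_name):
    """One ExplicitHashKey per shard, taken from each shard's starting hash key."""
    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    return [s["HashKeyRange"]["StartingHashKey"] for s in shards]

def put_evenly(stream_name, records):
    """Round-robin the records across shards by forcing the hash key explicitly."""
    keys = itertools.cycle(shard_hash_keys(stream_name))
    entries = [
        {
            "Data": json.dumps(r).encode("utf-8"),
            "PartitionKey": "unused",  # required by the API, but ignored when ExplicitHashKey is set
            "ExplicitHashKey": next(keys),
        }
        for r in records
    ]
    return kinesis.put_records(StreamName=stream_name, Records=entries)
```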
Check your producer side: are you sure you are inserting data into different shards? The "PartitionKey" value in the PutRecordRequest call may help you.
I think you need to pass different "partition keys" for your records so that the data is spread between different shards.
Even if you have created multiple shards, if all of your records use the same partition key then you are still writing to a single shard, because they all end up with the same hash value. See the PartitionKey documentation for more, and the sketch below.
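To illustrate the point: a constant partition key hashes every record onto the same shard, while a high-cardinality key spreads the load. A sketch (the stream name is a placeholder):

```python
import uuid
import boto3

kinesis = boto3.client("kinesis")

# Anti-pattern: every record hashes to the same shard, however many shards exist.
def put_hot(stream_name, payload: bytes):
    kinesis.put_record(StreamName=stream_name, Data=payload, PartitionKey="constant")

# Better: a high-cardinality key (here a random UUID) so MD5(key) spreads records across shards.
def put_spread(stream_name, payload: bytes):
    kinesis.put_record(StreamName=stream_name, Data=payload, PartitionKey=str(uuid.uuid4()))
```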