I have a streaming pipeline; if I increase the Event Hub partitions from 64 to 128, will that lead to data loss? - azure-eventhub

I have Azure Event Hubs created and running on a "Dedicated" cluster. It is currently at 64 partitions; how do I ensure that there is no data loss when it is increased to 128 partitions?
Note:
Order of events does not matter in this scenario
Events can be written to any partition (round-robin fashion)
The consumer of this Event Hub is a Function App running on an I2:64 dedicated App Service plan.

Partition scale-out should not cause data loss. Producers may see intermittent failures during scale-out; however, if your code properly retries failed operations, you should not need to worry. For peace of mind, you can first run a drill on a test Event Hub and confirm that scale-out is handled without any issues.
Also, make sure consumers are properly configured to receive from the new partitions as well.
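As a minimal sketch of the retry advice above, using the Python azure-eventhub SDK (the connection string and hub name are placeholders; retry_total is the client's built-in retry setting):

```python
from azure.eventhub import EventHubProducerClient, EventData

# Sketch only: retry_total tells the client to retry transient send failures,
# which covers the intermittent errors producers can see during scale-out.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    eventhub_name="<hub-name>",      # placeholder
    retry_total=5,                   # default is 3; raise it around the scale-out window
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(b"payload"))
    producer.send_batch(batch)       # raises only after all retries are exhausted
```

On the consumer side, a Function App with an Event Hubs trigger is built on the event processor and should rebalance across the new partitions automatically, but it is worth verifying that after the change.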

Related

How to slow down reads in Kinesis Consumer Library?

We have an aggregation system where the aggregator is a KDA application running Flink, which aggregates the data over a 6-hour time window and puts all of it into an AWS Kinesis Data Stream.
We also have a consumer application that uses the KCL 2.x library, reads the data from KDS, and puts it into DynamoDB. We are using the default KCL configuration and have set the poll time to 30 seconds. The issue we are facing now is that the consumer application reads all the data in KDS within a few minutes, causing a burst of writes to DynamoDB in a short period of time and, in turn, scaling issues in DynamoDB.
We would like to consume the KDS data slowly and even out the consumption across time, allowing us to keep a lower provisioned capacity for WCUs.
One way we could do that is to increase the polling time for the KCL consumer application. Is there any configuration that can limit the number of records we poll, helping us to reduce the write throughput to DynamoDB, or any other way to fix this problem?
Appreciate any responses
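For reference, KCL 2.x does expose polling knobs on its Java PollingConfig (maxRecords and idleTimeBetweenReadsInMillis) that bound how much each poll fetches. Independent of KCL configuration, another hedged option is to rate-limit the DynamoDB write path inside the record processor itself; a minimal Python-style sketch (the target rate and write helper are assumptions):

```python
import time

class RateLimiter:
    """Token-bucket limiter to smooth DynamoDB writes (illustrative sketch)."""
    def __init__(self, writes_per_second: float):
        self.rate = writes_per_second
        self.allowance = writes_per_second
        self.last_check = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        elapsed = now - self.last_check
        self.last_check = now
        self.allowance = min(self.rate, self.allowance + elapsed * self.rate)
        if self.allowance < 1.0:
            time.sleep((1.0 - self.allowance) / self.rate)  # wait for a token
            self.allowance = 0.0
        else:
            self.allowance -= 1.0

limiter = RateLimiter(writes_per_second=100)  # assumption: match provisioned WCUs

def process_records(records):
    for record in records:
        limiter.acquire()           # blocks until a write token is available
        write_to_dynamodb(record)   # hypothetical stand-in for your existing write path
```

This trades consumer lag for a flat write rate, which is the "even out the data consumption across time" behavior described above.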

Recover PubSub Acked messages from Dataflow after a region loss

I have been reading about how Dataflow acks messages when reading data in streaming.
Based on the answers here and here, it seems Dataflow acks the messages by bundle: once it finishes a bundle, it acks the messages in it.
The confusion is about what happens when there is a GroupByKey involved in the pipeline. The data in the bundle will be persisted to a multi-regional bucket and the messages will be acknowledged. Then imagine the whole region goes down. The intermediate data will still be in the bucket (because it is multi-regional).
That being said,
What are the steps to follow in order to not lose any data?
Any recommendations on how to handle this active/active approach in order to not lose data when a region is completely down?
Please advise,
With Dataflow and the current implementation of PubSubIO, achieving at-least-once delivery depends on checkpointed state being available. You must always drain your pipeline when cancelling; otherwise, checkpointed state may be lost. If a whole region became unavailable and you needed to start up the job in another region, I believe this would be equivalent to having the pipeline cancelled without draining.
We have several simple streaming Dataflow pipelines that read from PubSub and write to PubSub without ever invoking a GroupByKey, so no checkpoint state is involved and messages are only ack'd after being delivered to the output topic.
We have other pipelines that read from Pubsub and write to GCS or BigQuery. FileIO and BigQueryIO both include several GroupByKey operations, so we are vulnerable to data loss if checkpointed messages are dropped. We have had several occasions where these pipelines got into an unrecoverable state that required cancelling. In those scenarios, we had to backfill a portion of the data from an earlier stage of our data architecture.
At this point, Beam does not offer a solution for delaying acks of Pubsub messages across a GroupByKey, so you need to either accept that risk and build operational workflows that can recover from lost checkpointed state or work around the issue by sinking messages to a different data store outside of Beam.
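To illustrate the safer pattern described above (no GroupByKey, so no checkpointed state to lose), a minimal streaming pass-through in the Beam Python SDK might look like this; the project, subscription, and topic names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass-through streaming pipeline: no GroupByKey, so there is no checkpointed
# state at risk; messages are acked only after delivery to the output topic.
options = PipelineOptions(streaming=True)  # plus your runner/project/region flags

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
     | "Write" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/my-output"))
```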

Read millions of SQS messages and persist in RDS

I have millions of SQS messages coming in on a daily basis. Currently we read them from various poller machines and write them into an RDBMS (Aurora PostgreSQL). The architecture has two flaws:
It is taking more than 10 hours to process all the SQS messages. We are targeting 2-3 hours for this.
The SQS messages come from a job; it is not a continuous activity. Maintaining poller machines 24 hours a day is costing us.
We have already configured SQS NumberOfPollers to 20 and MessageFetchSize to 10.
My questions are:
Apart from NumberOfPollers and MessageFetchSize, is there any other SQS configuration parameter we can use to speed up the process?
How do we calculate the correct values of NumberOfPollers and MessageFetchSize? We are just using trial and error for this.
Can we utilize EMR/Spark to allocate machines on demand, run the pollers, and terminate them after execution, so that we need not maintain machines 24x7?
Any other suggestions/ways to achieve this?
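For context on what each poller is doing, here is a hedged boto3 sketch of a single polling loop; the queue URL and persistence helper are placeholders. MaxNumberOfMessages (what MessageFetchSize maps to) is capped at 10 per call, so extra throughput has to come from running more of these loops in parallel, which is effectively what raising NumberOfPollers does:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def drain_queue():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # hard SQS maximum per call
            WaitTimeSeconds=20,      # long polling: fewer empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            break                    # nothing after a 20 s long poll; treat as drained
        persist_batch(messages)      # hypothetical bulk insert into Aurora
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                     for m in messages],
        )
```

Because the loop exits when the queue stays empty, the same code fits an on-demand model (containers or Spot instances spun up when the job starts publishing and terminated afterwards), which addresses the 24x7 cost concern.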

Estimate SQS processing time and load

I am going to use AWS SQS (a regular queue, not FIFO) to process different client-side metrics.
I expect to have ~400 messages per second (worst case). Each SQS message will contain the S3 location of a file.
I created an application, which will listen to my SQS Queue, and process messages from it.
By process I mean:
read SQS message ->
take S3 location from that SQS message ->
call S3 client ->
Read that file ->
Add a few additional fields ->
Publish data from this file to AWS Kinesis Firehose.
A similar process applies to each SQS message in the queue. The S3 files are small, less than 0.5 KB.
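Sketched with boto3, and assuming both the SQS message body and the file content are JSON (the body layout, the Firehose stream name, and the added field are all assumptions), that per-message flow might look like:

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def handle_message(msg):
    body = json.loads(msg["Body"])              # assumes {"bucket": ..., "key": ...}
    obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
    record = json.loads(obj["Body"].read())     # files are < 0.5 KB, so read fully
    record["processed_by"] = "sqs-worker"       # "add a few additional fields"
    firehose.put_record(
        DeliveryStreamName="my-stream",         # placeholder
        Record={"Data": (json.dumps(record) + "\n").encode()},
    )
```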
How can I calculate whether I will be able to process those 400 messages per second? How can I estimate whether my solution would handle a 5x increase in data?
Test it! Start with a small scale, and do the math to extrapolate from there. Make your test environment as close to what it will be in production as feasible.
On a single host and single thread, the math is simple:
1000 / AvgTotalTimeMillis = AvgMessagesPerSecond, or
1000 / AvgMessagesPerSecond = AvgTotalTimeMillis
How to approach testing this:
Start with a single thread and host, and generate some timing metrics for each step that you outlined, along with a total time.
Figure out your average/max/min time, and how many messages per second that translates to
400 messages per second on a single thread and host means under 3 ms per message (1000 / 400 = 2.5 ms). Hopefully this makes it obvious that you need multiple threads/hosts.
Scale up!
Now that you know how much a single thread can handle, figure out how many threads a single host can effectively handle (you'll need to experiment). Consider batching messages where possible - SQS provides batch operations.
Use math to calculate how many hosts you need (a short sketch of this arithmetic follows after this list)
If you need 5x that number, go up from there
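For example, a back-of-the-envelope version of that host calculation in Python (every measured number below is a made-up assumption):

```python
import math

avg_total_time_ms = 12.0                    # measured per-message time, one thread
per_thread_rate = 1000 / avg_total_time_ms  # ~83 messages/second/thread
threads_per_host = 16                       # found by experimenting on one host
target_rate = 400 * 5                       # worst case plus the 5x headroom

hosts_needed = math.ceil(target_rate / (per_thread_rate * threads_per_host))
print(hosts_needed)                         # -> 2 for this set of assumptions
```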
While you're doing this math, consider any limits of the systems you're using:
Review the throttling limits of SQS / S3 / Firehose / etc. If you plan to use Lambda to do the work instead of EC2, it has limits too. Make sure you're within those limits, and consider contacting AWS support if you are close to exceeding them.
A few other suggestions based on my experience:
Based on your workflow outline & details, using EC2 you can probably handle a decent number of threads per host
An m5.large should be more than enough; you can probably go smaller, as the performance bottleneck will likely be networking I/O to fetch and send messages.
Consider using autoscaling to handle message spikes for when you need to increase throughput, though keep in mind autoscaling can take several minutes to kick in.
The only way to determine this is to create a test environment that mirrors your scenario.
If your solution is designed to handle messages in parallel, it should be possible to scale-up your system to handle virtually any workload.
A good architecture would be to use AWS Lambda functions to process the messages. Lambda defaults to 1000 concurrent executions. So, if a function takes 3 seconds to run, it would support ~333 messages per second consistently. You can request an increase to the Lambda concurrency limit to handle higher workloads.
If you are using Amazon EC2 instead of Lambda functions, then it would just be a matter of scaling-out and adding more EC2 instances with more workers to handle whatever workload you desired.
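The sizing arithmetic behind that Lambda answer is Little's law: required concurrency = arrival rate x average duration. Applied to the numbers quoted above:

```python
avg_duration_s = 3            # per-invocation runtime, from the answer above
default_concurrency = 1000    # account default; can be raised on request

sustained_rate = default_concurrency / avg_duration_s  # ~333 messages/second
required_for_5x = 400 * 5 * avg_duration_s             # 6000 concurrent executions
print(sustained_rate, required_for_5x)
```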

Orleans EventHub stream provider

I am using the EventHubStream provider in a project based on Orleans.
After the system has been running for a few minutes, Orleans starts throwing a QueueCacheMissException while a producer is trying to push an event via OnNext.
I have tried increasing the size of the cache, but that helped only for a while.
Is this normal behavior given the size of the cache?
In this situation, should I unsubscribe and subscribe again? I have tried to resume the stream, but that didn't work; the stream was in a faulted state... any ideas?
It is likely that the service is reading events from EventHub faster than the grains are processing them. EventHub can deliver events at a rate of ~1k/second per partition.
The latest version of the EventHub stream provider supports backpressure, which should prevent this problem, but it has not been released yet. You can, however, build your own NuGet packages.