I am using the EventHubStream provider in a project based on Orleans.
After the system has been running for a few minutes, Orleans starts throwing a QueueCacheMissException while a producer tries to push an event via OnNext.
I have tried increasing the size of the cache, but that only helped for a while.
Is this normal behavior caused by the size of the cache?
In this situation, should I unsubscribe and subscribe again? I have tried resuming the stream, but that didn't work because the stream was in a faulted state... any ideas?
It is likely that the service is reading events from EventHub faster than the grains are processing them. EventHub can deliver events at a rate of roughly 1k/second per partition.
The latest version of the EventHub stream provider supports backpressure, which should prevent this problem, but it has not been released yet. You can, however, build your own NuGet packages.
We have an aggregation system where the aggregator is a KDA application running Flink, which aggregates data over a 6-hour time window and writes it all to an AWS Kinesis Data Stream.
We also have a consumer application that uses the KCL 2.x library to read the data from KDS and write it into DynamoDB. We are using the default KCL configuration and have set the poll time to 30 seconds. The issue we are facing now is that the consumer application reads all the data in KDS within a few minutes, causing a huge burst of writes to DynamoDB in a short period of time, which in turn causes scaling issues in DynamoDB.
We would like to consume the KDS data more slowly and even out the consumption over time, allowing us to keep the provisioned write capacity (WCUs) lower.
One way to do that is to increase the polling time for the KCL consumer application. Is there any configuration that can limit the number of records we poll, helping us reduce the write throughput to DynamoDB, or any other way to fix this problem?
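For reference, the kind of knob I am hoping exists would look roughly like the sketch below, using the KCL 2.x polling retrieval config (the stream name and the specific values are placeholders; I have not verified that this smooths out consumption the way we want):

```java
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.kinesis.retrieval.polling.PollingConfig;

public class ThrottledRetrievalConfig {

    // Build a PollingConfig that caps the batch size per GetRecords call and
    // waits between polls, so records trickle into the record processor
    // (and therefore into DynamoDB) instead of arriving in one burst.
    public static PollingConfig throttledPolling(KinesisAsyncClient kinesisClient) {
        return new PollingConfig("my-kds-stream", kinesisClient) // placeholder stream name
                .maxRecords(100)                         // at most 100 records per poll
                .idleTimeBetweenReadsInMillis(30_000L);  // ~30s pause between polls
    }
}
```

The idea would be to pass this PollingConfig to the retrieval config (via retrievalSpecificConfig) when building the KCL Scheduler, but I am not sure whether this is the recommended way to even out the write load.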
Appreciate any responses
I have an Azure Event Hub created and running on a "Dedicated" cluster. Currently it has 64 partitions; how do I ensure that there is no data loss when it is increased to 128 partitions?
Note:
Order of events does not matter in this scenario
Events can be written to any partition (round-robin fashion)
The consumer of this Event Hub is a function app running on an I2:64 dedicated App Service plan.
Partition scale-out should not cause data loss. Producers may see intermittent failures during scale-out; however, if your code properly retries failed operations, you should not need to worry. For peace of mind, you can first run a drill on a test event hub and confirm that scale-out is handled without any issues.
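As a sketch of what "properly retrying" could look like, here is a generic retry-with-backoff wrapper (the send operation is just a placeholder Callable, not a specific Event Hubs SDK call):

```java
import java.util.concurrent.Callable;

public final class RetryingSender {

    // Retry a send operation with simple exponential backoff. The Callable is a
    // placeholder for whatever your producer does (e.g. sending a batch of events);
    // transient failures during partition scale-out are retried instead of dropped.
    public static <T> T sendWithRetry(Callable<T> sendOperation, int maxAttempts) throws Exception {
        long backoffMillis = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return sendOperation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2; // exponential backoff between attempts
            }
        }
    }
}
```

In practice the Event Hubs client SDKs already ship with configurable retry policies, so tuning those is usually enough rather than rolling your own loop.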
Btw, make sure consumers are properly configured to receive from new partitions as well.
I have been reading about how Dataflow acks messages when reading data in streaming mode.
Based on the answers here and here, it seems that Dataflow acks the messages by bundle: once it finishes a bundle, it acks the messages in it.
The confusion is about what happens when there is a GroupByKey involved in the pipeline. The data in the bundle will be persisted to a multi-regional bucket and the messages will be acknowledged. Now imagine the whole region goes down. The intermediate data will still be in the bucket (because it is multi-regional).
That being said,
What are the steps to follow in order not to lose any data?
Any recommendations on how to handle this active/active approach so that no data is lost when a region is completely down?
Please advise,
With Dataflow and the current implementation of PubSubIO, achieving at-least-once delivery depends on checkpointed state being available. You must always drain your pipeline when cancelling; otherwise, checkpointed state may be lost. If a whole region became unavailable and you needed to start up the job in another region, I believe this would be equivalent to having the pipeline cancelled without draining.
We have several simple streaming Dataflow pipelines that read from PubSub and write to PubSub without ever invoking a GroupByKey, so no checkpoint state is involved and messages are only ack'd after being delivered to the output topic.
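For context, those pass-through pipelines are essentially the following shape (a simplified Beam Java sketch; the project, subscription, and topic names are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class PubSubPassThrough {
    public static void main(String[] args) {
        StreamingOptions options =
                PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        // No GroupByKey anywhere in the graph, so there is no checkpointed
        // intermediate state: messages are only ack'd on the subscription
        // after they have been published to the output topic.
        pipeline
            .apply("ReadFromPubSub",
                    PubsubIO.readMessages()
                            .fromSubscription("projects/my-project/subscriptions/input-sub"))
            .apply("WriteToPubSub",
                    PubsubIO.writeMessages()
                            .to("projects/my-project/topics/output-topic"));

        pipeline.run();
    }
}
```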
We have other pipelines that read from Pubsub and write to GCS or BigQuery. FileIO and BigQueryIO both include several GroupByKey operations, so we are vulnerable to data loss if checkpointed messages are dropped. We have had several occasions where these pipelines have gotten into an unrecoverable state that required cancelling. In those scenarios, we had to backfill a portion of data from an earlier stage of our data architecture.
At this point, Beam does not offer a solution for delaying acks of Pubsub messages across a GroupByKey, so you need to either accept that risk and build operational workflows that can recover from lost checkpointed state or work around the issue by sinking messages to a different data store outside of Beam.
I am trying to use an AWS Kinesis stream for one of our data flows. I would like to monitor the pending messages on my stream for ops purposes (scale downstream consumers according to the backlog), but I am unable to find any API that gives the (approximate) number of pending messages in my stream.
This seems strange, since messages expire after 7 days; if the producers and consumers are isolated and can't communicate, how do you know messages are expiring? How do you handle this problem?
Thanks!
There is no such concept as a "pending" message in Kinesis. All incoming data is placed on a shard.
Your consumer application should be running all the time to keep track of changes in your stream. The application (with the help of the KCL) will keep polling the shard iterator in the background, so you will be notified about new data when it arrives.
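Under the hood, that polling loop looks roughly like the sketch below using the plain AWS SDK (the stream name, shard id, and limits are placeholders; the KCL does all of this, plus checkpointing, for you):

```java
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class ShardPoller {
    public static void main(String[] args) throws InterruptedException {
        KinesisClient kinesis = KinesisClient.create();

        // Obtain an iterator positioned at the newest data on one shard.
        String shardIterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                .streamName("my-stream")          // placeholder stream name
                .shardId("shardId-000000000000")  // placeholder shard id
                .shardIteratorType(ShardIteratorType.LATEST)
                .build())
                .shardIterator();

        // Keep polling: each response returns the records that arrived plus
        // the iterator to use for the next call.
        while (shardIterator != null) {
            GetRecordsResponse response = kinesis.getRecords(GetRecordsRequest.builder()
                    .shardIterator(shardIterator)
                    .limit(100)
                    .build());
            response.records().forEach(r -> System.out.println(r.sequenceNumber()));
            shardIterator = response.nextShardIterator();
            Thread.sleep(1000); // stay under the per-shard GetRecords limits
        }
    }
}
```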
Roughly speaking, you can see Kinesis as a FIFO queue whose messages disappear after a while if you don't pop them.
If your application will only process a few messages per hour, you should think about changing your architecture; Kinesis is probably not the right tool for you.
I've been using AWS SQS, which has a nice feature: when a message is claimed from the queue, it is locked for a period of time. During this lock, if the message is processed successfully, it is marked as completed. If the processing fails (and no response is received from the message processor), the lock eventually expires and the message becomes available for another processor to pick up.
Now I have a requirement to use queues outside of SQS (mostly for latency reasons, but potentially for cost reasons too). I'm really looking for a queue provider that has the same characteristic. MSMQ would be the obvious choice for me, since it's already installed and we use it elsewhere, but I can't find any functionality that handles failed messages in the same way.
Does MSMQ allow for this, or is there an easy way to replicate it?
Alternatively, is there another lightweight, open-source messaging service that does?
MSMQ does this already. If you read a message within a transaction and the transaction aborts, the message will reappear in the queue.