How does a Spark Streaming application work when it fails? - amazon-web-services

I have started learning about Spark Streaming applications with Kinesis. We have a case where our Spark Streaming application fails; it restarts, but the issue is that on restart it tries to process more messages than it can handle and fails again. So:
Is there any way we can limit the amount of data a Spark Streaming application processes, say in terms of bytes?
And say a Spark Streaming application fails and remains down for 1 or 2 hours, with InitialPositionInStream set to TRIM_HORIZON. When it restarts, it will start from the last message processed in the Kinesis stream, but live ingestion into Kinesis continues in the meantime. How does the Spark Streaming application then process both the 1 to 2 hours of data sitting in Kinesis and the live data that is still being ingested?
PS - The Spark Streaming job runs on EMR with a batch interval of 15 seconds; the Kinesis CheckpointInterval is set to 60 seconds, so every 60 seconds it writes the processed-data details to DynamoDB.
If my questions are unclear or you need any more information to answer them, do let me know.
spark-streaming-kinesis
Thanks..

Assuming you are reading the data from a message queue like Kafka or Event Hubs:
If that's the case, whenever the Spark Streaming application goes down, it will try to resume from the offset it had reached before failing.
By the time you restart the job, it will have accumulated more data; it will try to process the whole backlog at once and fail again, either with an out-of-memory error or by losing executors.
To prevent that, you can use a configuration such as "maxOffsetsPerTrigger", which creates a back-pressure mechanism that keeps the job from reading all the data at once and smooths out the pull and processing.
More details can be found here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
From the official docs:
Rate limit on maximum number of offsets processed per trigger
interval. The specified total number of offsets will be proportionally
split across topicPartitions of different volume.
Example of setting max offsets per trigger (note this option applies to streaming reads, so readStream is used):
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topicName")
  .option("startingOffsets", "latest")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
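The example above is Structured Streaming with Kafka; the original question uses the Kinesis DStream API on EMR, where the equivalent throttling is done through Spark's backpressure settings. A config sketch (the rate values are illustrative assumptions, not recommendations):

```
# Spark Streaming (DStream) rate limiting
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.initialRate=1000   # records/sec right after a (re)start
spark.streaming.receiver.maxRate=5000           # hard cap, records/sec per receiver
```

With backpressure enabled, Spark adjusts the ingestion rate to what recent batches actually managed to process, which is exactly what prevents the post-restart flood described in the question.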
To process the backlog as quickly as possible and catch up with the real-time data, you may need to scale up your infrastructure accordingly; some sort of auto-scaling might help in this case.
After the backlogged data is processed, the job will scale back down automatically.
https://emr-etl.workshop.aws/auto_scale/00-setup.html

Related

How to slow down reads in Kinesis Consumer Library?

We have an aggregation system where the aggregator is a KDA application running Flink, which aggregates the data over a 6-hour time window and puts all of it into an AWS Kinesis Data Stream.
We also have a consumer application that uses the KCL 2.x library, reads the data from KDS, and puts it into DynamoDB. We are using the default KCL configuration and have set the poll time to 30 seconds. The issue we are facing now is that the consumer application reads all the data in KDS within a few minutes, causing huge writes to DynamoDB in a short period of time and hence scaling issues in DynamoDB.
We would like to consume the KDS data slowly and even out the consumption over time, allowing us to keep a lower provisioned capacity for WCUs.
One way we could do that is to increase the polling time for the KCL consumer application. Is there any configuration that can limit the number of records we poll, helping us reduce the write throughput to DynamoDB, or any other way to fix this problem?
Appreciate any responses
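One place to look is the KCL 2.x retrieval configuration: the polling consumer exposes knobs for fetch size and poll cadence. A sketch of the relevant settings (names are the Java builder methods on PollingConfig shown property-style; the values are illustrative assumptions):

```
# KCL 2.x PollingConfig
maxRecords = 1000                      # cap records returned per GetRecords call
idleTimeBetweenReadsInMillis = 30000   # wait between polls (the 30 s mentioned above)
```

Lowering maxRecords while keeping the idle time fixed spreads the same volume over more, smaller polls, which should flatten the DynamoDB write spikes.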

Kinesis vs SQS, which is the best for this particular case?

I have been reading about Kinesis vs SQS differences and when to use each but I'm struggling to know which is the appropriate solution for this particular problem:
Strava-like app where users record their runs
50 incoming runs per second
The processing of each run takes exactly 1 minute
I want the user to have their results in less than 5 minutes
A run is just a GUID; the job that processes it will get all the info from S3
If I understand correctly, in Kinesis you can have 1 worker per shard, correct? That would mean 1 run per minute per worker. Since I have 3000 incoming runs per minute, meeting the 5-minute deadline would mean I would need 600 shards with 1 worker each.
Is this assumption correct?
With SQS I can just have 1 queue and as many workers as I like, up to SQS's limit of 120,000 inflight messages.
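As a sanity check on the arithmetic in the question, Little's law gives the steady-state number of runs in flight, which is the number of concurrent workers needed just to keep up with arrivals, independent of the 5-minute deadline. A minimal sketch (the numbers are the asker's assumptions):

```python
# Little's law: in-flight work L = arrival rate (lambda) * time in system (W).
arrival_rate = 50          # runs arriving per second (from the question)
service_time = 60          # seconds to process one run (from the question)
concurrency = arrival_rate * service_time
print(concurrency)         # 3000 runs in flight at steady state
```

In other words, at 50 runs/second with 1-minute processing, the system needs capacity for 3000 concurrent runs just to avoid falling behind; the 5-minute deadline only bounds how much queueing delay is tolerable on top of that.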
If 1 run errors during processing I want to reprocess it a few more times and then store it for further inspection.
I don't need to process messages in order, and duplicates are totally fine.
1 worker per message, after it's processed i no longer care about the message
In that case, a queuing service such as SQS should be used. Kinesis is a streaming service, which persists the data: multiple workers can read messages from a stream for as long as the messages are retained, and none of your workers can remove a message from the stream.
Also, with SQS you can set up dead-letter queues, which let you capture messages that fail to process after a predefined number of attempts.
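For the dead-letter-queue part, the redrive policy is just a JSON attribute attached to the source queue. A hedged boto3-style sketch (the queue URL and ARN are placeholders; the actual AWS call is shown commented out):

```python
import json

# After max_receives failed processing attempts, SQS moves the message
# to the dead-letter queue for later inspection.
max_receives = 5
redrive_policy = json.dumps({
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:runs-dlq",  # placeholder ARN
    "maxReceiveCount": str(max_receives),
})

# With boto3 this would be applied roughly as (not executed here):
# boto3.client("sqs").set_queue_attributes(
#     QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/runs",  # placeholder URL
#     Attributes={"RedrivePolicy": redrive_policy},
# )
print(redrive_policy)
```

Since duplicates are fine and ordering does not matter in this use case, a standard queue with this redrive policy covers the "reprocess a few times, then store for inspection" requirement.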

Stop Streaming pipeline when no more messages to consume

I have a streaming dataflow pipeline job which reads messages from a given pub-sub topic.
I understand there is an auto-ack once the bundles are committed. How do I make the pipeline stop when there are no more messages to consume?
Edit - I have a scenario where I need to drain off incorrect messages sent to the topic, so this would be a one-time job. My application sends 1MM messages only once a day (at a fixed time) to that topic.
Why would you want to stop the streaming pipeline? If you are concerned about being charged while the pipeline is doing nothing, you should not be: you are only charged for the resources you actually use (CPU hours, memory, disk storage, etc.); please see the pricing details here.
Since your source is unbounded (e.g. Pub/Sub), there is no way to tell that there will be no more incoming data in the future.

Amazon KCL Checkpoints and Trim Horizon

How are checkpoints and trimming related in AWS KCL library?
The documentation page Handling Startup, Shutdown, and Throttling says:
By default, the KCL begins reading records from the tip of the
stream, which is the most recently added record. In this
configuration, if a data-producing application adds records to the
stream before any receiving record processors are running, the records
are not read by the record processors after they start up.
To change the behavior of the record processors so that it always
reads data from the beginning of the stream, set the following value
in the properties file for your Amazon Kinesis Streams application:
initialPositionInStream = TRIM_HORIZON
The documentation page Developing an Amazon Kinesis Client Library Consumer in Java says:
Streams requires the record processor to keep track of the records
that have already been processed in a shard. The KCL takes care of
this tracking for you by passing a checkpointer
(IRecordProcessorCheckpointer) to processRecords. The record processor
calls the checkpoint method on this interface to inform the KCL of how
far it has progressed in processing the records in the shard. In the
event that the worker fails, the KCL uses this information to restart
the processing of the shard at the last known processed record.
The first page seems to say that the KCL resumes at the tip of the stream, the second page at the last known processed record (that was marked as processed by the RecordProcessor using the checkpointer). In my case, I definitely need to restart at the last known processed record. Do I need to set the initialPositionInStream to TRIM_HORIZON?
With a Kinesis stream you have two options: you can read the newest records, or start from the oldest (TRIM_HORIZON).
But once your application has started, it just reads from the position where it stopped, using its checkpoints.
You can see those checkpoints in DynamoDB (the table name is usually the same as the app name).
So if you restart your app, it will usually continue from where it stopped.
The answer is no, you don't need to set the initialPositionInStream to TRIM_HORIZON.
When you are reading events from a kinesis stream, you have 4 options:
TRIM_HORIZON - the oldest events that are still in the stream shards before they are automatically trimmed (default 1 day, but can be extended up to 7 days). You will use this option if you want to start a new application that will process all the records that are available in the stream, but it will take a while until it is able to catch up and start processing the events in real-time.
LATEST - the newest events in the stream, ignoring all past events. You will use this option if you are starting a new application that you want to process in real time immediately.
AT/AFTER_SEQUENCE_NUMBER - the sequence number is usually the checkpoint that you keep while processing the events. These checkpoints allow you to process the events reliably, even in cases of reader failure or when you want to update the reader's version and continue processing all the events without losing any of them. Whether you use AT or AFTER depends on when you took the checkpoint: before or after successfully processing that event.
Please note that this is the only shard specific option, as all the other options are global to the stream. When you are using the KCL it is managing a DynamoDB table for that application with a record for each shard with the "current" sequence number for that shard.
AT_TIMESTAMP - the approximate time the event was put into the stream. You will use this option if you want to find specific events to process based on their timestamp. For example, when you know that a real-life event in your service happened at a specific time, you can develop an application that processes those specific events even if you don't have the sequence number.
See more details in Kinesis documentation here: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
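The four options map directly onto the ShardIteratorType parameter of the GetShardIterator API. A small sketch that builds the request parameters for each mode (the stream name, shard ID, and sequence number below are hypothetical):

```python
def shard_iterator_params(stream, shard_id, position,
                          sequence_number=None, timestamp=None):
    """Build GetShardIterator parameters for one of the positioning modes."""
    params = {"StreamName": stream, "ShardId": shard_id,
              "ShardIteratorType": position}
    if position in ("AT_SEQUENCE_NUMBER", "AFTER_SEQUENCE_NUMBER"):
        params["StartingSequenceNumber"] = sequence_number  # usually your checkpoint
    elif position == "AT_TIMESTAMP":
        params["Timestamp"] = timestamp
    # TRIM_HORIZON and LATEST need no extra positioning argument.
    return params

# Resuming after a failure: restart just past the last checkpointed record.
resume = shard_iterator_params("my-stream", "shardId-000000000000",
                               "AFTER_SEQUENCE_NUMBER",
                               sequence_number="49590338271490256608559692538361571095921575989136588898")
```

With boto3 this dict would be passed to kinesis_client.get_shard_iterator(**resume); the KCL does the equivalent internally, using the sequence number stored per shard in its DynamoDB table.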
You should use TRIM_HORIZON. It only has an effect the first time your application starts reading records from the stream; after that, it will continue from the last known position.

Status of kinesis stream reader

How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream, however, I don't know how far along in my data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and getting the last record's sequence number, however that doesn't seem to work if there's no new data since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could append a message send time within your Kinesis message, and on processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
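A minimal sketch of that idea, assuming the producer embeds a send timestamp in each message (the message shape, metric name, and namespace are made up for illustration; the CloudWatch call is shown commented out):

```python
import json
import time

def consumer_lag_seconds(message_body, now=None):
    """How far behind the consumer is, given a producer-side send time."""
    sent_at = json.loads(message_body)["sent_at"]  # epoch seconds set by producer
    return (now if now is not None else time.time()) - sent_at

# Example message processed 42.5 seconds after it was sent.
lag = consumer_lag_seconds('{"sent_at": 1000.0, "payload": "run-123"}', now=1042.5)

# Published as a custom metric, roughly (not executed here):
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MyApp/Kinesis",  # hypothetical namespace
#     MetricData=[{"MetricName": "ConsumerLagSeconds",
#                  "Value": lag, "Unit": "Seconds"}],
# )
print(lag)  # 42.5
```

A lag that stays near zero means the consumer is at the front of the stream; a growing lag means it is falling behind.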
You could also record the number of messages pushed (at the pushing application) and messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you could see that the curves roughly follow each other indicating that the consumer is doing a good job at keeping up with the workload.
You could also try monitoring your Kinesis consumer to see how often it waits idly for records (i.e., no results are returned by Kinesis, suggesting it is at the front of the stream and all records are processed).
Also note there is no way to track a "percent processed" of the stream, since Kinesis messages expire after 24 hours (so the total number of messages is constantly rolling). There is also no direct API call to count the number of messages inside your stream (unless you have recorded this yourself as above).
If you use the KCL, you can do that by comparing IncomingRecords (from the built-in CloudWatch metrics of Kinesis) with RecordsProcessed (a custom metric published by the KCL).
Then select a time range and an interval of, say, 1 day, and plot both metrics on one CloudWatch graph.
If far more records are being added than processed, the curves diverge; by comparing the values at each point you will know exactly whether your processor is falling behind.