AWS Kinesis leaseOwner confusion - amazon-web-services

A very simple application running on a Spark cluster with 2 workers, using Kinesis with 2 shards.
And I check the Kinesis Streams Application State on DynamoDB (show in this screenshot) at region North Virginia.
I start and stop workers from time to time, and I just noticed, when the leaseOwner for 2 shards is the same worker, application works fine.
But when I stop the current leaseOwner (10.0.7.63), then there will be a owner switch and new owner will be the other worker (10.0.7.62), then my application pulls data and no data returned from Kinesis (but, the connection with Kinesis is still on).
My guess, is that when the owner is switched to another worker, the checkpoints on the new owner is not matching what is left inside Kinesis, and the pulling the data will get nothing.
Could anyone please explain a bit what's going on here? Am I guessing it right?
Thanks a lot.

First of all, just a friendly reminder; define "workerID" in the configuration of your application with hostname; it will help you with more user friendly names.
Second, are you sure the shard-000 receives data? Maybe you've set a static partition key on consumer side and that is causing the data to stack on only shard-001?

Related

How spark streaming application works when it fails?

I started learning about spark streaming applications with kinesis. I got a case where our spark streaming application fails, it restarts but the issue is, when it restarts, it tries to process more amount of messages than it can process and fails again. So,
Is there any way, we can limit the amount of data a spark streaming application can process in terms of bytes?
Any let say, if a spark streaming application fails and remains down for 1 or 2 hours, and the InitialPositionInStream is set to TRIM_HORIZON, so when it restarts, it will start from the last messages processed in kinesis stream, but since there is live ingestion going on in kinesis then how the spark streaming application works to process this 1 or 2 hours of data present in kinesis and the live data which is getting ingested in kinesis?
PS - The spark streaming is running in EMR and the batch size is set to 15 secs, and the kinesis CheckPointInterval is set to 60 secs, after every 60 secs it writes the processed data details in DynamoDB.
If my question is/are unclear or you need any more informations for answering my questions, do let me know.
spark-streaming-kinesis
Thanks..
Assuming you are trying to read the data from message queues like kafka or event hub.
If thats the case, when ever spark streaming application goes down, it will try to process the data from the offset it left before getting failed.
By the time, you restart the job - it would have accumulated more data and it will try to process all backlog data and it will fail either by Out of Memory or executors getting lost.
To prevent that, you can use something like "maxOffsetsPerTrigger" configuration which will create a back pressuring mechanism there by preventing the job from reading all data at once. It will stream line the data pull and processing.
More details can be found here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
From official docs
Rate limit on maximum number of offsets processed per trigger
interval. The specified total number of offsets will be proportionally
split across topicPartitions of different volume.
Example to set max offsets per trigger
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topicName")
.option("startingOffsets", "latest")
.option("maxOffsetsPerTrigger", "10000")
.load()
To process the backfills as soon as possible and catch up with real time data, you may need to scale up your infra accordingly.
May be some sort of auto scaling might help in this case.
After processing the backlogged data, your job will scale down automatically.
https://emr-etl.workshop.aws/auto_scale/00-setup.html

How to track distributed tasks progress

Here is my case:
When my server receieve a request, it will trigger distributed tasks, in my case many AWS lambda functions (the peek value could be 3000)
I need to track each task progress / status i.e. pending, running, success, error
My server could have many replicas
I still want to know about the task progress / status even if any of my server replica down
My current design:
I choose AWS S3 as my helper
When a task start to execute, it will create marker file in a special folder on S3 e.g. running folder
When the task fail or success, it will move the marker file from running folder to fail folder or success folder
I check the marker files on S3 to check the progress of the tasks.
The problems:
There is a limit for AWS S3 concurrent access
My case is likely to exceed the limit some day
Attempt Solutions:
I had tried my best to reduce the number of request to S3
I don't want to track the progress by storing data in my DB because my DB has already been under heavy workload.
To be honest, it is kind of wierd that using marker files on S3 to track progress of the tasks. However, it worked before.
Is there any recommendations ?
Thanks in advance !
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an Origination Key on events that are supposed to be grouped together so they don't get mixed up with a subsequent event. Also, each Lambda should be given its own key so you can trace progress per Lambda. Guids are perfect for this.

I need help clarifying a high level use-case of Amazon SQS

So I need a second pair of eyes to correct or confirm my understand standing of Amazon SQS. From my understanding, you can add an unlimited amount of messages to one queue. A message can be 256 KB in size, and if it needs to be larger than that, you can use amazon s3 to store 2 GB. Reading around online, it appears there are many use cases for this queuing service. For example one use case of SQS can act as a database buffer.
But here's what I'm looking to do.. I'm looking to make a real time messaging system. My current functionality acts like more of a message board, so the implementation just inserts into the database then reads the data and packages it into JSON to be inserted on SQLITE mobile phone. That works great, but I'm getting a lot of requests from people to make it real-time.
So what I'm wondering is can I utilize amazon SQS to write and read messages for a chat application? So in my theoretical use case of SQS would have a message queue to write to, and pull from the that queue every second to check for messages on mobile. But here's where I'm confused. Since you cannot "Query" a particular message from the queue, would it make sense to have a queue per user then a generic queue for the app server to read from? Or am I just talking crazy and should spend cognitive resources thinking about implementing an open connection on an Ec2 instance?
Any help would be great,
Thanks!
Have you thought about using Amazon SNS to push the chat messages to your mobile devices? Each user publishes to a topic and the readers subscribe to that topic. You just have to be ok with missing messages if the app isn't running.
If you only have a few (or maybe, less than 100) users, you could have thought of having one SQS queue per user. If that is not so, the solution won't be operationally feasible.
If you were to have one generic queue, SQS won't help because it doesn't allow querying for a given field in all available messages.
I can think of following options for your use case:
Setup one Redis cluster, possibly on Amazon ElastiCache. Have one message List per user.
One Messages table in MySQL, possibly on AWS RDS. This will provide an easy way to query messages for a given user.
You can also use DynamoDB in #2.

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous including the flush so it will block all other requests. In our case we are actually flushing 19m keys and it takes more then 30 seconds.
Elasticache performs a health check periodically and since the flush is running the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think its every 30 seconds". However our flush consistently takes more then 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
Currently for Redis versions from 6.2 AWS ElastiCache has a new feature of thread monitoring. So the health check doesn't happen in the same thread as all other actions of Redis. Redis can continue to proceed a long command / lua script, but will still considered healthy. Because of this new feature failovers should happen less.

Recommendation for batch processing on aws

I'm new to using AWS, so any pointers would be appreciated.
I have a need to process large files using our in-house software.
It takes about 2GB of input and generates 5GB of output, running for 2 hours on a c3.8xlarge.
For now I do it manually, start an instance (either on-demand or spot-request), but now I want to reliably automate and scale this processing - what are good frameworks or platform or amazon services to do that?
Especially regarding the possibility that a spot-instance will be terminated half-way through (and I'll need to detect that and restart the job).
I heard about Python Celery, but does it work well with amazon and spot-instances?
Or are there other recommended mechanisms?
Thank you!
This is somewhat opinion-based, but you can mix and match some of the AWS pieces to make this easier:
put the input data on S3
push an entry into a SQS queue indicating a job needs to be processed with a long visibility timeout
set up an autoscaling policy based on SQS with your machine description in CloudFormation.
use UserData/cloudinit to set up the machine and start your application
write code to receive the queue entry, start processing, finish processing, then delete the SQS message.
code should check for another queued entry. If none, code should terminate machine.