AWS Kinesis Stream - Losing Messages - amazon-web-services

I just had something very strange happen to me with an AWS Kinesis Stream...
I wrote a simple unit test that places a JSON message into an already created stream (the stream was created through the AWS Console). Then my unit test immediately reads that message from the stream. This test process was working for me all last week. I took the weekend off and came back on Monday. Suddenly my unit test stopped working. It could still place the message into the stream but when it tried to read the message out of the stream it found no records inside the stream. No code changes were made at all.
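For reference, a put-then-read test along these lines might look like the following. This is a minimal sketch, assuming the AWS SDK for JavaScript v3, a hypothetical stream name, and a single-shard stream read from TRIM_HORIZON:

const { KinesisClient, PutRecordCommand, GetShardIteratorCommand,
        GetRecordsCommand } = require("@aws-sdk/client-kinesis");

const client = new KinesisClient({});
const streamName = "my-test-stream"; // hypothetical stream name

async function putThenRead() {
    // Put a JSON message into the stream.
    await client.send(new PutRecordCommand({
        StreamName: streamName,
        PartitionKey: "test",
        Data: Buffer.from(JSON.stringify({ hello: "world" })),
    }));

    // Read the single shard from the beginning. An iterator of type
    // LATEST would skip records written before the iterator was created.
    const { ShardIterator } = await client.send(new GetShardIteratorCommand({
        StreamName: streamName,
        ShardId: "shardId-000000000000", // the only shard in a 1-shard stream
        ShardIteratorType: "TRIM_HORIZON",
    }));

    const { Records } = await client.send(new GetRecordsCommand({ ShardIterator }));
    console.log("Records read:", Records.length);
}

putThenRead().catch(console.error);

Note that GetRecords can legitimately return an empty batch even when records exist, so a real test may need to loop on NextShardIterator a few times before concluding the stream is empty.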
So I deleted the stream through the AWS Console and created a new stream. Told my unit test to use the new stream (that was the only code change done) and everything started working again. So I find this very strange because it's like my original stream just stopped working. The AWS Console said the original stream was still active. Both streams only had one shard.
Does anyone know what could have happened? Is this something that may happen again in the future or is it just a weird anomaly? It makes me very nervous to eventually use this in a production environment if I cannot rely on the AWS Kinesis Stream.

Related

Answering machine with Amazon Connect

We are trying to receive customer calls through Amazon Connect and leave messages in Amazon Kinesis.
When we call Amazon Connect from our cell phones, the expected message plays and the beep tone sounds as expected. But then the call ends and we cannot leave a message. We tried removing Wait and Stop media streaming, but the problem persisted. What are we doing wrong?
Set Voice: OK
Play prompt(Message): OK
Play prompt(Beep): OK
Start media streaming: NG (fails here)
If you have a simple, easy to understand sample for this application, let me know!
Looks like the problem is your Wait block. Wait isn't supported for voice calls, so it immediately errors.
Replace the Wait block with a Get Customer Input block. Use Text to speech for the prompt, set the prompt value manually to <speak></speak>, and set Interpret as to SSML. Set it to detect DTMF and set the timeout to however long the message is allowed to be; from your flow above, that is 10 seconds.
This should get the customer's voice sent to the Kinesis stream, and you can process the stream from there.
There is a really thorough implementation guide for voicemail here. I've used it and then altered it to suit my exact needs in the past.

AWS Media Live - handle stream start and stop events

I cannot find any information on how to handle a situation like this:
The stream starts at about 3 o'clock.
1. Before the person who is streaming (let's call him a streamer) starts to stream, I would like to show a static image saying something like: 'The event will start soon'.
2. The streamer starts pushing his stream to the RTMP endpoint, but he's late and starts at 3:02. Up until 3:02 the same picture should be visible (as in point 1).
3. The streamer should finish at 4 o'clock, but he finishes 5 minutes before 4 (pressing stop on his device).
4. Now the ending screen should be visible from 3:55 onward.
I know that inputs should be switched in order to change the view, and that this can be scheduled for a fixed time, but I would like the switch to happen dynamically, i.e. when the streamer starts pushing to the RTMP URL and stops pushing to the RTMP URL (from e.g. Larix software). How can I handle that in AWS MediaLive?
Thank you for asking this question on Stack Overflow. The easiest way to achieve what you are looking to do is with an Input Prepare scheduled action. The channel will then monitor the input and raise an alarm if the RTMP source is not there; when the RTMP source begins, the alarm will clear. You can send these alarms to a Lambda function that looks for them and does the switch from the slate MP4 to the RTMP source when it sees that the 'RTMP input missing' alarm was cleared. The same can be done for when the RTMP input goes away; a sketch of such a Lambda follows.
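A minimal sketch of the Lambda side of this pattern might look like the following (assuming the AWS SDK for JavaScript v3; the channel ID, the input attachment names, and the alert field names on the CloudWatch event are assumptions to verify against your setup):

const { MediaLiveClient, BatchUpdateScheduleCommand } =
    require("@aws-sdk/client-medialive");

const client = new MediaLiveClient({});
const CHANNEL_ID = "1234567"; // hypothetical channel ID

exports.handler = async (event) => {
    // MediaLive channel alerts arrive as CloudWatch events; the
    // "alarm_state" field name and its SET/CLEARED values are
    // assumptions to check against the actual event payload.
    const detail = event.detail || {};
    const inputMissing = detail.alarm_state === "SET";

    // Switch to the slate input while the RTMP feed is missing, and
    // back to the live input once the alarm clears. The attachment
    // names are hypothetical and must match the channel's inputs.
    const targetInput = inputMissing ? "slate-input" : "rtmp-input";

    await client.send(new BatchUpdateScheduleCommand({
        ChannelId: CHANNEL_ID,
        Creates: {
            ScheduleActions: [{
                ActionName: "switch-" + Date.now(),
                ScheduleActionStartSettings: {
                    ImmediateModeScheduleActionStartSettings: {},
                },
                ScheduleActionSettings: {
                    InputSwitchSettings: {
                        InputAttachmentNameReference: targetInput,
                    },
                },
            }],
        },
    }));
};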
Information on Prepare Inputs:
https://docs.aws.amazon.com/medialive/latest/ug/feature-prepare-input.html
Global configuration - Input loss behavior:
https://docs.aws.amazon.com/medialive/latest/ug/creating-a-channel-step3.html
Zach

Azure Web Job reading message from Service Bus doesn't delete message afterwards

The scenario here is that we have a Service Bus queue and a web job. The web job reads the message from the Service Bus queue and calls a logic app, which then goes on and does other stuff.
The problem we are facing is that after the web job reads the message from the Service Bus, it occasionally doesn't delete it afterwards, which causes the logic app to be called over and over and flood our database with data.
Here is the message in question as seen from azure management studio:
https://gyazo.com/7f57b460421d1bb4a69fcb8b5a9ff01f
As you can see, there is no lock time on the message. I have tried to play around with the settings to no avail.
When I manually try to delete that message from Azure Management Studio, that is also unsuccessful, but no error message is received.
Does anyone know what is going on here? I feel like this is a problem with the queue itself rather than a bug in our code, since the 2-3 tools that I have used are all unable to delete this message from the queue.
It looks like the message is only deleted after a specific time (it does not go to the dead-letter queue, however).
Thanks
So just for information, I figured my own issue out. When the file scraper job runs, it puts a message on the service bus. The web job that then runs and picks up that message stores the file it just picked up locally as well as on blob storage.
The problem was that the web job keeps a local queue of what it has processed, which was never cleared, so every time the web job ran it processed all the previous files as well.

Amazon Kinesis & AWS Lambda Retries

I'm very new to Amazon Kinesis, so maybe this is just a problem in my understanding, but in the AWS Lambda FAQ it says:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
My question is: what happens if some malformed data gets put onto a shard by a producer, and when the Lambda function picks it up it errors out and then just keeps retrying? The processing of that particular shard would then be blocked for 24 hours by the error.
Is the best practice to handle application errors like that by wrapping the problem in a custom error and sending this error downstream along with all the successfully processed records and let the consumer handle it? Of course, this still wouldn't help in the case of an unrecoverable error that crashed the program like a null pointer: again we'd be back to the blocking retry loop for the next 24 hours.
Don't overthink it; Kinesis is just a queue. You have to consume a record (i.e. pop it from the queue) successfully in order to proceed to the next one, just like with a FIFO queue.
The appropriate approach would be:
1. Get a record from the stream.
2. Process it in a try-catch-finally block.
3. If the record is processed successfully, no problem. (TRY)
4. But if it fails, note it down somewhere else so you can investigate the reason why it failed. (CATCH)
5. At the end of your logic block, always persist the position to DynamoDB. (FINALLY)
If an internal error occurs in your system (memory error, hardware error, etc.), that is another story, as it may affect the processing of all of the records, not just one.
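A rough sketch of that try/catch/finally shape in Node.js (the processRecord business logic, the checkpoint table name, and its key schema are illustrative assumptions):

const { DynamoDBClient, PutItemCommand } = require("@aws-sdk/client-dynamodb");

const ddb = new DynamoDBClient({});

async function handleRecord(record) {
    try {
        // TRY: your business logic (processRecord is hypothetical).
        await processRecord(record);
    } catch (err) {
        // CATCH: note the failure somewhere you can investigate later.
        console.error("Failed record", record.kinesis.sequenceNumber, err);
    } finally {
        // FINALLY: always persist the position, so one bad record
        // cannot block the shard. Table and key names are hypothetical.
        await ddb.send(new PutItemCommand({
            TableName: "kinesis-checkpoints",
            Item: {
                shardId: { S: record.eventID.split(":")[0] },
                sequenceNumber: { S: record.kinesis.sequenceNumber },
            },
        }));
    }
}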
By the way, if processing a record takes more than 1 minute, it is obvious you are doing something wrong. Kinesis is designed to handle thousands of records per second, so you should not have the luxury of running such long jobs for each of them.
The question you are asking touches on a general problem of queue systems, sometimes called a "poison message". You have to handle such messages in your business logic to be safe.
http://www.cogin.com/articles/SurvivingPoisonMessages.php#PoisonMessages
This is a common question about processing events in Kinesis, and I'll try to give you some pointers for building your Lambda function to handle such issues with "corrupted" data. Since it is best practice to have separate parts of your system writing to the Kinesis stream and other parts reading from it, it is common to run into such problems.
First, why do you have such problematic events?
Using Kinesis to process your events is a good way to break up a complex system that does both front-end processing (serving end users) and, in the same code, back-end processing (analyzing events) into two independent parts. The front-end people can focus on their business, while the back-end people don't need to push code changes to the front-end if they want to add functionality for their analytic use cases. Kinesis is a buffer of events that both removes the need for synchronization and simplifies the business-logic code.
Therefore, we would like events written to the stream to be flexible in their "schema", and if the front-end teams wish to change the event format (add fields, delete fields, change the protocol or the encryption keys), they should be able to do that as often as they want.
Now it is up to the teams that read from the stream to process such flexible events in an efficient way, and not to break their processing every time such a change happens. Therefore, it should be common for your Lambda function to see events that it can't process, and a "poison pill" is not as rare an event as you might expect.
Second, how do you handle such problematic events?
Your Lambda function will get a batch of events to process. Please note that you shouldn't get the events one by one, but in large batches of events. If your batches are too small, you will quickly get large lags on the stream.
For each batch you iterate over the events, process them, and then checkpoint the last sequence ID of the batch in DynamoDB. Lambda does most of these steps for you automatically (see more here: http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser-create-test-function.html):
console.log('Loading function');

exports.handler = function(event, context) {
    console.log(JSON.stringify(event, null, 2));
    event.Records.forEach(function(record) {
        // Kinesis data is base64 encoded, so decode it here
        var payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        console.log('Decoded payload:', payload);
    });
    context.succeed();
};
This is what happens on the "happy path", when all the events are processed without any problem. But if you encounter any problem in the batch and you don't "commit" the events with the success notification, the batch will fail and you will get all the events in the batch again.
Now you need to decide what the reason for the failure is:
Temporary problem (throttling, network issue...) - it is OK to wait a second and try again a couple of times. In many cases the issue will resolve itself.
Occasional problem (out of memory...) - it is best to increase the memory allocation of the Lambda function or decrease the batch size. In many cases such a modification will resolve the issue.
Constant failure - it means that you have to either ignore the problematic event (put it in a DLQ - dead-letter queue) or modify your code to handle it.
The problem is identifying the type of failure in your code and handling each type differently. You need to write your Lambda code in a way that identifies the failure type (the type of exception, for example) and reacts accordingly.
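One way to express that, sketched under the assumption that your own code raises distinguishable error types (the TransientError class, processEvent, and sendToDlq below are all hypothetical):

class TransientError extends Error {} // throttling, network issues...

exports.handler = async (event) => {
    for (const record of event.Records) {
        const payload = Buffer.from(record.kinesis.data, "base64").toString();
        try {
            await processEvent(payload); // hypothetical business logic
        } catch (err) {
            if (err instanceof TransientError) {
                // Rethrow so Lambda retries the whole batch.
                throw err;
            }
            // Anything else is treated as a permanent failure: record
            // it instead of blocking the shard.
            console.error("Dropping poison record",
                record.kinesis.sequenceNumber, err);
            await sendToDlq(record); // hypothetical dead-letter helper
        }
    }
};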
You can use the integration with CloudWatch to write such failures to the console and create the relevant alarms. You can also use CloudWatch Logs as a way to log your "dead-letter queue" and see what the source of the problem is.
In your Lambda you can either throw an error, and thus return the whole batch, or you can avoid throwing an error and instead push the record to an SQS queue to handle those messages differently. SQS has a retention period of up to 14 days. You could also keep a checkpoint for each record, so that you know whether the record was processed in a previous run.
If you have a lot of incoming data and you don't want to introduce any latency, you could just swallow the error and move on, while adding those events to an SQS queue; a sketch of this follows.
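A minimal sketch of that catch-and-forward pattern, assuming the AWS SDK for JavaScript v3 (the queue URL and the handle function are illustrative assumptions):

const { SQSClient, SendMessageCommand } = require("@aws-sdk/client-sqs");

const sqs = new SQSClient({});
// Hypothetical queue URL for the failed events.
const DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/failed-events";

exports.handler = async (event) => {
    for (const record of event.Records) {
        try {
            await handle(record); // hypothetical business logic
        } catch (err) {
            // Don't rethrow: park the failed record in SQS (retained up
            // to 14 days) and keep the shard moving.
            await sqs.send(new SendMessageCommand({
                QueueUrl: DLQ_URL,
                MessageBody: JSON.stringify({
                    data: record.kinesis.data,
                    error: String(err),
                }),
            }));
        }
    }
};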

How can I block writes on a pipeline for reconfiguration of the pipeline?

I'm writing a server that regularly needs to change the format of the sent/received messages. When this happens, the server should send a notification that all future messages will have the new format, and keep reading received messages in the old format until the client sends its ack.
I thought about keeping a reference to the decoder, shared by all pipelines, and reconfiguring it from the outside as needed, but I'm worried about concurrency in this case.
How can I make sure that no writes are handled by the pipeline while I'm working on the decoder? And how can I be sure that the notification is the first message handled after reconfiguration?
The only other way I see is to send a "notification" object through the pipeline (using channel.write), catch the object in the decoder, and do the reconfiguration there while forwarding the notification message. In this case there shouldn't be any concurrency in the pipeline.
Would this be the better / state-of-the-art way to do this?
I decided to use the second way: a StateHandler catches ConfigurationEvents and reconfigures the pipeline. Unfortunately this means that I can't be sure that all channels use the same configuration, because race conditions between the reconfiguration and extremely young channels can happen, but I'm pretty sure this won't matter in my case.