WSO2 CEP event lifecycle

Is there any document/article explaining the event lifecycle in WSO2 CEP?
I don't quite understand how events are discarded from the event streams.
Thank you,
Hugo Calado

Events are discarded immediately. The basic flow is that a stream receives events from event receivers and immediately pushes them to the publishers without storing anything. If you want to collect events for a certain time period, you can use something like time windows in Siddhi execution plans [1].
The following Siddhi query collects events for 10 minutes, calculates the average temperature, and inserts the result into AvgTempStream. In that case the events are kept in memory for 10 minutes.
from TempStream#window.time(10 min)
select avg(temp) as avgTemp, roomNo, deviceID
insert all events into AvgTempStream;
[1] https://docs.wso2.com/display/CEP400/SiddhiQL+Guide+3.0#SiddhiQLGuide3.0-Window

Related

How to tell when Lambdas complete processing of all messages in SQS

Currently I have a process where a Lambda (A) gets triggered; it has logic to find out which customers another Lambda (B) needs to run for, and it places those requests on a queue. For any run there could be 3k to 4k messages placed on the SQS queue by Lambda A to be picked up by Lambda B for processing. As Lambda B communicates with an external API, its concurrency is set to 10 so as not to overload the API. The whole process completes in 35 to 45 minutes.
My problem is how to tell when all the processing is complete?
If you don't need timely information, you could check out the CloudWatch Metrics that SQS offers, e.g.:
ApproximateNumberOfMessagesVisible
The number of messages available for retrieval from the queue.
Reporting Criteria: A non-negative value is reported if the queue is active.
and
ApproximateNumberOfMessagesNotVisible
The number of messages that are in flight. Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
Reporting Criteria: A non-negative value is reported if the queue is active.
If the sum of these two metrics hits zero, no messages are in the Queue, and processing should be done.
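A rough sketch of that check with boto3 (the queue name and the 10-minute look-back window are placeholders; the metrics are only emitted while the queue is active, so treat "no datapoints" as unknown rather than as done):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

METRICS = ("ApproximateNumberOfMessagesVisible",
           "ApproximateNumberOfMessagesNotVisible")

def queue_looks_drained(queue_name):
    """Return True if the latest visible + in-flight counts sum to zero."""
    total = 0
    for metric in METRICS:
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/SQS",
            MetricName=metric,
            Dimensions=[{"Name": "QueueName", "Value": queue_name}],
            StartTime=datetime.utcnow() - timedelta(minutes=10),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=["Maximum"],
        )
        datapoints = resp["Datapoints"]
        if not datapoints:
            return False  # no recent data; assume work may still be pending
        latest = max(datapoints, key=lambda d: d["Timestamp"])
        total += latest["Maximum"]
    return total == 0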
If you need more timely information, the producer of the messages could increment a counter item in DynamoDB with the number of messages added, and each Lambda decrements that counter once it's done. You could then add a Lambda to the DynamoDB Stream of that table with a filter and do something when the value changes to zero again. This is, however, much more complex.
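A sketch of that counter approach, assuming a DynamoDB table named job-progress with partition key jobId (both names made up for the example); Lambda A seeds the count after queueing, and each Lambda B invocation decrements it atomically:

import boto3

table = boto3.resource("dynamodb").Table("job-progress")  # assumed table name

def seed_counter(job_id, message_count):
    """Called by Lambda A after it has queued all messages for the run."""
    table.put_item(Item={"jobId": job_id, "remaining": message_count})

def decrement_counter(job_id):
    """Called by Lambda B once per processed message; returns messages left."""
    resp = table.update_item(
        Key={"jobId": job_id},
        UpdateExpression="ADD remaining :dec",
        ExpressionAttributeValues={":dec": -1},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["remaining"])  # 0 means the run is complete

A DynamoDB Stream on this table (filtered to remaining = 0) can then trigger whatever should happen at the end of the run.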
A third option could be to transform the whole thing into a Step Function and use a Map state with a parallelization factor to work on the tasks. The drawback is that, as far as I know, the length of the list it can work on is limited.

I want to know when a batch of messages has completed in a AWS SQS Queue

I think this is more of an 'architecture design' question.
I have a Lambda producer that will put ~600 messages on an SQS queue as a batch (there are multiple producers), so not 1 message with a body of ~600 messages. A consumer Lambda then takes the individual messages and deals with them (at scale). What I want to do is run another Lambda when each batch is complete.
My initial idea was to create a 'unique batch number', a 'total batch count' and a 'batch position number', add them to the message attributes of every message, and then check these in the consumer Lambda to decide whether the batch is complete.
But does that mean I would need to use a FIFO queue, partition on the batch number, and only have one Lambda consumer per batch? Or do I run some sort of state management in DynamoDB (is there a pattern out there for this? Please guide me).
Regards, J
It seems like the goal is to achieve fork-join capabilities in a distributed system. One way to handle this in AWS is using Step Functions. Assuming a queue service needs to be used, the state of the overall operation will need to be tracked. Some ways to do this are:
Store the state of the overall operation in a DB (see the sketch below).
Put a 'termination' message in the queue after all others and process FIFO.
Create a metadata service which receives 'start' and 'stop' messages for each service and handles them accordingly.
Reference: Fork and Join with Amazon Lambda
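As a rough illustration of the first option combined with the message-attribute idea from the question (the table name batch-state, the attribute names, and the follow-up function name are all assumptions, not an established pattern): each message carries its batch id and the batch total, the consumer seeds and decrements a counter atomically, and the last message triggers the follow-up Lambda.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("batch-state")      # assumed table, partition key: batchId
lambda_client = boto3.client("lambda")

def handler(event, context):
    """Consumer Lambda: process each SQS record, then track batch progress."""
    for record in event["Records"]:
        attrs = record["messageAttributes"]
        batch_id = attrs["batchId"]["stringValue"]
        batch_total = int(attrs["batchTotal"]["stringValue"])

        process(record["body"])  # your per-message work

        # Atomically count this message; if_not_exists seeds the counter from
        # the total carried on every message, so no separate init step is needed.
        resp = table.update_item(
            Key={"batchId": batch_id},
            UpdateExpression="SET remaining = if_not_exists(remaining, :total) - :one",
            ExpressionAttributeValues={":total": batch_total, ":one": 1},
            ReturnValues="UPDATED_NEW",
        )
        if int(resp["Attributes"]["remaining"]) == 0:
            # The whole batch is done: kick off the follow-up Lambda (name assumed).
            lambda_client.invoke(
                FunctionName="batch-complete-handler",
                InvocationType="Event",
                Payload=json.dumps({"batchId": batch_id}).encode(),
            )

def process(body):
    ...  # actual per-message handling

This avoids FIFO ordering requirements, since the counter, not message order, decides completion.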

How to debounce events on AWS grouped by a key?

Our frontend application sends user actions to a lambda function behind an API gateway, which then stores these actions in dynamodb.
We then use dynamodb streams to trigger a separate lambda function that'll parse these actions in dynamodb and decide if the user's actions should result in any notifications being sent (we call these notification events).
For example, if a user places a comment in our app, we'll store a "CREATED_COMMENT" action in dynamodb, which will then trigger a new lambda through a dynamodb stream. The new lambda may then create an "email notification event", which we may send to an email provider like customer.io
However, our users have informed us that they receive emails too frequently, and thus we'd like to start sending email digests aggregating multiple actions over time into a single email rather than sending an email for each action.
Our idea was to forward the DynamoDB stream actions to something like AWS EventBridge, Kinesis, Step Functions, or even another DynamoDB stream, and then configure the new stream's events to be grouped by email address and debounced by e.g. 10 minutes. If the user then performs a new action, that user's stream will continue gathering actions for another 10 minutes, until there have been no new actions from that user for 10 minutes. Once that happens, the stream will "release" all gathered actions and invoke a Lambda function. Our Lambda function will then generate the email notification event and send it to e.g. customer.io.
However, we've been unable to find such grouping and debounced flushing configuration in any of the aforementioned AWS stream services. For such a common thing as digesting (or rolling up), shouldn't there be a serverless approach to doing this without having to write our own queueing service?
The answer to me seems to be a tool such as SQS. SQS will allow you to accumulate messages in a queue, and every x minutes you can read the queue using a Lambda function running on a schedule event. You do not need to have the Lambda triggered by SQS; you can still read the queue "manually" from within the Lambda instead.
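A rough sketch of that scheduled "manual" read (the queue URL is a placeholder; receive_message returns at most 10 messages per call, so the loop drains the queue in pages):

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/user-actions"  # placeholder

def handler(event, context):
    """Scheduled (e.g. EventBridge rule) Lambda that drains the debounce queue."""
    gathered = []
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # SQS hard limit per call
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        gathered.extend(messages)
        # Delete what we've taken so it isn't redelivered once the
        # visibility timeout expires.
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                     for m in messages],
        )
    aggregate_and_notify(gathered)    # group by user and send one digest each

def aggregate_and_notify(messages):
    ...  # build the digests from everything gathered in this run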
Gareth McCumskey is on the right track.
Use a normal SQS queue strictly for debouncing.
Set a batch window, e.g. 5 seconds, and use a really large batch size when you read from the queue.
In code, use a hash map to group messages with the same messageId together, then use the deduplicated messageIds to do your work.
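A minimal sketch of that dedupe step inside the SQS-triggered handler (standard queues can deliver a message more than once, so keying on messageId collapses duplicates within the batch):

import json

def handler(event, context):
    """Collapse the large SQS batch so each messageId is handled exactly once."""
    by_id = {}
    for record in event["Records"]:
        by_id.setdefault(record["messageId"], record)   # keep first copy, drop dupes

    for message_id, record in by_id.items():
        handle_action(json.loads(record["body"]))       # the real work, once per id

def handle_action(action):
    ...  # e.g. fold the action into the pending digest for that user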
I wrote a blog post on something just like this. The short version of it is that it uses a scheduled Lambda function to identify the records that need to be processed.
The problem with using the delay in SQS is that you can only receive 10 messages at a time, so in order to get all the messages you'd have to call SQS repeatedly to clear the queue. At that point, you can aggregate the messages. This doesn't scale very well, as all the messages have to be read in order for it to work. By using DynamoDB you can actually have just one record that represents the collection of records, and query the single record, which then can result in a message in a queue for that specific group of messages. Consider the following data:
user | comment | time
user 1 | comment 1 | 11:43am
user 1 | comment 2 | 11:50am
user 2 | comment 1 | 11:51am
You can add another record per user that signals when a message needs to be sent (in this example, 15 minutes after that user's first message).
user | scheduled
user 1 | 11:58
user 2 | 12:06
When you insert the second set of records you are inserting the time at which you want to send the batch. You only do the insert if there isn't a record already, so you don't end up constantly increasing the time. Your scheduled process reads that record to know which users it needs to send messages to, and collects all the data for each user. Sending the messages to each user can be done in parallel (you could send a message to SQS for each user, or use a Map state in a Step Function, for example).
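A sketch of that conditional insert for the 'scheduled' record, assuming a table named notification-schedule keyed by user and a 15-minute delay (both assumptions for illustration):

from datetime import datetime, timedelta
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("notification-schedule")  # assumed name

def schedule_digest(user_id):
    """Insert the 'send at' record only if one doesn't already exist."""
    send_at = (datetime.utcnow() + timedelta(minutes=15)).isoformat()
    try:
        table.put_item(
            Item={"user": user_id, "scheduled": send_at},
            ConditionExpression="attribute_not_exists(#u)",
            ExpressionAttributeNames={"#u": "user"},  # 'user' is a reserved word
        )
    except ClientError as err:
        # ConditionalCheckFailed means a digest is already scheduled; keep its time.
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise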

Kinesis stream pending message count

I am trying to use an AWS Kinesis stream for one of our data streams. I would like to monitor the pending messages on my stream for ops purposes (scaling downstream consumers according to the backlog), but I can't find any API that gives the (approximate) number of pending messages in the stream.
This seems strange: messages expire after 7 days, and if the producers and consumers are isolated and can't communicate, how do you know messages are expiring? How do you handle this problem?
Thanks!
There is no such concept as a "pending" message in Kinesis. All incoming data is placed on a shard.
Your consumer application should be running all the time to keep track of changes in your stream. The application (with the help of the KCL) keeps polling the shard iterator in the background, so you are notified about new data as it arrives.
Roughly, you can see Kinesis as a FIFO queue whose messages disappear after a while if you don't consume them.
If your application only processes a few messages an hour, you should think about changing your architecture; Kinesis is probably not the right tool for you.
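That said, if what you really need is a backlog signal for scaling, one rough option (a sketch, not an exact pending-message count) is to watch GetRecords.MillisBehindLatest, which your consumer's reads report and which CloudWatch exposes per stream (the stream name below is a placeholder):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
STREAM = "my-data-stream"  # placeholder stream name

def consumer_lag_ms():
    """Latest GetRecords.MillisBehindLatest for the stream (0 = caught up)."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="GetRecords.MillisBehindLatest",
        Dimensions=[{"Name": "StreamName", "Value": STREAM}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Maximum"],
    )
    points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return points[-1]["Maximum"] if points else None

Note that this is a time lag, not a message count, and it is only meaningful while a consumer is actually calling GetRecords.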

WSO2 CEP output event adaptor error

I am new to WSO2 CEP
I have created the entire flow to read the JMS message and split it using the text formatter. The problem is that when I try to push messages into the queue, they never reach the output event adaptor. I have a MySQL event adaptor configured in my event formatter, but I keep getting the message below in my log:
[2014-02-13 21:20:06,347] ERROR - {ReceiverGroup} No receiver is reachable at reconnection, can't publish the events
[2014-02-13 21:20:06,352] ERROR - {AsyncDataPublisher} Reconnection failed for for tcp://localhost:7661
Can someone help me understand what this tcp://localhost:7661 is all about?
Regards
Subbu
tcp://localhost:7661 is the default port to which Thrift (WSO2) events are published. It seems a default event formatter has been created and is trying to publish events to that port.
Can you check your list of event formatters and ensure that no event formatters of type WSO2Event are created? Such an event formatter might be created automatically if you set an exported stream to 'pass-through' when creating the execution plan.
You can enable event tracing [1] and monitor it to determine exactly how far the event gets in your configured flow.
[1] http://docs.wso2.org/display/CEP300/CEP+Event+Tracer
HTH,
Lasantha