Siddhi: state persistence and realtime queryable state? - wso2

My Scenario is like this.
I want to see/query the current aggregation value(s) of a particular query's active processing window.
I have seen this in Apache Flink.
For example:
Say I have a query that counts the total number of failures, with a 12-hour window, and I want to ask (from another application) what the current count is for the active aggregating window. Note that the active window is still processing.
The reason is that my application needs to give feedback to the user about the current total failure count, so they can act on it. Waiting until the window has been processed and only then getting the count is not the desired behavior from the user's perspective.
Is this possible? If so how?

One option is to use a rolling time window. A rolling time window gives you the rolling aggregation (sum, count, etc.) for a given time range, so for every incoming event you get an output event with the count, which you can use to give feedback. There are two catches with this approach. One is that it is a rolling count, not a batch count. The other is that processing is triggered only by events arriving on the count stream; if you want to trigger the feedback based on some other requirement (e.g. user initiated, every hour, etc.), this approach will not work. For that you need the approach below.
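A minimal sketch of the rolling-count option, assuming Siddhi is embedded through its Java API (Siddhi 5.x package names; the stream and attribute names are only assumptions based on the failure-count scenario):

import io.siddhi.core.SiddhiAppRuntime;
import io.siddhi.core.SiddhiManager;
import io.siddhi.core.event.Event;
import io.siddhi.core.stream.input.InputHandler;
import io.siddhi.core.stream.output.StreamCallback;

public class RollingFailureCount {
    public static void main(String[] args) throws Exception {
        String app =
            "define stream countStream (userid string, reason string); " +
            // Sliding time window: every incoming event emits the failure
            // count observed over the last 12 hours.
            "from countStream#window.time(12 hours) " +
            "select count() as failuresSoFar " +
            "insert into FeedbackStream;";

        SiddhiManager siddhiManager = new SiddhiManager();
        SiddhiAppRuntime runtime = siddhiManager.createSiddhiAppRuntime(app);
        runtime.addCallback("FeedbackStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event event : events) {
                    System.out.println("Current failure count: " + event.getData(0));
                }
            }
        });
        runtime.start();

        InputHandler input = runtime.getInputHandler("countStream");
        input.send(new Object[]{"user-1", "disk failure"}); // emits 1
        input.send(new Object[]{"user-2", "disk failure"}); // emits 2

        runtime.shutdown();
        siddhiManager.shutdown();
    }
}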
Use a time batch window and then join it with another stream that gets triggered according to the business requirement. Below is a sample, and here are the testcases for your reference.
from countStream#window.timeBatch(12 hours) right outer join
feedbackTriggerStream#window.length(1)
select count() as totalFailures
insert into FeedbackStream;
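On the application side, the feedback trigger is just another stream you push an event into whenever the feedback is actually wanted (user initiated, hourly timer, etc.). A rough sketch, assuming the app above runs in a SiddhiAppRuntime called runtime and feedbackTriggerStream is defined with a single attribute such as requestId string:

// Fire the trigger when the user asks for the current failure count;
// the join then emits the current batch count to FeedbackStream.
InputHandler feedbackTrigger = runtime.getInputHandler("feedbackTriggerStream");
feedbackTrigger.send(new Object[]{"req-1"});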
Another option is to use the query feature. This approach is suitable if you are using Siddhi as a library and have access to the SiddhiAppRuntime. Below is a code sample for that. Let's assume the following is your window query to calculate the count.
define window countWindow(userid string, reason string) timeBatch(12 hours);

from countStream
select *
insert into countWindow;
Then you can use queries as below to access window data.
Event[] events = siddhiAppRuntime.query(
"from countWindow " +
"select count() as totalCount ");
events will contain one event with the count. Here is a reference to the testcases.

Related

DynamoDB Single Table Design guidance

This is my first time building a single-table design, and I was just wondering if anyone had any advice/feedback/better approaches for the following plan.
I'm going to build a basic 'meetup' clone, so e.g. users can create events, and then users can attend those events.
How the entities in the app relate to each other:
Entities (I also added an 'ItemType' to each entity, so e.g. ItemType=Event):
Key Structure:
Access Patterns:
Get All Attendees for an event
Get All events for a specific user
Get all events
Get a single event
Global Secondary Indexes:
Inverted Index: SK-PK-index
ItemType-SK-Index
Queries:
1. Get all attendees for an event:
PK=EVENT#E1
SK=ATTENDEE# (begins with)
2. Get All Events for a specific User
Index: SK-PK-index
SK=ATTENDEE#User1
PK=EVENT# (Begins With)
3. Get All Events (I feel like there's a much better way to do this, if there is please let me know)
Index: ItemType-SK-Index
ItemType=Event
SK=EVENT# (Begins With)
4. Get a single event
PK=EVENT#E1
SK=EVENT#E1
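For what it's worth, here is roughly how I picture access pattern 1 with the AWS SDK for Java v2 (the table name is a placeholder; everything else follows the key structure above):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class GetAttendeesForEvent {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Access pattern 1: all ATTENDEE# items in the event's partition.
        QueryRequest request = QueryRequest.builder()
                .tableName("MeetupTable") // placeholder table name
                .keyConditionExpression("PK = :pk AND begins_with(SK, :sk)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("EVENT#E1").build(),
                        ":sk", AttributeValue.builder().s("ATTENDEE#").build()))
                .build();

        QueryResponse response = ddb.query(request);
        response.items().forEach(System.out::println);
    }
}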
A couple of questions I had:
When returning a list of attendees, I'd want to be able to get extra data for each attendee, e.g. first/last name, etc.
Based on this example: https://aws.amazon.com/getting-started/hands-on/design-a-database-for-a-mobile-app-with-dynamodb/module-5/
To avoid duplicating data and having to handle data changes (e.g. a user changes their name), should I use partial normalization and the BatchGetItem API to retrieve the details?
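The partial-normalization version of that would store only the user's key on each attendee item and hydrate the names afterwards with BatchGetItem. A sketch with the AWS SDK for Java v2 (the USER#... key shape and the FirstName/LastName attribute names are assumptions):

import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.BatchGetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.BatchGetItemResponse;
import software.amazon.awssdk.services.dynamodb.model.KeysAndAttributes;

public class HydrateAttendeeNames {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Keys of the user items referenced by the attendee items
        // (BatchGetItem accepts up to 100 keys per call).
        List<Map<String, AttributeValue>> userKeys = List.of(
                Map.of("PK", AttributeValue.builder().s("USER#User1").build(),
                       "SK", AttributeValue.builder().s("USER#User1").build()),
                Map.of("PK", AttributeValue.builder().s("USER#User2").build(),
                       "SK", AttributeValue.builder().s("USER#User2").build()));

        BatchGetItemRequest request = BatchGetItemRequest.builder()
                .requestItems(Map.of("MeetupTable", KeysAndAttributes.builder()
                        .keys(userKeys)
                        .projectionExpression("FirstName, LastName") // placeholder attribute names
                        .build()))
                .build();

        BatchGetItemResponse response = ddb.batchGetItem(request);
        response.responses().get("MeetupTable").forEach(System.out::println);
    }
}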
For fuzzy searches etc., is the best approach to stream this data into e.g. Elasticsearch/OpenSearch?
If so, when building APIs, would you still use DynamoDB for some queries, or just use Elasticsearch for everything?
e.g. for Get All Events - would using an ItemType of 'Events' end up creating a hot partition if there's a huge number of events?
Sorry for the long post, Would appreciate any feedback/advice/better ways to do things, thank you!!

Azure WebJobs - how to keep the state?

I need to implement some kind of orchestrator as a WebJob, so it needs to keep state for some kind of internal queue.
Is there any way, apart from a static field or a database, to keep that state?
In general the idea is simple: I have a calculation job. It gets, e.g., a ProductId and performs calculations for it. That takes some time, so when another message arrives for the same ProductId I need to wait until the previous calculation finishes. At the same time, I can pick up a message for another ProductId if there are no running calculations for it.
I haven't found any way to process messages sequentially based on a specific condition, so I ended up with the idea of implementing a stateful orchestrator to do the trick.
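To make the requirement concrete, what I mean is essentially one FIFO lane per ProductId: messages for the same product run strictly one after another, while different products run in parallel. A sketch of that idea (Java just for illustration; the WebJob itself would use the equivalent .NET primitives, and all names here are made up):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One single-threaded "lane" per ProductId: calculations for the same product
// are serialized, while different products proceed concurrently. This state
// lives in memory only, i.e. it is exactly the static-field situation above
// and is lost whenever the process restarts.
public class ProductCalculationOrchestrator {
    private final ConcurrentHashMap<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    public void submit(String productId, Runnable calculation) {
        lanes.computeIfAbsent(productId, id -> Executors.newSingleThreadExecutor())
             .submit(calculation);
    }

    public void shutdown() {
        lanes.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) {
        ProductCalculationOrchestrator orchestrator = new ProductCalculationOrchestrator();
        orchestrator.submit("product-1", () -> System.out.println("calc 1 for product-1"));
        orchestrator.submit("product-1", () -> System.out.println("calc 2 for product-1 (waits for calc 1)"));
        orchestrator.submit("product-2", () -> System.out.println("calc for product-2 (runs concurrently)"));
        orchestrator.shutdown(); // let the pending tasks finish so the JVM can exit
    }
}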
Am I doing this the wrong way?

Reprocess batches of items over and over again - and the batch might change any time

I am just looking for ideas on how to solve one specific thing I'd like to build.
Say I have two sets of items. Each item is just a couple of lines of JSON. Any time an item is added to one set, I want to (almost) immediately process it against the full other set. So when an item is added to set A, process it against each item in set B, and vice versa.
Items come in through API Gateway + Lambda. Match processing in Lambda from a queue/stream.
What AWS technology would be a good fit? I have no idea and no clear pattern on when or how often the sets change. Also, I want it to be as strongly consistent as possible. And of course, I want it to be as serverless and cost-effective as possible. :)
Options could be:
sets stored in Aurora, match processing for a new item in A would need to query the full set B from the database each time
sets stored in DynamoDB, maybe with DynamoDB stream in the background; match processing for a new item in A would need to query the full set B from Dynamo; but spiky load, not a good fit because of unclear read/write provisioning
have each set in its own "static" Kinesis stream where match processing reads through items but doesn't trim. Streams to be replaced with fresh sets regularly
My pain point is: while processing items from A, there might be thousands of items in B to be matched, and I want to avoid having to load the full set B from some database every time I process an item from A. I was thinking about some caching of the sets, but then I would need a good way to invalidate that cache whenever something changes.
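One shape the caching idea could take, purely as a sketch: keep set B in a field that survives warm Lambda invocations and reload it only when a version counter changes (readSetBVersion and loadSetB are hypothetical helpers, e.g. backed by a DynamoDB counter item that writers bump on every change):

import java.util.List;

// Sketch only: cache set B across warm invocations and refresh it when a
// version counter (bumped by whatever writes set B) has moved on.
public class SetBCache {
    private static List<String> cachedSetB;  // cached items of set B (JSON strings)
    private static long cachedVersion = -1;  // version the cache was loaded at

    public static synchronized List<String> getSetB() {
        long currentVersion = readSetBVersion(); // hypothetical: read the version/counter item
        if (cachedSetB == null || currentVersion != cachedVersion) {
            cachedSetB = loadSetB();             // hypothetical: full read of set B
            cachedVersion = currentVersion;
        }
        return cachedSetB;
    }

    // Placeholders; in a real setup these would hit DynamoDB/Aurora.
    private static long readSetBVersion() { return 1L; }
    private static List<String> loadSetB() { return List.of("{\"id\":\"b1\"}", "{\"id\":\"b2\"}"); }
}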

Database polling, prevent duplicate fetches

I have a system whereby a central MSSQL database keeps in a table a queue of jobs that need to be done.
Because processing requirements would not be that high, and there would not be a particularly high frequency of requests (probably once every few seconds at most), we decided to have the applications that use the queue simply query the database whenever a job is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query(ies) involved and returns a job ID. The client application then fetches the job information by querying by ID and sets the job as handled.
Performance is fine; the only snag we have felt is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs), two clients pick up the same job.
As a way of solving this problem, I was suggesting that the initial stored procedure "tag" the record it pulls with the date and time. When querying for records, the stored procedure will only pull records whose "tag" is a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second run will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
DECLARE @ReadID uniqueidentifier;
DECLARE @BatchSize int = 20; -- make this a parameter of your procedure
SET @ReadID = NEWID();

UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id;

SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID;
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx
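On the client side, the dequeue-and-process loop could look roughly like this with JDBC (the procedure name, connection string and column names are placeholders, and the procedure should SET NOCOUNT ON so the final SELECT is returned as the first result set):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class JobPoller {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbserver;databaseName=Jobs;user=app;password=secret";
        try (Connection conn = DriverManager.getConnection(url);
             // Hypothetical procedure wrapping the UPDATE ... SELECT batch above.
             CallableStatement dequeue = conn.prepareCall("{call dbo.DequeueJobs(?)}")) {
            dequeue.setInt(1, 20); // batch size
            try (ResultSet rs = dequeue.executeQuery()) {
                while (rs.next()) {
                    long jobId = rs.getLong("Id");
                    // process the job, then mark it handled
                    System.out.println("Picked up job " + jobId);
                }
            }
        }
    }
}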

Auto-increment on Azure Table Storage

I am currently developing an application for Azure Table Storage. In that application I have a table which will have relatively few inserts (a couple of thousand per day), and the primary key of these entities will be used in another table, which will have billions of rows.
Therefore I am looking for a way to use an auto-incremented integer, instead of GUID, as primary key in the small table (since it will save lots of storage and scalability of the inserts is not really an issue).
There've been some discussions on the topic, e.g. on http://social.msdn.microsoft.com/Forums/en/windowsazure/thread/6b7d1ece-301b-44f1-85ab-eeb274349797.
However, since concurrency problems can be really hard to debug and spot, I am a bit uncomfortable with implementing this on my own. My question is therefore whether there is a well-tested implementation of this?
For everyone who finds this in a search: there is a better solution. The minimum time for the table lock is 15 seconds - that's awful. Do not use it if you want to create a truly scalable solution. Use the ETag instead!
Create one entity in the table for the ID (you can even name it ID or whatever).
1) Read it.
2) Increment.
3) InsertOrUpdate WITH ETag specified (from the read query).
If the last operation (InsertOrUpdate) succeeds, then you have a new, unique, auto-incremented ID. If it fails (an exception with HttpStatusCode == 412), it means that some other client changed it, so repeat steps 1, 2 and 3.
The usual time for Read + InsertOrUpdate is less than 200 ms. My test utility, with source, is on GitHub.
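The loop itself is the same regardless of which table SDK version you use; the only important part is passing the ETag from the read into the conditional replace. A sketch against a hypothetical table client (CounterTable and Counter are stand-ins, not a real SDK type):

// Hypothetical minimal view of the counter entity and table client; a real
// implementation would use the Azure Table Storage SDK's entity and ETag support.
record Counter(long value, String etag) {}

interface CounterTable {
    Counter read();                                     // 1) read the counter entity
    boolean replaceIfMatch(long newValue, String etag); // 3) replace only if the ETag still matches (otherwise 412)
}

public class IdGenerator {
    private final CounterTable table;

    public IdGenerator(CounterTable table) {
        this.table = table;
    }

    public long nextId() {
        while (true) {
            Counter current = table.read();       // 1) read
            long candidate = current.value() + 1; // 2) increment
            if (table.replaceIfMatch(candidate, current.etag())) {
                return candidate;                 // 3) conditional write succeeded
            }
            // Another client won the race (HTTP 412): retry with a fresh read.
        }
    }
}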
See UniqueIdGenerator class by Josh Twist.
I haven't implemented this yet but am working on it ...
You could seed a queue with your next IDs to use, then just pick them off the queue when you need them.
You need to keep a table containing the value of the biggest number added to the queue. If you know you won't be using a ton of the integers, you could have a worker wake up every so often and make sure the queue still has integers in it. You could also have a used-int queue the worker could check to keep an eye on usage.
You could also hook that worker up so that, if the queue happened to be empty when your code needed an ID, it could interrupt the worker's nap to create more keys ASAP.
If that call failed, you would need a way to tell the worker that you are going to do the work for it (lock), then do the worker's job of getting the next ID, and unlock:
lock
get the last key created from the table
increment and save
unlock
then use the new value.
The solution I found that prevents duplicate IDs and lets you auto-increment is to:
Lock (lease) a blob and let that act as a logical gate.
Then read the value.
Write the incremented value.
Release the lease.
Use the value in your app/table.
If your worker role were to crash during that process, you would only have a missing ID in your store; IMHO that is better than duplicates.
Here is a code sample and more information on this approach from Steve Marx
If you really need to avoid GUIDs, have you considered using something based on date/time and then leveraging partition keys to minimize the concurrency risk?
Your partition key could be by user, year, month, day, hour, etc and the row key could be the rest of the datetime at a small enough timespan to control concurrency.
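For example, the keys could be built something like this (bucket sizes and formats are illustrative only):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimeBasedKeys {
    public static void main(String[] args) {
        Instant now = Instant.now();

        // Partition key: user plus a coarse time bucket (here: per hour) so
        // concurrent writers are spread across partitions.
        String partitionKey = "user42_" + DateTimeFormatter.ofPattern("yyyyMMddHH")
                .withZone(ZoneOffset.UTC).format(now);

        // Row key: the rest of the timestamp, fine-grained enough that two
        // inserts for the same user in the same hour rarely collide.
        String rowKey = DateTimeFormatter.ofPattern("mmssSSS")
                .withZone(ZoneOffset.UTC).format(now);

        System.out.println(partitionKey + " / " + rowKey);
    }
}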
Of course, you have to ask yourself, at the price of data storage in Azure, whether avoiding a GUID is really worth all of this extra effort (assuming a GUID would just work).