I wanted to know whether a WSO2 CEP/Siddhi query supports returning multiple rows, and if so, how data from those rows can be mapped to the output XML. For example, my event stream has a field statusCode which can have the values A/B/C, and I want to write a query that gives me the count by status type for the past 5 minutes, e.g. A-10, B-5, C-2. In my current query I used group by statusCode to get the count per status.
My query (abridged):
...insert into TestStream statusCode, count(statusCode) as count group by statusCode
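For reference, a complete query of this shape in the Siddhi 2.x syntax used by WSO2 CEP 2.x might look as follows. This is only a sketch: StatusStream is an assumed input stream name, and 300000 is 5 minutes expressed in milliseconds.

from StatusStream#window.time(300000)
insert into TestStream statusCode, count(statusCode) as count
group by statusCode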
My output XML mapping is something like:
<statusSmry>
<status>{statusCode}</status>
<count>{count}</count>
</statusSmry>
The output I receive is something like:
<statusSmry>
<status>A</status>
<count>10</count>
</statusSmry>
.....
<statusSmry>
<status>B</status>
<count>5</count>
</statusSmry>
....
<statusSmry>
<status>C</status>
<count>2</count>
</statusSmry>
Is it possible to get the results of the query in a single XML document, i.e. in the above case the counts for A, B and C in one XML?
Thanks
Rajiv
What you asked is not possible in Siddhi.
This is because whenever there is an input event, the total count is updated, and at the same time an output for the corresponding updated group needs to be triggered to notify the subscribers. Since this is a real-time process, Siddhi cannot accumulate all the events and output them as one event/XML. If it were to accumulate events, there would be the question of how long to accumulate for (one second? one day?) and in what format the output should be sent; therefore it (WSO2 CEP 2.0.1) currently does not support accumulation.
If you need this feature, you will have to send the output of CEP to an ESB and run some kind of aggregation process there.
Suho
Related
So we have 100 different types of messages coming into our Kinesis stream. We only want to save 4 types. I know Kinesis can transform messages, but can it filter as well? How is this done?
Filtering is just a transform in which you decide not to output anything. You indicate this by returning the record with the result value "Dropped", as per the documentation.
You can find an example of a transform at this post; its logic covers several cases: letting records pass through untransformed (status "Ok"), transforming and emitting a record (again, status "Ok"), dropping (i.e. filtering out) a record (status "Dropped"), and signalling an error with the status "ProcessingFailed".
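As a rough sketch, the record a transformation Lambda returns for a filtered-out event looks like this per the Firehose data-transformation contract (recordId must echo the incoming record's id; for "Ok" results a base64-encoded data field is also required):

{
  "recordId": "<recordId copied from the input record>",
  "result": "Dropped"
}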
I am working with Kinesis Analytics, and I am trying to understand how to write my application to give me a sliding window over 24 hours. What I have generates the right data, but it looks like it regenerates it every time; that might be what it is supposed to do, and my own ignorance may be preventing me from looking at the problem correctly.
What I want to do:
I have a few devices that feed a Kinesis stream, which this Kinesis Analytics application is hooked up to.
Now, when a record comes in, what I want to do is SUM a value over the last 24 hours and store that. So after Kinesis Analytics does its job, I'm connecting it to a Lambda to finalize some things.
My issue is that when I simulate sending in some data (5 records in this case), everything runs, but it runs multiple times, not 5. It LOOKS like each time a record comes in, it redoes everything in the window (expected), which triggers the Lambda for each row that is emitted. As the table grows, that is bad news. What I really want is just the latest value from the window from NOW - 24 HOUR, with the "id" field so I can join that "id" back to a record stored elsewhere.
My Application looks like this:
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM "id",
"timestamp_mark",
"device_id",
"x_id",
"y_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
);
Hmmm... this might be a better idea: do the aggregation in a sub-select and select from that. It looks like I need that second window (W2 below) to ensure I get each record that was given back out.
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM s."id",
s."timestamp_mark",
s."device_id",
s."property_a_id",
s."property_b_id",
v."value"
FROM "SOURCE_SQL_STREAM_001" OVER W2 AS s, (
SELECT STREAM "SOURCE_SQL_STREAM_001"."ROWTIME", "id",
"timestamp_mark",
"device_id",
"property_a_id",
"property_b_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
)
) AS v
WHERE s."id" = v."id"
WINDOW W2 AS (
RANGE INTERVAL '1' SECOND PRECEDING
);
Also I notice that if I restart the Kinesis Analytics application, the SUM values reset, so clearly it doesn't persist across restarts, which might make it unsuitable for this solution. I might have to just set up a SQL server and periodically delete old records.
In general, using streaming analytics solutions (and Kinesis Analytics in particular) is recommended when you need to act on the data in the events, not on something external like wall-clock time.
The reason is simple: if you need to do something once every 24 h, you create a job that brings the data in from storage (a DB) once, performs your task, and then "goes to sleep" for another 24 h; no complexities, manageable overhead. If instead you need to do something based on the data (e.g. when the SUM of some field across multiple events exceeds X), you are in trouble with the conventional solution, since there is no simple criterion for when it should run. If you run it periodically, it might be invoked many times before the data-driven criterion is met, creating clear overhead.
In the latter case, a streaming analytics solution is used as designed and triggers your logic just when needed, minimizing the overhead.
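To illustrate the first (conventional) case, a scheduled job could run a single aggregate like the following once a day. The table and column names here are assumptions, and the date functions are T-SQL flavored; treat it as a sketch, not a prescription.

-- Sum the last 24 hours of readings per device/property combination
SELECT device_id, property_a_id, property_b_id, SUM(value) AS total_value
FROM device_readings
WHERE timestamp_mark >= DATEADD(HOUR, -24, GETUTCDATE())
GROUP BY device_id, property_a_id, property_b_id;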
If you prefer using streaming analytics (which I personally don't recommend, based on the description of your problem) but are struggling with the Kinesis Analytics syntax, you might consider using Drools Kinesis Analytics. Among its features are crons and collectors, which give you a very simple way to trigger jobs on a time basis.
Note that my answer is biased, since I'm the CTO at Streamx.
I'm new to Azure Stream Analytics. I'm using it to get feedback from users. I'm sending about 50 events per second to Azure, and I'm trying to get a combined result from two inputs, but I couldn't get a working output. My problem is the SQL query for the output.
These are the inputs I'm sending in:
Recommandations:
{"appId":"1","sequentialId":"28","ItemId":"1589018","similaristyValue":"0.104257207028537","orderId":"0"}
ShownLog:
{"appId":"1","sequentialId":"28","ItemId":"1589018"}
I need to join them on sequentialId and ItemId and calculate the difference between two ordered sums.
For example: I send 10 Recommandations events and after that (say after 2 seconds) I send 3 ShownLog events. What I need to do is get the sum of the similaristyValue of the first 3 events (because I sent 3 ShownLog events) from "Recommandations", ordered by orderId. I also need to get the sum of the similarity values for the "ShownLog" events. At the end I need an output like this (for every sequentialId):
sequentialID Difference
168 1.21
What I've done so far: I save all the inputs to my Azure SQL database, and I've managed to write the SQL I want. Here is the MSSQL query for it:
declare @sumofSimValue float;
declare @totalItemCount int;
declare @seqId float;
select
    @sumofSimValue = sum(b.[similarityValue]),
    @totalItemCount = count(*),
    @seqId = a.sequentialId
from EventHubShownLog a
inner join EventHubResult b
    on a.sequentialId = b.sequentialId and a.ItemId = b.ItemId
group by a.sequentialId
--select @sumofSimValue, @totalItemCount, @seqId
SELECT @seqId, SUM([similarityValue]) - @sumofSimValue
FROM (
    SELECT TOP(@totalItemCount) [similarityValue]
    FROM [EventHubResult]
    WHERE sequentialId = @seqId
    ORDER BY orderId
) AS T
But it gives lots of errors in Stream Analytics, and it also doesn't fit the logic of Azure Stream Analytics. I hope I have explained the problem.
Can you tell me how I can do such a job for my system? How can I use time windows, or how can I join the inputs properly?
For every shown log, you have to select the sum of the similarity value. Is that the intention? Why not just join and select the sum? It would only select as many rows as there are shown logs.
One thing to decide is the maximum time difference between recommendation events and shown-log events; with that you can use an Azure Stream Analytics join, https://msdn.microsoft.com/en-us/library/azure/dn835026.aspx
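A sketch of such a join in Stream Analytics SQL, assuming the two inputs are named Recommandations and ShownLog, that matching events arrive within 60 seconds of each other, and that arrival time serves as the timestamp (all of these are assumptions):

SELECT
    r.sequentialId,
    -- similaristyValue arrives as a string in the sample payload, so cast before summing
    SUM(CAST(r.similaristyValue AS float)) AS shownSimilaritySum
INTO Output
FROM Recommandations r
JOIN ShownLog s
    ON r.sequentialId = s.sequentialId
    AND r.ItemId = s.ItemId
    AND DATEDIFF(second, r, s) BETWEEN 0 AND 60
GROUP BY r.sequentialId, TumblingWindow(second, 60)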
I have a project that uses an event hub to receive data. Data is sent every second and received by a website using SignalR; this is all working fine. I have been storing the data in blob storage via a Stream Analytics job, but this is really slow to access, and with the amount of data I am receiving from just 6 devices, it will only get slower as this increases. I need to access the data to display historical graphs on the website, topped up with the live data coming in.
I don't really need to store the data every second, so I thought about only storing it every 30 seconds instead, but into a SQL DB. What I am trying to do is still receive the data every second but only store it every 30. I have tried a tumbling window, but from what I can see, this just dumps everything every 30 seconds instead of a single entry.
Am I misunderstanding the tumbling, sliding and hopping windows? I am guessing I cannot use them in this way? If that is the case, I am guessing the only way to do it would be to have the output DB as an input, so I can cross-reference the timestamp with the current time?
Unless anyone has any other ideas? Any help would be appreciated.
Thanks
Am I misunderstanding the tumbling, sliding and hopping windows?
You are correct that these will put all events within the tumbling/sliding/hopping window together. However, this is only valid within a GROUP BY, which requires an aggregate function over the group.
There is an aggregate function Collect() which will create an array of all the events within a group.
I think this should be possible if you group every event within a 30-second tumbling window using Collect(), and then in the next step CROSS APPLY each record, which should output all events received within those 30 seconds.
WITH Grouper AS (
    SELECT Collect() AS records
    FROM Input TIMESTAMP BY time
    GROUP BY TumblingWindow(second, 30)
)
SELECT
    record.ArrayValue.FieldA AS FieldA,
    record.ArrayValue.FieldB AS FieldB
INTO Output
FROM Grouper
CROSS APPLY GetArrayElements(Grouper.records) AS record
If you are trying to aggregate 30 entries into one summary row every 30 seconds, then a tumbling window is a good choice. Something like the following should work:
SELECT System.Timestamp AS OutTime, TollId, COUNT(*) AS cnt, SUM(TollCharge) AS TollCharge
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 30)
Thanks for the response. I have been speaking to my contact at Microsoft and he suggested something similar; I had also found something like that in various examples online. What I actually want to do is only update the database with the data every 30 seconds: I will receive the event, store it, and not store it again until 30 seconds have passed. I am not sure how I can do that with an ASA job, to be honest, as I need a record of the last time it was updated. I actually have a connection to the event hub from my website, so in the receiver I am going to perform a simple check and then store the data from there.
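For completeness, one way to emit just a single entry per 30 seconds inside the ASA job itself is the TopOne() aggregate, which returns one record per group. This is only a sketch; the time field and the Input/Output names are assumptions:

SELECT TopOne() OVER (ORDER BY time DESC) AS lastEvent
INTO Output
FROM Input TIMESTAMP BY time
GROUP BY TumblingWindow(second, 30)

Downstream consumers would then read the event's fields as lastEvent.<field>.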
I'm new to CEP 2.1, and my question is about the time frame for which the CEP holds on to the input stream.
Let's say that you regularly send data to some input stream, say "HELLOSTREAM".
For how long does the CEP save the inputs to the stream? What is the maximum time, etc.?
Say I send data every day for 365 days: will I get back all the data on day 366, or will it truncate the data at some point (e.g. hold only the last 100 days), no matter what time window I set in the query?
Is there a limit?
CEP is a real-time processing server. It is used to find pre-defined patterns in real time and for real-time monitoring. It keeps the data in memory and processes the events, but you can persist the data to Cassandra for distributed processing.
Here, data is kept in memory based on the window size that you define; how much depends on the window type you are using and the time or length given to that window. If you are not using any window, it will not keep any data in memory.
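For example, a query like the following (in the Siddhi 2.x syntax, with assumed attribute names) keeps only the last 10 minutes of HELLOSTREAM events in memory in order to maintain a running average; older events are evicted as the window slides:

from HELLOSTREAM#window.time(600000)
insert into AvgStream name, avg(value) as avgValue
group by name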
If you want to store data for 365 days or 100 days, then it is not a real-time use case. For that you have to use an offline processing server like BAM.
To add to @Mohanadarshan's answer: if what you want is to extract and store some values from a fast event stream over a long period, a better CEP-based approach will be to use persistent event tables (which will be included in the upcoming CEP version 3.0.0, to be released soon). This way you'll be able to do real-time processing against some extracted and persisted data. But as @Mohanadarshan has mentioned, if your requirement is batch processing (and you do not need to detect anything in real time), WSO2 BAM will be a better option.
Using sliding windows over a very long period to store large amounts of data is not a good idea, as the data is stored in memory, and you'll also lose it if the server goes down.