How to do calculations/comparisons on 2 windows? - wso2

I'm collecting events with different ids; there are n fixed types of ids in the incoming events. I want to compute the average of past events, based on a time frame or a number of events, separately for each type of id.
Let's say there are 2 devices sending data/events with ids 'a' and 'b'. I want to get the average of the past 5 minutes of data for both devices and then compare the two averages to make a decision.
With the code below, I'm collecting the past n minutes of data and storing it in 2 windows.
@source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', @map(type = 'json'))
define stream InProduction(name string, amount int);
define window hold_a(avg_amount double) length(1);
define window hold_b(avg_amount double) length(1);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_a;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_b;
Windows hold_a and hold_b will hold the average over the past 5 minutes. Now I want to compare the data from both windows and make a decision.
I've tried a join on both windows, but the join query doesn't get executed.

You have to use a pattern to achieve this. The query below will output the name with the highest average into HighestAvgStream.
@source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', @map(type = 'json'))
define stream InProduction(name string, amount int);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from every(e1=avgStream -> e2=avgStream)
select ifthenelse(e1.avg_amount>e2.avg_amount,e1.name,e2.name) as highestAvgName
insert into HighestAvgStream;
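To see the results, you can attach a log sink to HighestAvgStream. A minimal sketch (the prefix string is arbitrary, and the definition simply mirrors the query's output):
-- Sketch: print each highest-average result to the log.
@sink(type='log', prefix='HighestAvg')
define stream HighestAvgStream (highestAvgName string);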

Related

Kinesis Analytics Session or Stagger Window Batching Without Aggregation

I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on a filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min (20 matching the criteria), with the first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window across all child_id values, so it could cut a session up into multiple pieces, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.
Try LAST_VALUE("domain") as domain in the SELECT clause.
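Applied to the pump above, that looks something like the following sketch. Every selected column outside the PARTITION BY is wrapped in an aggregate; note this still emits one aggregated row per child_id per window, not all of the raw records:
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Sketch: columns not in PARTITION BY are wrapped in LAST_VALUE so the
-- stagger window accepts them.
SELECT STREAM
    LAST_VALUE("device_id") AS "device_id",
    "child_id",
    LAST_VALUE("domain") AS "domain",
    LAST_VALUE("category_id") AS "category_id",
    LAST_VALUE("posted_at") AS "posted_at",
    FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) AS "block"
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
    PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
    RANGE INTERVAL '15' MINUTE);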

Stream Analytics Aggregation Window

I need help/advice on how to ignore old events when performing aggregation over an extended window. I have sales data streaming into Event Hub.
Event Hub is used as the input stream. I need to produce two metrics:
- 30 sec aggregation (tumbling)
- whole-day aggregated sales value, i.e. from gate open
Gate open time is variable (dynamic), hence I read a reference dataset off the blob and join the GateOpen datetime to the sales stream.
The 30 sec aggregation over the tumbling window works fine.
Given that gate open is variable, I am currently using a 12 hour hopping window with a 30 sec hop and trying to limit the events being aggregated with EventProcessDatetime > GateOpen logic.
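For context, FilteredBase comes from the reference-data join; as a sketch (input and column names are assumptions), a WITH step like the following could precede the query below:
-- Sketch (assumed names): attach GateOpen from the blob reference data
-- to the sales stream, producing the FilteredBase used next.
WITH FilteredBase AS (
    SELECT
        s.Section,
        s.Classification,
        s.sales_amt_gross,
        s.ActivityDateTime,
        r.GateOpen
    FROM SalesInput s TIMESTAMP BY ActivityDateTime
    JOIN GateOpenRef r ON s.Section = r.Section
)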
SELECT
Dateadd(ss,-30,System.Timestamp ) AS TimeSliceUTCStart
, System.Timestamp AS TimeSliceUTCEnd
, p.Section AS Section
, SUM(CASE WHEN p.Classification = 'Retail'
AND p.ActivityDateTime > p.GateOpen THEN p.[sales_amt_gross] ELSE 0 END) AS SaleTotalRetail
FROM FilteredBase p
GROUP BY
p.Section
, HoppingWindow(Duration(Hour, 12), hop(second, 30),Offset(millisecond, -1))
Problem: I am getting sales aggregated from the previous day/timeslice.
Overall, the outcome I am trying to achieve is simple. The store could be open for 5, 8, 10, or 12 hours max. We want to be able to see sales as a live stream, for each section, as the day progresses. Any advice or tips will be much appreciated.
Intuitively the query looks good, but what happens under the cover is that Azure Stream Analytics uses the reference data file that was valid at the time of each time window. Then, when it sees an event from the previous day, it uses the reference data present at that time (which may make the comparison p.ActivityDateTime > p.GateOpen true for the previous opening time).
I modified the query as follows (supposing you have 1 open event per day per section). Let me know if it works for you. If it doesn't, can you send some sample data so I can modify the query accordingly? We will investigate how to make these queries easier to write.
WITH thirtysecReporting AS
(
SELECT
p.Section Section,
DATETIMEFROMPARTS(DATEPART(year, System.Timestamp), DATEPART(month, System.Timestamp), DATEPART(day, System.Timestamp), 0, 0, 0, 0) as date,
System.Timestamp Windowend,
SUM(p.sales_amt_gross) thirtysecSales
FROM input p TIMESTAMP BY ActivityDateTime
GROUP BY TumblingWindow(second, 30), p.Section
)
,hopping AS
(
SELECT
Section,
System.Timestamp HopEnd,
date,
SUM(thirtysecSales) SumSales
FROM thirtysecReporting
GROUP BY HoppingWindow(second, 86400, 30), Section, date -- hopping over 24 hours, reported every 30 seconds
)
,filtered as -- This step ignores data from the previous day
(
SELECT
Section,
HopEnd,
date,
SUMQt = CASE
WHEN DAY(HopEnd) = DAY(date) OR DATEPART(hour, HopEnd) = DATEPART(hour, date) THEN SumSales
ELSE 0
END
FROM hopping
)
SELECT Section, -- Final query
HopEnd,
MAX(SUMQt) AS SumQt
FROM filtered
GROUP BY TumblingWindow(hour, 1), Section, HopEnd
Thanks,
JS - Azure Stream Analytics

Power BI Dashboard where the core filter condition is a disjunction on numeric fields

We are trying to implement a dashboard that displays various tables, metrics and a map where the dataset is a list of customers. The primary filter condition is the disjunction of two numeric fields. We want the user to be able to select a threshold for [field 1] and a separate threshold for [field 2] and then impose the condition [field 1] >= <threshold> OR [field 2] >= <threshold>.
After that, we want to also allow various other interactive slicers so the user can restrict the data further, e.g. by country or account manager.
Power BI naturally imposes AND between all filters and doesn't have a neat way to specify OR. Can you suggest a way to define a calculation using the two numeric fields that is then applied as a filter within the same interactive dashboard screen? Alternatively, is there a way to first prompt the user for the two threshold values before the dashboard is displayed -- so when they click Submit on that parameter-setting screen they are then taken to the main dashboard screen with the disjunction already applied?
Added in response to a comment:
The data can be quite simple: no complexity there. The complexity is in getting the user interface to enable a disjunction.
Suppose the data was a list of customers with customer id, country, gender, total value of transactions in the last 12 months, and number of purchases in last 12 months. I want the end-user (with no technical skills) to specify a minimum threshold for total value (e.g. $1,000) and number of purchases (e.g. 10) and then restrict the data set to those where total value of transactions in the last 12 months > $1,000 OR number of purchases in last 12 months > 10.
After doing that, I want to allow the user to see the data set on a dashboard (e.g. with a table and a graph) and from there select other filters (e.g. gender=male, country=Australia).
The key here is to create separate parameter tables and combine conditions using a measure.
Suppose we have the following Sales table:
Customer  Value  Number
-----------------------
A           568       2
B          2451      12
C          1352       9
D           876       6
E           993      11
F          2208      20
G          1612       4
Then we'll create two new tables to use as parameters. You could do a calculated table like
Number = VALUES(Sales[Number])
Or something more complex like
Value = GENERATESERIES(0, ROUNDUP(MAX(Sales[Value]),-2), ROUNDUP(MAX(Sales[Value]),-2)/10)
Or define the table manually using Enter Data or some other way.
In any case, once you have these tables, name their columns what you want (I used MinNumber and MinValue) and write your filtering measure
Filter = IF(MAX(Sales[Number]) > MIN('Number'[MinNumber]) ||
            MAX(Sales[Value]) > MIN('Value'[MinValue]),
         1, 0)
Then add the Filter measure as a visual-level filter with the condition "Filter is not 0", and use the MinNumber and MinValue columns as slicers.
If you select 10 for MinNumber and 1000 for MinValue, the table keeps only B, C, E, F, and G. Notice that E and G only exceed one of the thresholds and that A and D are excluded.
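As a sanity check, the same logic with the thresholds hard-coded can be written as a calculated table (a sketch; FilteredSales is a name introduced here for illustration):
-- Sketch: static equivalent of the slicer-driven filter, for validation only.
FilteredSales =
FILTER ( Sales, Sales[Number] > 10 || Sales[Value] > 1000 )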
To my knowledge, there is no such built-in slicer feature in Power BI at the time of writing. There is, however, a suggestion in the Power BI forum requesting functionality like this. If you're willing to use the Power Query Editor, it's easy to obtain the values you're looking for, but only with hard-coded values for your limits or thresholds.
Let me show you how for a synthetic dataset that should fit the structure of your description:
Dataset:
CustomerID,Country,Gender,TransactionValue12,NPurchases12
51,USA,M,3516,1
58,USA,M,3308,12
57,USA,M,7360,19
54,USA,M,2052,6
51,USA,M,4889,5
57,USA,M,4746,6
50,USA,M,3803,3
58,USA,M,4113,24
57,USA,M,7421,17
58,USA,M,1774,24
50,USA,F,8984,5
52,USA,F,1436,22
52,USA,F,2137,9
58,USA,F,9933,25
50,Canada,F,7050,16
56,Canada,F,7202,5
54,Canada,F,2096,19
59,Canada,F,4639,9
58,Canada,F,5724,25
56,Canada,F,4885,5
57,Canada,F,6212,4
54,Canada,F,5016,16
55,Canada,F,7340,21
60,Canada,F,7883,6
55,Canada,M,5884,12
60,UK,M,2328,12
52,UK,M,7826,1
58,UK,M,2542,11
56,UK,M,9304,3
54,UK,M,3685,16
58,UK,M,6440,16
50,UK,M,2469,13
57,UK,M,7827,6
Desktop table:
Here you see an Input table and a subset table using two Slicers. If the forum suggestion gets implemented, it should hopefully be easy to change a subset like below to an "OR" scenario:
Transaction Value > 1000 OR Number of purchases > 10 using Power Query:
If you use Edit Queries > Advanced filter you can set it up like this:
The last step under Applied Steps will then contain this formula:
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > 10 or [TransactionValue12] > 1000)
Now your original Input table will look like this:
Now, if only we were able to replace the hardcoded 10 and 1000 with a dynamic value, for example from a slicer, we would be fine! But no...
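For what it's worth, Power Query parameters can at least centralize those literals, though they stay fixed at refresh time rather than being slicer-driven. A sketch, where MinPurchases and MinValue are hypothetical parameters created via Manage Parameters:
// MinPurchases and MinValue are assumed Power Query parameters, not slicer values.
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > MinPurchases or [TransactionValue12] > MinValue)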
I know this is not what you were looking for, but it was the best 'negative answer' I could find. I guess I'm hoping for a better solution just as much as you are!

wso2 cep - SiddhiQL - sum different rows from the same event table

I have one event table called eventCount that has the following values:
ID | eventCount
---|-----------
 1 |          3
 2 |          1
 3 |          5
 4 |          1
I have a stream of data coming in where I count the values of a certain type over a time period (1 second); depending on the type and time period, I run count() and write its value into the corresponding row.
I need to make a sum of the values within the event table.
I tried to create another event table and join both, but I get an error saying you cannot join two static sources.
What is the correct way of doing this in SiddhiQL on WSO2 CEP?
In your scenario, the sum of the values in the event table is equivalent to the total number of events, isn't it? So why do you need to keep an event table? Can't you just count then and there (like below)?
@Import('dataIn:1.0.0')
define stream dataIn (id int);
@Export('totalCountStream:1.0.0')
define stream totalCountStream (eventCount long);
@Export('perIdCountStream:1.0.0')
define stream perIdCountStream (id int, eventCount long);
partition with (id of dataIn)
begin
from dataIn#window.time(5 sec)
select id, count() as eventCount
insert into perIdCountStream;
end;
from dataIn#window.time(5 sec)
select count() as eventCount
insert into totalCountStream;
PS: if you really need the event tables, you can always persist totalCountStream and perIdCountStream in two separate tables.
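A minimal sketch of that persistence (the table names here are assumptions):
-- Sketch: store the running counts in tables for later lookups.
define table totalCounts (eventCount long);
define table perIdCounts (id int, eventCount long);

from totalCountStream
select eventCount
insert into totalCounts;

from perIdCountStream
select id, eventCount
insert into perIdCounts;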

How to update a stream with the response from another stream where the sink type is "http-response"

I am trying to enrich my input stream with an additional attribute that gets populated via the "http-response" sink.
I have tried using a "join" with a window attribute, and also the "every" keyword, to merge two streams and insert the merged result into another stream to enrich it.
The window attributes (window.time(1 sec) or window.length(1)) and the "every" keyword work well when the incoming events arrive at a regular interval of 1 second or more.
When many events (say 10 or 100) are sent at the same time (within a second), the result of the merge is not as expected.
The one with the "window" attribute (join):
from EventInputStreamOne#window.time(1 sec) as i
join EventInputStreamTwo as s
on i.variable2 == s.variable2
select i.variable1 as variable1, i.variable2 as variable2, s.variable3 as variable3
insert into EventOutputStream;
The one with the "every" keyword:
from every e1=EventInputStream,e2=EventResponseStream
select e1.variable1 as variable1, e1.variable2 as variable2, e2.variable3 as variable3
insert into EventOutputStream;
Is there any better way to merge the two streams in order to update a third stream?
To get the original request attributes, you can use custom mapping as follows,
@source(type='http-call-response', sink.id='source-1',
    @map(type='json', @attributes(name='name', id='id', volume='trp:volume', price='trp:price')))
define stream responseStream(name string, id int, headers string, volume long, price float);
Here, the request attributes can be accessed with trp:attributeName. In this sample, only name comes from the response; price and volume come from the request.
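For this to work end to end, the outbound request is published through an http-call sink whose sink.id matches the source above. A sketch; the publisher URL, method, and request stream name are assumptions:
-- Sketch: outbound request stream; sink.id 'source-1' ties it to the
-- http-call-response source above.
@sink(type='http-call', sink.id='source-1', publisher.url='http://localhost:8005/enrich', method='POST',
    @map(type='json'))
define stream requestStream(name string, id int, volume long, price float);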
The syntax in your 'every' keyword approach isn't quite right. Have you tried something like this:
from every (e1 = event1) -> e2=event2[e1.variable == e2.variable]
select e1.variable1, e2.variable1, e2.variable2
insert into outputEvent;
This document might help.