WSO2 CEP: How to deal with multiple joins?

I'm trying to solve a simple task:
1. I want to correlate the occurrence of 3 events A, B, C in case they happen within the last 10 seconds.
Since Siddhi supports only a two-stream join per query, I don't think I can solve this directly. The documentation suggests using multiple queries and chaining them together like this:
from A#window.time(10 sec) as a
join B#window.time(10 sec) as b on a.id == b.id
select a.id
insert into tempA
from tempA#window.time(10 sec) as a
join C#window.time(10 sec) as c on c.id == a.id
select *
insert into finalResult
But this produces wrong results, because events in the tempA stream can live longer than intended; the time windows are not aligned.
Maybe I'm missing something. Any advice?
Thanks

To solve this, you can try the following approach:
For each incoming event, add a timestamp. (You can also do this in your client instead of doing it in CEP.)
Replace the time windows with external time windows.
Use the previously added timestamp field as the time reference for the external time windows.
Since the timestamps are then global and all the external time windows operate according to them, this should work properly.
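For reference, a minimal sketch of the rewritten queries, assuming every stream (A, B, C) carries a long attribute named ts holding the event timestamp (the attribute name is illustrative):
from A#window.externalTime(ts, 10 sec) as a
join B#window.externalTime(ts, 10 sec) as b on a.id == b.id
select a.id as id, a.ts as ts
insert into tempA;
from tempA#window.externalTime(ts, 10 sec) as t
join C#window.externalTime(ts, 10 sec) as c on c.id == t.id
select t.id as id
insert into finalResult;
Because every window now expires events based on the ts values carried by the events themselves, tempA entries cannot outlive the original 10-second window.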

Related

AWS IoT Analytics Delta Window

I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore, and if I don't include any delta window I get the records as expected. Unfortunately, when I include a delta expression I get nothing (ever). I have left the data accumulating for about 10 days now, so there are a couple of million records in the datastore. Since I was unsure what the optimal format would be, I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt, which uses the same format as my datetime except that it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (written manually, with no configured delta window) with the correct values for the offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions; you can remove it for the purposes of troubleshooting.
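For example, with the schema from the question, a -1 hour offset and from_unixtime(timestamp_sec) as the time expression, the manual equivalent (dropping the __dt optimization) would look roughly like this; the offset and expression are only an assumed starting point:
SELECT * FROM "dev_iot_analytics_datastore" WHERE
  iota_latest_succeeded_schedule_time() - interval '1' hour < from_unixtime(timestamp_sec) AND
  from_unixtime(timestamp_sec) <= iota_current_schedule_time() - interval '1' hour
If this returns rows but the configured delta window does not, the problem lies in the window configuration rather than in the time expression.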
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on again'. I deleted the whole analytics stack, rebuilt everything with different names, and it now seems to be working!
It's important to note that even though I have flagged this as the correct answer because it was the actual resolution, both the answers provided by @Populus and @Roger would have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.

SQL not returning when executed on top of a large data set

I have the SQL below, which gets stuck in an Oracle database for more than 2 hours. This only happens when it is executed via the C++ application. Interestingly, while it is stuck I can execute it through SQL Developer manually and it returns within seconds. My table has millions of rows and about 100 columns. Can someone please point out how I can overcome this issue?
select *
from MY_TABLE
INNER JOIN ( (select max(concat(DATE ,concat('',to_char(INDEX, '0000000000')))) AS UNIQUE_ID
from MY_TABLE
WHERE ((DATE < '2018/01/29')
OR (DATE = '2018/01/29' AND INDEX <= 100000))
AND EXISTS ( select ID
from MY_TABLE
where DATE = '2018/01/29'
AND INDEX > 100000
AND LATEST =1)
group by ID ) SELECTED_SET )
ON SELECTED_SET.UNIQUE_ID = concat(DATE, concat('',to_char(INDEX, '0000000000')))
WHERE (FIELD_1 = 1 AND FIELD_2 = 1 AND FIELD_3='SomeString');
UPDATE:
db file sequential read is present on the session.
SELECT p3, count(*) FROM v$session_wait WHERE event='db file sequential read' GROUP BY p3;
+----+----------+
| P3 | COUNT(*) |
+----+----------+
|  1 |        2 |
+----+----------+
"I can execute it through sql developer manually and it returns within seconds"
Clearly the problem is not intrinsic to the query. So it must be a problem with your application.
Perhaps you have a slow network connection between your C++ application and the database. To check this you should talk to your network admin team. They are likely to be resistant to the suggestion that the network is the problem. So you may need to download and install Wireshark, and investigate it yourself.
Or your C++ is just very inefficient in handling the data. Is the code instrumented? Do you know what it's been doing for those two hours?
"the session is shown as 'buffer busy wait'"
Buffer busy waits indicate contention for blocks between sessions. If your application has many sessions running this query then you may have a problem. Buffer busy waits can indicate that there are sessions waiting on a full table scan to complete; but as the query returned results when you ran it in SQL Developer I think we can discount this. Perhaps there are other sessions updating MY_TABLE. How many sessions are reading or writing to it?
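One rough way to check (assuming you can query the V$ views; note that v$access can itself be slow on a busy instance) is:
SELECT sid, owner, object, type
FROM   v$access
WHERE  object = 'MY_TABLE';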
Also, what is the output of this query?
SELECT p3, count(*)
FROM v$session_wait
WHERE event='buffer busy wait'
GROUP BY p3
;
Worked with our DBA and he disabled the plan directives at system level using
alter system set "_optimizer_dsdir_usage_control"=0;
According to him, SQL plan directives had been created because of cardinality mis-estimates after executing the SQL. After disabling them the timing improved greatly and the problem is solved.
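For reference, on 12c you can list the directives recorded against the table with a query along these lines (standard 12c dictionary views; 'MY_SCHEMA' is a placeholder for the owning schema):
SELECT d.directive_id, d.type, d.state, d.reason
FROM   dba_sql_plan_directives d
JOIN   dba_sql_plan_dir_objects o ON o.directive_id = d.directive_id
WHERE  o.owner = 'MY_SCHEMA'
AND    o.object_name = 'MY_TABLE';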

DAX code for count and distinct count measures (or calculated columns)

I hope somebody can help me with some hints for the following analysis. Students may perform some actions for some courses (enrol, join, grant, ...) and also the reverse: cancel the latest action.
The first metric is to count all the actions that occurred in the system between two dates; these dates are exposed as a filter/slicer.
Some sample data :
person-id,person-name,course-name,event,event-rank,startDT,stopDT
11, John, CS101, enrol,1,2000-01-01,2000-03-31
11, John, CS101, grant,2,2000-04-01,2000-04-30
11, John, CS101, cancel,3,2000-04-01,2000-04-30
11, John, PHIL, enrol, 1, 2000-02-01,2000-03-31
11, John, PHIL, grant, 2, 2000-04-01,2000-04-30
The data set (ds) is above and I have added the following code for the count metric:
evaluate
{
    sumx(
        addcolumns( ds
            ,"z+", if([event] <> "cancel", 1, 0)
            ,"z-", if([event] = "cancel", -1, 0)
        )
        ,[z+] + [z-]
    )
}
The metric should display: 3 subscriptions (John-CS101 = 1, John-PHIL = 2).
There are some other rules, but I don't know how to add them to the DAX code: the cancel date is the same as that of the action above it (the non-cancel action), and the rank of the cancel action = the rank of the non-cancel action + 1.
Also I need to add the number of distinct student and course combinations, i.e. the composite key. How can I add this to the code, please? (via SUMMARIZE, RANKX?)
Regards,
Q
This isn't technically an answer, but more of a recommendation.
It sounds like your challenge is that you have actions that may then be cancelled. There is specific logic that determines whether an action is cancelled or not (i.e. the cancellation has to be the immediate next row and the dates must match).
What I would recommend, which doesn't answer your specific question, is to adjust your data model rather than put the cancellation logic in DAX.
For example, if you could add a column to your data model that flags a row as subsequently cancelled, then all DAX has to do is check that flag to know whether an action is cancelled or not: a simple CALCULATE statement. You don't have to have lots of logic to determine whether the event was cancelled, and you entirely eliminate the need for SUMX, which can be slow when working with a lot of rows since it works row by row.
The logic for whether an action is cancelled or not moves to your source system (e.g. SQL or even a calculated column in Excel), or to your ETL (e.g. the Query Editor in Power BI) which are better equipped for such tasks. The logic is applied 1 time and then exists in your data model for all measures, instead of needing to apply the logic each time a measure is used.
I know this doesn't help you solve your logic question, but the reason I make this recommendation is that DAX is fundamentally a giant calculator. It adds things up. It's great at filters (adding some things up but not others), but it works best when everything is reduced to columns that it can sum or count. Once you go beyond that (e.g. wanting to look at the row below to adjust something about the current row), your DAX is going to get very complicated (and slow), whereas a source system or the Query Editor will likely be able to handle such requirements more easily.
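As a rough sketch of that approach, suppose the model gains a column ds[IsCancelled] (TRUE on a row whose action was subsequently cancelled, populated in the source or the Query Editor) and a composite key column ds[StudentCourseKey]; both column names are hypothetical. The measures then reduce to simple filtered counts:
Subscriptions :=
CALCULATE (
    COUNTROWS ( ds ),
    ds[event] <> "cancel",      -- ignore the cancel rows themselves
    ds[IsCancelled] = FALSE ()  -- ignore actions that were later cancelled
)
DistinctStudentCourse :=
CALCULATE (
    DISTINCTCOUNT ( ds[StudentCourseKey] ),
    ds[event] <> "cancel",
    ds[IsCancelled] = FALSE ()
)
With the sample data this returns 3 for Subscriptions (one surviving action for John-CS101 and two for John-PHIL), matching the expected result.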

Which query is faster: top X or limit X when using order by in Amazon Redshift

Three options, on a table of events that are inserted with a timestamp.
Which query is faster/better?
Select a,b,c,d,e.. from tab1 order by timestamp desc limit 100
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc limit 100
When you ask a question like that, the EXPLAIN syntax is helpful. Just add this keyword at the beginning of your query and you will see the query plan. In cases 1 and 2 the plans will be absolutely identical. These are variations of SQL syntax, but the internal SQL interpreter should produce the same query plan, according to which the requested operations will be performed physically.
More about EXPLAIN command here: EXPLAIN in Redshift
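For example, prefix each variant with EXPLAIN and compare the output (the column is quoted here in case timestamp collides with a reserved word):
EXPLAIN SELECT a, b, c, d, e FROM tab1 ORDER BY "timestamp" DESC LIMIT 100;
EXPLAIN SELECT TOP 100 a, b, c, d, e FROM tab1 ORDER BY "timestamp" DESC;
Both should show an identical plan, as described above.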
You can get the result by running these queries on a sample dataset. Here are my observations:
Type 1: 5.54s, 2.42s, 1.77s, 1.76s, 1.76s, 1.75s
Type 2: 5s, 1.77s, 1s, 1.75s, 2s, 1.75s
Type 3: is an invalid SQL statement, as you cannot use TOP and LIMIT in the same query
As you can observe, the results are the same for both the queries as both undergo internal optimization by the query engine.
Apparently both TOP and LIMIT do a similar job, so you shouldn't be worrying about which one to use.
More important is the design of your underlying table, especially if you are using WHERE and JOIN clauses. In that case, you should carefully choose your SORTKEY and DISTKEY, which will have much more impact on the performance of Amazon Redshift than a simple syntactical difference like TOP vs LIMIT.
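As a purely illustrative sketch (the column types and the choice of distribution key are assumptions, not taken from the question), a table laid out for timestamp-ordered reads might be declared like this:
CREATE TABLE tab1 (
    a           VARCHAR(64),
    b           VARCHAR(64),
    c           VARCHAR(64),
    d           VARCHAR(64),
    e           VARCHAR(64),
    "timestamp" TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (a)
SORTKEY ("timestamp");
With the data already sorted on "timestamp", the ORDER BY in the queries above becomes much cheaper.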

Joining inputs for a complicated output

I'm new to Azure Stream Analytics. I'm using it to get feedback from users. I'm sending about 50 events per second to Azure, and I'm trying to get a combined result from two inputs but couldn't get a working output. My problem is the SQL query for the output.
These are the inputs I'm currently sending:
Recommandations:
{"appId":"1","sequentialId":"28","ItemId":"1589018","similaristyValue":"0.104257207028537","orderId":"0"}
ShownLog:
{"appId":"1","sequentialId":"28","ItemId":"1589018"}
I need to join them on sequentialId and ItemId and calculate the difference between the two ordered sums for each sequentialId.
For example: I send 10 Recommandations events and after that (say 2 seconds later) I send 3 ShownLog events. What I need to do is get the sum of the first 3 events' similaristyValue (because I sent 3 ShownLog events), ordered by orderId, from Recommandations. I also need to get the sum of similarityValues from ShownLog. At the end I need an output like this (for every sequentialId):
sequentialID Difference
168 1.21
What I've done so far: I save all the inputs into my Azure SQL database, and there I've managed to write the SQL I want. You can find the MSSQL query below:
declare @sumofSimValue float;
declare @totalItemCount int;
declare @seqId float;
select
    @sumofSimValue = sum(b.[similarityValue]),
    @totalItemCount = count(*),
    @seqId = a.sequentialId
from EventHubShownLog a
inner join EventHubResult b on a.sequentialId = b.sequentialId and a.ItemId = b.ItemId
group by a.sequentialId
--select @sumofSimValue, @totalItemCount, @seqId
SELECT @seqId, SUM([similarityValue]) - @sumofSimValue
FROM (
    SELECT TOP(@totalItemCount) [similarityValue]
    FROM [EventHubResult] where sequentialId = @seqId order by orderId
) AS T
But it gives lots of errors in Stream Analytics, and it doesn't fit the logic of Azure Stream Analytics. I hope I have described the problem clearly.
Can you tell me how I can do such a job for my system? How can I use the time windows, or how can I join them properly?
For every shown log, you have to select the sum of the similarity values. Is that the intention? Why not just join and select the sum? It would only select as many rows as there are shown logs.
One thing to decide is the maximum time difference between recommendation events and shown-log events; with that you can use an Azure Stream Analytics join, https://msdn.microsoft.com/en-us/library/azure/dn835026.aspx
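A rough sketch of such a join, assuming the inputs are named Recommandations and ShownLog, that events are timestamped by their enqueue time, and that a 10-second maximum gap and aggregation window are acceptable (all assumptions):
SELECT
    r.sequentialId,
    SUM(CAST(r.similaristyValue AS float)) AS shownSimilaritySum
FROM Recommandations r TIMESTAMP BY EventEnqueuedUtcTime
JOIN ShownLog s TIMESTAMP BY EventEnqueuedUtcTime
    ON r.sequentialId = s.sequentialId
   AND r.ItemId = s.ItemId
   AND DATEDIFF(second, r, s) BETWEEN 0 AND 10
GROUP BY r.sequentialId, TumblingWindow(second, 10)
This gives, per sequentialId, the sum of similarity values for the recommendations that were actually shown; the difference against the ordered top-N of all recommendations would still need a second aggregation over the Recommandations input.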