Joining inputs for a complicated output - azure-eventhub

I'm new to Azure analytics. I'm using it to collect feedback from users. I send about 50 events per second to Azure, and I'm trying to get a combined result from two inputs, but I couldn't get a working output. My problem is with the SQL query for the output.
These are the inputs I'm sending now.
Recommandations:
{"appId":"1","sequentialId":"28","ItemId":"1589018","similaristyValue":"0.104257207028537","orderId":"0"}
ShownLog:
{"appId":"1","sequentialId":"28","ItemId":"1589018"}
I need to join them on sequentialId and ItemId and calculate the difference between two ordered sums for each sequentialId.
For example: I send 10 Recommandations events, and after that (about 2 seconds later) I send 3 ShownLog events. Because I sent 3 ShownLog events, I need the sum of the similarityValue of the first 3 Recommandations events, ordered by orderId. I also need the sum of the similarity values of the items in ShownLog. In the end I need an output like this (for every sequentialId):
sequentialID Difference
168 1.21
What I've done so far: I save all the inputs to my Azure SQL database, and there I've managed to write the SQL I want. Here is the MSSQL query:
declare @sumofSimValue float;
declare @totalItemCount int;
declare @seqId float;
select
    @sumofSimValue = sum(b.[similarityValue]),
    @totalItemCount = count(*),
    @seqId = a.sequentialId
from EventHubShownLog a
inner join EventHubResult b on a.sequentialId = b.sequentialId and a.ItemId = b.ItemId
group by a.sequentialId
--select @sumofSimValue, @totalItemCount, @seqId
SELECT @seqId, SUM([similarityValue]) - @sumofSimValue
FROM (
    SELECT TOP(@totalItemCount) [similarityValue]
    FROM [EventHubResult] where sequentialId = @seqId order by orderId
) AS T
But it gives lots of errors in analytics, and it doesn't follow the logic of Azure analytics. I hope I've explained the problem clearly.
Can you tell me how I can do this in my system? How can I use the time windows, and how can I join the inputs properly?

So for every shown log you want to select the sum of the similarity values. Is that the intention? Why not just join and select the sum? The join would only return as many rows as there are shown logs.
One thing to decide is the maximum time difference between Recommandations events and ShownLog events. With that, you can use an Azure Stream Analytics join: https://msdn.microsoft.com/en-us/library/azure/dn835026.aspx
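To make the intended arithmetic concrete, here is a minimal sketch in plain Python (not a Stream Analytics query). It assumes both inputs have already been collected for the chosen time bound, that the score field is named similarityValue as in the SQL above, and that the function name is purely illustrative. It mirrors the MSSQL logic: the sum of the first N recommendations by orderId minus the sum of the recommendations that were actually shown.

```python
from collections import defaultdict


def similarity_difference(recommendations, shown_logs):
    """Per sequentialId: sum of the first N scores (ordered by orderId)
    minus the sum of the scores of the shown items, where N is the number
    of recommendations that match a ShownLog event."""
    recs_by_seq = defaultdict(list)
    for rec in recommendations:
        recs_by_seq[rec["sequentialId"]].append(rec)

    shown_by_seq = defaultdict(set)
    for log in shown_logs:
        shown_by_seq[log["sequentialId"]].add(log["ItemId"])

    result = {}
    for seq_id, shown_items in shown_by_seq.items():
        recs = sorted(recs_by_seq.get(seq_id, []), key=lambda r: int(r["orderId"]))
        # Equivalent of the inner join on sequentialId and ItemId.
        matched = [r for r in recs if r["ItemId"] in shown_items]
        top_n = sum(float(r["similarityValue"]) for r in recs[:len(matched)])
        shown = sum(float(r["similarityValue"]) for r in matched)
        result[seq_id] = top_n - shown
    return result


# Hypothetical sample data, for illustration only.
recs = [
    {"sequentialId": "28", "ItemId": "a", "similarityValue": "0.5", "orderId": "0"},
    {"sequentialId": "28", "ItemId": "b", "similarityValue": "0.25", "orderId": "1"},
    {"sequentialId": "28", "ItemId": "c", "similarityValue": "0.125", "orderId": "2"},
]
shown = [
    {"sequentialId": "28", "ItemId": "a"},
    {"sequentialId": "28", "ItemId": "c"},
]
differences = similarity_difference(recs, shown)
```

In a streaming job, the same calculation would run over whichever events fall inside the maximum time difference you pick for the join.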

Related

AWS CloudWatch: visualise more time series

I would like to visualise how event occurrences change over time.
Use case:
Let's say my logs contain 2 types of events (eventA, eventB).
I'm interested in a line graph that shows the number of events per hour. (line#1: dataA1, dataA2... ; line#2: dataB1, dataB2...)
What I'm aware of:
Query the logs: fields @timestamp, eventName | stats count() by bin(1h), eventName | sort bin(1h) asc
The above query gives all the data needed for the desired graph (eg: [bin(1h)], [count()], [eventName])
If I remove the eventName field from the display I get a log table with the correct data, but the line graph shows the data points of both events mixed into one line (eg: dataA1, dataA2, dataB3, dataA4, dataB5)
The question:
Is it possible to generate a line graph with more series in it?
If yes, what parametrization do I need?
See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_Insights-Visualizing-Log-Data.html
Visualizing time series data
Time series visualizations work for queries with the following characteristics:
The query contains one or more aggregation functions. For more information, see Aggregation Functions in the Stats Command.
The query uses the bin() function to group the data by one field.
These queries can produce line charts, stacked area charts, bar charts, and pie charts.
You can't use a line chart for your example, because a time series can only be produced from a single bin() grouping. You can, however, use e.g. a pie chart for your use case.
Alternatively, if it suits your use case, you can start producing logs in a different format, such as
{
"eventA": 1,
"eventB": 0
}
Then you can write the query as
stats sum(eventA), sum(eventB) by bin(1h)
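As a rough illustration of what that reshaped result looks like, the following plain Python sketch buckets events by hour and keeps one count per event name, which is the shape `stats sum(eventA), sum(eventB) by bin(1h)` would return. The function name and the sample timestamps are made up for the example, and nothing here calls CloudWatch.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """events: iterable of (iso_timestamp, event_name).
    Returns {hour_start: Counter({event_name: count})}."""
    buckets = {}
    for ts, name in events:
        # Truncate the timestamp to the start of its hour, like bin(1h).
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, Counter())[name] += 1
    return buckets


# Hypothetical log events, for illustration only.
sample = [
    ("2019-05-15T01:29:26", "eventA"),
    ("2019-05-15T01:45:00", "eventB"),
    ("2019-05-15T01:50:00", "eventA"),
    ("2019-05-15T02:05:00", "eventB"),
]
counts = hourly_counts(sample)
```

Each hour then carries a separate value per event, which is what a chart with one line per event needs.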

AWS IoT Analytics Delta Window

I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore and if I don't include any delta window I get the records as expected. Unfortunately when I include a delta expression I get nothing (ever). I left the data accumulating for about 10 days now so there are a couple of million records in the database. Given that I was unsure what the optimal format would be I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt, which uses the same format as my datetime except that it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00)
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (with no delta time filter) with the correct values for the offset and time expression, and see what you get. The (__dt between ...) part is just an optimization that limits the scanned partitions. You can remove it for the purposes of troubleshooting.
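The time condition in that query can also be checked by hand. The sketch below is plain Python, not IoT Analytics code; the function name is hypothetical and the two schedule times are invented for the example. It applies the same comparison as the filter above: a row is kept when its timestamp lies after the last successful run plus the offset, and at or before the current run plus the offset.

```python
from datetime import datetime, timedelta, timezone


def in_delta_window(event_time, last_run, current_run, offset):
    """Keep rows with last_run + offset < event_time <= current_run + offset.
    A -3 minute offset is passed as timedelta(minutes=-3)."""
    return last_run + offset < event_time <= current_run + offset


# Assumed schedule times, for illustration only.
last_run = datetime(2019, 5, 15, 1, 0, tzinfo=timezone.utc)
current_run = datetime(2019, 5, 15, 2, 0, tzinfo=timezone.utc)

# timestamp_sec value from the question, interpreted as UTC.
event = datetime.fromtimestamp(1557883766, tz=timezone.utc)
kept = in_delta_window(event, last_run, current_run, timedelta(minutes=-3))
```

If a row you expect to see fails this check with your real schedule times, the offset or the time expression is the likely culprit rather than the query itself.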
Please try the following:
1. Set the query to SELECT * FROM dev_iot_analytics_datastore
2. Set the data selection filter:
   Data selection window: Delta time
   Offset: -1 Hours
   Timestamp expression: from_unixtime(timestamp_sec)
3. Wait for the dataset content to run for a bit, say 15 minutes or more.
4. Check the contents.
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack and rebuilt everything with different names, and it now seems to be working!
Note that although I have flagged this as the correct answer because it was the actual resolution, the answers provided by @Populus and @Roger would both have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.

How to get only the latest row from a window

I am working with Kinesis Analytics and I am trying to understand how to write my application to give me a sliding window over 24 hours. What I have generates the right data, but it looks like it regenerates it every time. That might be what it's supposed to do, and my own ignorance may be keeping me from looking at the problem the right way.
What I want to do:
I have a few devices that feed a Kinesis Stream, which this Kinesis analytics application is hooked up to.
Now, when a record comes in, what I want to do is SUM a value over the last 24 hours and store that. So after Kinesis Analytics does its job, I'm connecting it to a Lambda to finalize some things.
My issue is that when I simulate sending in some data (5 records in this case), everything runs, but it runs multiple times, not 5. It LOOKS like each time a record comes in it redoes everything in the window (expected), which triggers the Lambda for each row that's emitted. As the table grows, that's bad news. What I really want is just the latest value from the window from NOW - 24 HOUR, with the "id" field so I can join that "id" back to a record stored elsewhere.
My Application looks like this:
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
    "id" VARCHAR(64),
    "timestamp_mark" TIMESTAMP,
    "device_id" VARCHAR(64),
    "property_a_id" VARCHAR(64),
    "property_b_id" VARCHAR(64),
    "value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM "id",
    "timestamp_mark",
    "device_id",
    "x_id",
    "y_id",
    SUM("value") OVER W1 AS "value"
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
    PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
    RANGE INTERVAL '24' HOUR PRECEDING
);
Hmm, this might be a better idea: do the aggregation in a sub-select and select from that. It looks like I need the second window (W2 below) to ensure that each incoming record is emitted again.
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
    "id" VARCHAR(64),
    "timestamp_mark" TIMESTAMP,
    "device_id" VARCHAR(64),
    "property_a_id" VARCHAR(64),
    "property_b_id" VARCHAR(64),
    "value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM s."id",
    s."timestamp_mark",
    s."device_id",
    s."property_a_id",
    s."property_b_id",
    v."value"
FROM "SOURCE_SQL_STREAM_001" OVER W2 AS s, (
    SELECT STREAM "SOURCE_SQL_STREAM_001"."ROWTIME", "id",
        "timestamp_mark",
        "device_id",
        "property_a_id",
        "property_b_id",
        SUM("value") OVER W1 AS "value"
    FROM "SOURCE_SQL_STREAM_001"
    WINDOW W1 AS (
        PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
        RANGE INTERVAL '24' HOUR PRECEDING
    )
) AS v
WHERE s."id" = v."id"
WINDOW W2 AS (
    RANGE INTERVAL '1' SECOND PRECEDING
);
I also notice that if I restart the Kinesis Analytics application, the SUM values reset, so clearly the window state doesn't persist across restarts, which might make it unsuitable for this solution. I might have to just set up a SQL server and periodically delete old records.
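For reference, the behaviour being asked for (one running total per incoming record, covering the previous 24 hours for that record's partition) can be sketched in plain Python. This is not Kinesis Analytics SQL; the class name is hypothetical, records are assumed to arrive in time order, and the state lives in memory, so it is lost on restart just as described above.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class SlidingSum:
    """Running sum per key over a trailing time window."""

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self.rows = defaultdict(deque)    # key -> deque of (time, value)
        self.totals = defaultdict(float)  # key -> current sum

    def add(self, key, when, value):
        """Add one record and return the latest windowed sum for its key."""
        rows = self.rows[key]
        rows.append((when, value))
        self.totals[key] += value
        # Evict rows that fell out of the window (assumes in-order arrival).
        while rows and rows[0][0] < when - self.window:
            _, old_value = rows.popleft()
            self.totals[key] -= old_value
        return self.totals[key]


# Hypothetical device key and values, for illustration only.
sums = SlidingSum()
start = datetime(2019, 5, 15, 0, 0)
key = ("device-1", "prop-a", "prop-b")
first = sums.add(key, start, 1.0)
second = sums.add(key, start + timedelta(hours=12), 2.0)
third = sums.add(key, start + timedelta(hours=25), 4.0)
```

Each call returns exactly one value, so a downstream consumer is triggered once per incoming record rather than once per row in the window.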
In general, Streaming Analytics solutions (and Kinesis Analytics in particular) are recommended when you need to act on the data in the events, not on something external such as wall clock time.
The reason is simple: if you need to do something once every 24h, you create a job that brings the data from storage (a DB) once, performs the task, and then "goes to sleep" for another 24h. There are no complexities and the overhead is manageable. If instead you need to do something based on the data (e.g. when the SUM of some field across multiple events exceeds X), a conventional solution is in trouble, because there is no simple criterion for when it should run. If you run it periodically, it might be invoked many times before the data-driven criterion is met, which creates clear overhead.
In the latter case, a Streaming Analytics solution is used as designed: it triggers your logic only when needed, which minimizes the overhead.
If you prefer Streaming Analytics (which I personally don't recommend, based on the description of your problem) but are struggling with the Kinesis Analytics syntax, you might consider Drools Kinesis Analytics. Among its features are crons and collectors, which give you a very simple way to trigger jobs on a time basis.
Note that my answer is biased, since I'm a CTO at Streamx.

Stream Analytics Output

I have a project that uses an event hub to receive data, which is sent every second. The data is received by a website using SignalR, and this all works fine. I have been storing the data in blob storage via a Stream Analytics job, but it is really slow to access, and with the amount of data I am receiving from just 6 devices, it will get even slower as the number of devices increases. I need to access the data to display historical data in graphs on the website, which are then topped up with the live data coming in.
I don't really need to store the data every second, so I thought about storing it only every 30 seconds instead, but in a SQL DB. What I am trying to do is still receive the data every second but only store it every 30 seconds. I have tried a tumbling window, but from what I can see, this just dumps everything every 30 seconds instead of a single entry.
Am I misunderstanding the Tumbling, Sliding and Hopping windows? I am guessing I cannot use them in this way. If that is the case, I am guessing the only way to do it would be to use the output DB as an input, so I can cross-reference the timestamp with the current time?
Unless anyone has any other ideas? Any help would be appreciated.
Thanks
"Am I misunderstanding the Tumbling, Sliding and Hopping windows?"
You are correct that a window puts all events within the Tumbling/Sliding/Hopping window together. However, windows are only valid in a GROUP BY clause, which requires an aggregate function over the group.
There is an aggregate function, Collect(), which creates an array of the events within a group.
I think this should be possible if you group the events in a 30 second tumbling window using Collect() and then, in the next step, CROSS APPLY each record, which should output all events received within the 30 seconds.
WITH Grouper AS (
    SELECT Collect() AS records
    FROM Input TIMESTAMP BY time
    GROUP BY TumblingWindow(second, 30)
)
SELECT
    record.ArrayValue.FieldA AS FieldA,
    record.ArrayValue.FieldB AS FieldB
INTO Output
FROM Grouper
CROSS APPLY GetArrayElements(Grouper.records) AS record
If you are trying to aggregate 30 entries into one summary row every 30 seconds then a tumbling window is a good choice. Something like the following should work:
SELECT System.TimeStamp AS OutTime, TollId, COUNT(*) as cnt, sum(TollCharge) as TollCharge
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 30)
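To see what such a query produces, here is a small plain Python sketch of the same grouping: one summary row per TollId per 30 second window, with a count and a total. It is only an illustration, not Stream Analytics code. The function name is hypothetical, timestamps are integer epoch seconds, and treating an event that lands exactly on a boundary as part of the window ending there is an assumption of this sketch.

```python
from collections import defaultdict


def tumbling_window_totals(events, window_seconds=30):
    """events: iterable of (epoch_seconds, toll_id, toll_charge).
    Returns {(window_end, toll_id): (count, total_charge)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for ts, toll_id, charge in events:
        # Round up to the end of the window containing ts (ceiling division).
        window_end = -(-ts // window_seconds) * window_seconds
        entry = totals[(window_end, toll_id)]
        entry[0] += 1
        entry[1] += charge
    return {key: tuple(value) for key, value in totals.items()}


# Hypothetical toll events, for illustration only.
sample = [(1, "A", 2.0), (29, "A", 3.0), (31, "A", 1.0), (10, "B", 5.0)]
summary = tumbling_window_totals(sample)
```

Every event contributes to exactly one window, so the output has one row per group per 30 seconds however many events arrive.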
Thanks for the response. I have been speaking to my contact at Microsoft and he suggested something similar; I had also found something like that in various examples online. What I actually want to do is update the database with the data only every 30 seconds: I receive an event, store it, and do not store another one until 30 seconds have passed. To be honest, I am not sure how to do that with an ASA job, as I need a record of the last time the database was updated. I already have a connection to the event hub from my website, so in the receiver I am going to perform a simple check and then store the data from there.
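A minimal sketch of that kind of receiver-side check, in plain Python, could look like the following. The class name, the `save` callback, and the use of numeric timestamps in seconds are all assumptions made for the illustration; the real receiver would plug in its own storage call.

```python
class ThrottledStore:
    """Stores at most one event per device per interval."""

    def __init__(self, save, interval_seconds=30.0):
        self.save = save                  # hypothetical storage callback
        self.interval = interval_seconds
        self.last_saved = {}              # device_id -> time of last stored event

    def handle(self, device_id, timestamp, payload):
        """Store the event unless this device was stored too recently.
        Returns True if the event was stored."""
        last = self.last_saved.get(device_id)
        if last is not None and timestamp - last < self.interval:
            return False
        self.save(device_id, timestamp, payload)
        self.last_saved[device_id] = timestamp
        return True


# Stand-in for the database write, for illustration only.
saved = []
store = ThrottledStore(lambda device, ts, data: saved.append((device, ts, data)))
results = [
    store.handle("dev1", 0, {"v": 1}),
    store.handle("dev1", 10, {"v": 2}),
    store.handle("dev1", 30, {"v": 3}),
    store.handle("dev2", 10, {"v": 4}),
]
```

Keeping the last stored time per device means every event can still be forwarded to the live view while only one in each 30 second span reaches the database.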

WSO2 cep: How to deal with multiple joins?

I'm trying to solve a simple task:
1. I want to correlate the occurrence of 3 events A, B, C if they all happen within the last 10 seconds.
Since Siddhi supports joining only 2 streams in a query, I think I'm not able to solve it directly. The documentation suggests using multiple queries and joining them together like this
from A#window.time(10 sec) as a
join B#window.time(10 sec) as b on a.id == b.id
select a.id
insert into tempA
from tempA#window.time(10 sec) as a
join C#window.time(10 sec) as c on c.id == a.id
select *
insert into finalResult
But this produces wrong results, because data in the stream tempA can live longer: the time windows are not aligned.
Maybe I'm missing something. Any advice?
Thanks
To solve this, you can try the following approach:
For each incoming event, add a timestamp. (You can also do this in your client instead of in CEP.)
Replace the time windows with external time windows
Use the previously added timestamp field as the reference of time for external time windows
Since the timestamps will be global in this case and all the external time windows will be operating according to them, this should work properly.
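The effect of using one shared event timestamp can be illustrated outside Siddhi. The plain Python sketch below is not Siddhi code, and the function name and sample data are made up. It reports an id only when an A, a B, and a C event for that id have timestamps that all fall within the same 10 second span, no matter when each event was processed.

```python
from collections import defaultdict
from itertools import product


def correlate(a_events, b_events, c_events, window_seconds=10):
    """Each argument is a list of (id, event_timestamp_seconds).
    Returns the ids with one A, one B and one C event whose timestamps
    are at most window_seconds apart."""
    by_id = defaultdict(lambda: ([], [], []))
    for index, stream in enumerate((a_events, b_events, c_events)):
        for event_id, ts in stream:
            by_id[event_id][index].append(ts)

    matched = set()
    for event_id, (a_ts, b_ts, c_ts) in by_id.items():
        # Brute-force check of every A/B/C combination for this id.
        for triple in product(a_ts, b_ts, c_ts):
            if max(triple) - min(triple) <= window_seconds:
                matched.add(event_id)
                break
    return matched


# Hypothetical events, for illustration only.
a = [("x", 0), ("y", 0)]
b = [("x", 4), ("y", 8)]
c = [("x", 9), ("y", 15)]
correlated = correlate(a, b, c)
```

Here "y" is rejected because its C event is 15 seconds after its A event, even though each pair of neighbouring events is less than 10 seconds apart, which is the misalignment the chained time windows could not catch.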