Select events with a maximum in a sliding window - WSO2

I have this stream:
define stream locationStream (cell string, device string, power long);
From this stream, with a sliding window of 10 seconds, I want to select for every device the value of the 'cell' attribute for which 'power' is the largest.
What queries should I use to get this result with Siddhi? Something like:
from locationStream#window.time(10 seconds)
select max(power), device, <cell where power = max(power)>
group by device
insert all events into cellStream

You can use the Siddhi maxByTime window offered through the extrema extension (see the extension's documentation for usage). You will have to use it with a partition to get the per-device max. The suggested query should look like the one below:
partition with ( device of locationStream )
begin
from locationStream#extrema:maxByTime(power, 10 sec)
select power, device, cell
insert into cellStream
end;

Related

Kinesis Analytics SQL query to narrow down the sensors that are not sending data

Context: We use Kinesis Analytics to process our sensor data and find anomalies in it.
Goal: We need to identify the sensors that didn't send data for the past X minutes.
The following methods have been tried with Kinesis Analytics SQL, but with no luck:
The Stagger Window technique works for the first 3 test cases, but doesn't work for test case 4.
CREATE OR REPLACE PUMP "STREAM_PUMP_ALERT_DOSCONNECTION" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM"
SELECT STREAM
    "deviceID" as "device_key",
    count("deviceID") as "device_count",
    ROWTIME as "time"
FROM "INTERMEDIATE_SQL_STREAM_FOR_ROOM"
WINDOWED BY STAGGER (
    PARTITION BY "deviceID", ROWTIME RANGE INTERVAL '1' MINUTE);
The left join and group by queries mentioned below don't work for all the test cases.
Query 1:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND Input1.ROWTIME <> Input2.ROWTIME;
Query 2:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND TSDIFF(Input1, Input2) > 0;
Query 3:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND Input1.ROWTIME = Input2.ROWTIME;
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP2" AS
INSERT INTO "DIS_CONN_DEST_SQL_STREAM_ALERT"
SELECT STREAM "device_key", "count"
FROM (
SELECT STREAM
"device_key",
COUNT(*) as "count"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM2
GROUP BY FLOOR(INTERMEDIATE_SQL_STREAM_FOR_ROOM2.ROWTIME TO MINUTE), "device_key"
)
WHERE "count" = 1;
Here are test cases for your reference:
Test case 1:
Device 1 and Device 2 send data continuously to Kinesis Analytics.
After X minutes, Device 2 continues to send data, but Device 1 does not.
Output for test case 1:
We want the Kinesis Analytics to pop out Device 1, so that we know which device is not sending data.
Test case 2 (Interval - 10 minutes):
Device 1 sends data at 09:00
Device 2 sends data at 09:02
Device 2 again sends the data at 09:11, but Device 1 doesn’t send any data.
Output for test case 2:
We want the Kinesis Analytics to pop out Device 1, so that we know which device is not sending data.
Test case 3 (Interval - 10 minutes):
Device 1 and Device 2 send data continuously to Kinesis Analytics.
Then both devices (1 & 2) don't send any data for the next 15 minutes.
Output for test case 3:
We want the Kinesis Analytics to pop out Device 1 & Device 2, so that we know which devices are not sending data.
Test case 4 (Interval - 10 minutes):
Device 1 sends data at 09:00
Device 2 sends data at 09:02
Device 1 again sends data at 09:04
Device 2 again sends data at 09:06
Then no data is sent.
Output for test case 4:
We want the analytics to pop out Device 1 at 09:14 and Device 2 at 09:16, so that we can get the disconnected devices (i.e. devices not sending data) after the exact interval.
Note: AWS Support directed us to simple queries that don't answer the question. Looks like they can help with the exact query only if we upgrade our support plan.
I'm not familiar with all of the ways in which AWS has extended or modified Apache Flink, but open source Flink doesn't provide a simple way to detect that all sources have ceased to send data. One solution is to use something like a process function with processing-time timers to detect the absence of data.
The documentation has an example of something along these lines: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/#example
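As a rough illustration of that approach, here is a minimal sketch of a Flink KeyedProcessFunction that uses processing-time timers to flag devices that go quiet. The Reading event type, its getDeviceId() accessor, and the 10-minute timeout are assumptions for the example, not part of the Kinesis Analytics setup above.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by device id; Reading is a placeholder for your sensor event type.
public class InactivityAlertFunction extends KeyedProcessFunction<String, Reading, String> {

    private static final long TIMEOUT_MS = 10 * 60 * 1000L; // 10 minutes (assumed interval)

    // Holds the timestamp of the currently registered timer for this key.
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        timerState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("inactivity-timer", Long.class));
    }

    @Override
    public void processElement(Reading reading, Context ctx, Collector<String> out) throws Exception {
        // A reading arrived for this device: cancel the pending timer (if any)
        // and schedule a new one TIMEOUT_MS from now.
        Long pending = timerState.value();
        if (pending != null) {
            ctx.timerService().deleteProcessingTimeTimer(pending);
        }
        long nextTimer = ctx.timerService().currentProcessingTime() + TIMEOUT_MS;
        ctx.timerService().registerProcessingTimeTimer(nextTimer);
        timerState.update(nextTimer);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // The timer was never cancelled, so this device has been silent for
        // TIMEOUT_MS: emit its key as a "disconnected" alert.
        out.collect(ctx.getCurrentKey());
        timerState.clear();
    }
}
Applied as readings.keyBy(r -> r.getDeviceId()).process(new InactivityAlertFunction()), this emits each device id roughly ten minutes after its last reading, which is the behaviour described in test case 4.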

Power Query Excel. Read and store variable from worksheet

I'm sorry for the poor title, but I can't formulate a better one at the moment.
This is the process:
Get the message id (aka offset) from a sheet.
Get the list of updates from Telegram based on this offset.
Store the last message id, which will be used as the offset in step 1.
Process the updates.
The offset allows you to fetch only updates received after that offset.
E.g. I get 10 messages the first time and the last message id (aka offset) is 100500. Before the 2nd run the bot received 10 additional messages, so there are 20 in total. To avoid loading all 20 messages (10 of which I have already processed), I need to specify the offset from the 1st run, so the API will return only the last 10 messages.
The PQ query is run in Excel.
let
// Offset is number read from table in Excel. Let's say it's 10.
Offset = 10,
// Returns list of messages as JSON objects.
Updates = GetUpdates(Offset),
// This update_id will be used as offset in next query run.
LastMessageId = List.Last(Updates)[update_id],
// Map Process function to each item in the list of update JSON objects.
Map = List.Transform(Updates, each Process(_))
in
Map
The issue is that I need to store/read this offset number each time the query is executed AND process the updates.
Because of lazy evaluation, with the code above I can either output LastMessageId from the query or output the result of the Map function.
The question is: how can I do both things, i.e. store/load LastMessageId from Updates and also process those updates?
Thank you.

Siddhi check if an event does not arrive within a specified time window?

I am using CEP to check if an event has arrived within a specified amount of time (let's say 1 minute). If not, I want to publish an alert.
More specifically, a (server) machine generates a heartbeat data stream and sends it to CEP. The heartbeat stream contains the server id and a timestamp. An alert should be generated if no heartbeat data arrives within the 1-minute period.
Is it possible to do something like that with CEP? I have seen other questions regarding the detection of non-occurrences, but I am still not sure how to approach the scenario described above.
You can try this:
define stream heartbeats (serverId string, timestamp long);
from heartbeats#window.time(1 minute) insert expired events into delayedStream;
from every e = heartbeats -> e2 = heartbeats[serverId == e.serverId]
or expired = delayedStream[serverId == e.serverId]
within 1 minute
select e.serverId, e2.serverId as id2, expired.serverId as id3
insert into tmpStream;
// every event on tmpStream with an 'expired' match has timed out
from tmpStream[id3 is not null]
select serverId
insert into expiredHeartbeats;

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. Scenario is I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:
ID | TS | EventType
1 | 1 | Create
1 | 2 | Comment
2 | 2 | Create
1 | 4 | Comment
And I want to end up with a stream coming from Dataflow that looks like:
ID | TS | num_comments
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
1 | 4 | 2
I want the job that does this rollup to run as a streaming process, with new counts being populated as new events come in. My question is: where is the idiomatic place for the job to store the state for the current topic ids and comment counts, assuming that topics can live for years? Current ideas are:
Write a 'current' entry for the topic id to Bigtable, and in a DoFn query what the current comment count is for the incoming topic id. Even as I write this I'm not a fan.
Use side inputs somehow? It seems like maybe this is the answer, but if so I'm not totally understanding how.
Set up a streaming job with a global window, with a trigger that fires every time it gets a record, and rely on Dataflow to keep the entire pane history somewhere. (An unbounded storage requirement?)
EDIT: Just to clarify, I wouldn't have any trouble implementing any of these three strategies, or a million other ways of doing it; I'm more interested in what the best way of doing it with Dataflow is. What will be most resilient to failure, to having to re-process history for a backfill, etc.?
EDIT 2: There is currently a bug in the Dataflow service where updates fail when adding inputs to a Flatten transformation, which means you'll need to discard and rebuild any state accrued in the job if a change to the job involves adding something to a Flatten operation.
You should be able to use triggers and a combine to accomplish this.
PCollection<ID> comments = /* IDs from the source */;
PCollection<KV<ID, Long>> commentCounts = comments
    // Produce speculative results by triggering as data comes in.
    // Note that this won't trigger after *every* element, but it will
    // trigger relatively quickly (as the system divides incoming data
    // into work units). You could also throttle this with something
    // like:
    //   AfterProcessingTime.pastFirstElementInPane()
    //       .plusDelayOf(Duration.standardMinutes(5))
    // which will produce output every 5 minutes.
    .apply(Window.triggering(
            Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    // Count the occurrences of each ID.
    .apply(Count.perElement());

// Produce an output String -- in your use case you'd want to produce
// a row and write it to the appropriate sink.
commentCounts.apply(ParDo.of(new DoFn<KV<ID, Long>, String>() {
  @Override
  public void processElement(ProcessContext c) {
    KV<ID, Long> element = c.element();
    // c.pane() includes details about the pane of the window being
    // processed, including a strictly increasing index of the
    // number of panes that have been produced for the key.
    PaneInfo pane = c.pane();
    c.output(element.getKey() + " | " + pane.getIndex() + " | " + element.getValue());
  }
}));
Depending on your data, you could also read whole comments from the source, extract the ID, and then use Count.perKey() to get the counts for each ID. If you want a more complicated combination, you could look at defining a custom CombineFn and using Combine.perKey.
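For example, a minimal sketch of that variant, assuming a hypothetical Comment type with a getTopicId() accessor (neither is part of the answer above):
// Read whole comments, key them by topic ID, and count per key.
PCollection<Comment> fullComments = /* whole comments from the source */;
PCollection<KV<String, Long>> countsPerTopic = fullComments
    // Key each comment by its (assumed String) topic ID.
    .apply(WithKeys.of(new SerializableFunction<Comment, String>() {
      @Override
      public String apply(Comment comment) {
        return comment.getTopicId();   // hypothetical accessor
      }
    }))
    // Count the comments observed for each topic ID; the same triggering as
    // above would still be applied to get speculative per-pane results.
    .apply(Count.<String, Comment>perKey());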
Since BigQuery does not support overwriting rows, one way to go about this is to write the events to BigQuery, and query the data using COUNT:
SELECT ID, COUNT(num_comments) from Table GROUP BY ID;
You can also do per-window aggregations of num_comments within Dataflow before writing the entries to BigQuery; the query above will continue to work.
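As a rough sketch of that pre-aggregation (assuming String IDs, one-minute fixed windows, and a made-up table name; schema and write-disposition settings are omitted):
// Count comments per ID within fixed one-minute windows, then convert each
// count to a BigQuery row. Field names and the table reference are assumptions.
PCollection<TableRow> rows = comments
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Count.<String>perElement())
    .apply(ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(new TableRow()
            .set("ID", c.element().getKey())
            .set("num_comments", c.element().getValue()));
      }
    }));
rows.apply(BigQueryIO.Write.to("my-project:my_dataset.comment_counts"));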

WSO2 CEP Siddhi Queries

I am new to Siddhi CEP. Other than the regular docs on WSO2 CEP, can someone point to a good tutorial?
Here are our requirements. Please point out some clues on the right way of writing such queries.
We have a single stream of sensor device notifications (an IoT application).
Stream input is via REST-JSON and output is also to be formatted as REST-JSON. (Hope this is possible on WSO2 CEP 3.1.)
The kind of execution plan required:
- If a device notification reports usage of Sensor 1, then monitor whether within 5 minutes a device notification reports usage of Sensor 2 as well. If found, generate an output stream reporting the composite-activity back over REST-JSON.
- If such composite-activity is not detected during a time slot in the morning, afternoon, and evening, generate a warning-event-stream status over REST-JSON. (So how do we find events which did not occur in time?)
- If such composite-activity is not found within some time slots in the morning, afternoon, and evening, report a failure1-event-stream status back over REST-JSON.
This should work day after day, so how will the previously processed data get deleted in WSO2 CEP?
Regards,
Amit
The queries can be as follows (these are draft queries and may require slight modifications to get them running).
To detect sensor 1, and then sensor 2 within 5 minutes (assuming sensorStream has sensorId and value), you can simply use a pattern like the following with the 'within' keyword:
from e1=sensorStream[sensorId == '1'] -> e2=sensorStream[sensorId == '2']
within 5 minutes
select 'composite activity detected' as description, e1.value as sensor1Value, e2.value as sensor2Value
insert into compositeActivityStream;
To detect non-occurrences (id=1 arrives, but no id=2 within 5 minutes) we can have the following two queries:
from sensorStream[sensorId == '1']#window.time(5 minutes)
select *
insert into delayedSensor1Stream for expired-events;
from e1=sensorStream[sensorId == '1'] -> nonOccurringEvent = sensorStream[sensorId == '2'] or delayedEvent=delayedSensor1Stream
select 'id=2 not found' as description, e1.value as id1Value, nonOccurringEvent.sensorId as nonOccurringId
having (not(nonOccurringId instanceof string))
insert into nonOccurrenceStream;
This will detect non-occurrences immediately at the end of the 5 minutes after the arrival of the id=1 event.
For an explanation of the above logic, have a look at the non-occurrence sample of CEP 4.0.0 (the syntax is a bit different, but it is the same idea).
Now, since you need to periodically generate a report, we need another query. For convenience I assume you need a report every 6 hours (360 minutes) and use a time batch window here. Alternatively, with the new CEP 4.0.0 you can use the 'Cron window' to generate this at specific times, which suits your use case better.
from nonOccurrenceStream#window.timeBatch(360 minutes)
select count(id1Value) as nonOccurrenceCount
insert into nonOccurrenceReportsStream for expired-events;
You can use HTTP input/output adaptors and do JSON mappings with JSON builders and formatters for this use case.