I'm new to Siddhi CEP. Other than the standard WSO2 CEP documentation, can someone point me to a good tutorial?
Here are our requirements; any clues on the right way to write such queries are welcome.
We have a single stream of sensor device notifications (an IoT application).
Stream input arrives via REST/JSON and the output must also be formatted as REST/JSON. (Hopefully this is possible on WSO2 CEP 3.1.)
The kind of execution plan required:
- If a device notification reports usage of Sensor 1, monitor whether a notification reporting usage of Sensor 2 also arrives within 5 minutes. If found, generate an output stream reporting the composite-activity back over REST/JSON.
- If such composite-activity is not detected during a time slot in the morning, afternoon or evening, generate a warning-event-stream status over REST/JSON. (So how do we find events that did not occur in time?)
- If such composite-activity is not found within some time slots in the morning, afternoon and evening, report a failure1-event-stream status back over REST/JSON.
This should run day after day, so how does previously processed data get deleted in WSO2 CEP?
Regards,
Amit
The queries can be as follows (these are draft queries and may require slight modifications to get them running).
To detect sensor 1 and then sensor 2 within 5 minutes (assuming sensorStream has sensorId and value), you can simply use a pattern like the following with the 'within' keyword:
from every e1=sensorStream[sensorId == '1'] -> e2=sensorStream[sensorId == '2']
within 5 minutes
select 'composite activity detected' as description, e1.value as sensor1Value, e2.value as sensor2Value
insert into compositeActivityStream;
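If you want to try the pattern outside the CEP server first, a minimal embedded sketch could look roughly like the following. This assumes the newer standalone Siddhi 4.x Java library (not the Siddhi version bundled with WSO2 CEP 3.1, whose syntax differs slightly), and the stream definition and sample values are made up for illustration:

// Minimal embedded test of the composite-activity pattern (assumes Siddhi 4.x).
import org.wso2.siddhi.core.SiddhiAppRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.event.Event;
import org.wso2.siddhi.core.stream.input.InputHandler;
import org.wso2.siddhi.core.stream.output.StreamCallback;

public class CompositeActivityTest {
    public static void main(String[] args) throws Exception {
        String app =
            "define stream sensorStream (sensorId string, value double); " +
            "from every e1=sensorStream[sensorId == '1'] -> e2=sensorStream[sensorId == '2'] " +
            "within 5 min " +
            "select 'composite activity detected' as description, " +
            "e1.value as sensor1Value, e2.value as sensor2Value " +
            "insert into compositeActivityStream;";

        SiddhiManager manager = new SiddhiManager();
        SiddhiAppRuntime runtime = manager.createSiddhiAppRuntime(app);
        runtime.addCallback("compositeActivityStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event e : events) {
                    System.out.println(e); // composite activity detected
                }
            }
        });
        runtime.start();

        InputHandler sensors = runtime.getInputHandler("sensorStream");
        sensors.send(new Object[]{"1", 10.0});
        sensors.send(new Object[]{"2", 20.0}); // arrives within 5 minutes -> match

        Thread.sleep(500);
        runtime.shutdown();
        manager.shutdown();
    }
}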
To detect non-occurrences (sensorId=1 arrives, but no sensorId=2 within 5 minutes), we can use the following two queries:
from sensorStream[sensorId == '1']#window.time(5 minutes)
select *
insert into delayedSensor1Stream for expired-events;
from e1=sensorStream[sensorId == '1'] -> nonOccurringEvent = sensorStream[sensorId == '2'] or delayedEvent=delayedSensor1Stream
select 'id=2 not found' as description, e1.value as id1Value, nonOccurringEvent.sensorId as nonOccurringId
having (not(nonOccurringId instanceof string))
insert into nonOccurrenceStream;
This will detect the non-occurrence immediately at the end of the 5 minutes after the arrival of the sensorId=1 event.
For an explanation of the above logic, have a look at the non-occurrence sample of CEP 4.0.0 (the syntax is a bit different, but it is the same idea).
Now, since you need to periodically generate a report, we need another query. For convenience I assume you need a report every 6 hours (360 minutes) and use a time batch window here. Alternatively, with the new CEP 4.0.0 you can use the 'cron window' to generate this at specific times, which is better for your use case.
from nonOccurrenceStream#window.timeBatch(360 minutes)
select count(id1Value) as nonOccurrenceCount
insert into nonOccurrenceReportsStream for expired-events;
You can use HTTP input/output adapters and do the JSON mappings with JSON builders and formatters for this use case.
I'm trying to send a transaction using the Python lib https://github.com/petertodd/python-bitcoinlib.
We debugged the REST API which publishes the transaction to the service chain.so and it was successful; we can also see the TX here:
https://chain.so/tx/BTCTEST/98d9f2bc3f65b0ca81f775f43c2f48b6ffe29fcfa06779c7ab299709ea7fc639
The DST address is our BitPay testnet wallet,
and the SRC addresses are generated by python-bitcoinlib using wallet.get_key().
We also can't even find the transaction on other services.
We also tried posting the same transaction to bitaps.com, and we can't see it there either.
Can someone give us a clue where to look, what could be wrong, and why the confidence on chain.so is 0%? Or perhaps you can recommend another service?
The code is very straightforward:
# Open (or create) the testnet SegWit wallet and build the transaction offline
wallet = wallet_create_or_open(name='MainWallet', network='testnet', witness_type='segwit')
transaction = wallet.send_to(
    to_address=body.address,
    amount=decimal_to_satoshi(body.amount),
    fee='low', offline=True,
)
# Broadcast the prepared transaction
transaction.send(offline=False)
Found an issue:
I pasted the raw hex into https://tbtc.bitaps.com/broadcast, but it shows an error: "Broadcast transaction failed: min relay fee not met, 111 < 141 (code 66)".
Based on the error, it seems I am sending the tBTC transaction with a fee below the minimum relay fee.
Context: We use Kinesis Analytics to process our sensor data and find anomalies in it.
Goal: We need to identify the sensors that didn't send data for the past X minutes.
The following methods have been tried with Kinesis Analytics SQL, but with no luck:
The stagger window technique works for the first three test cases, but doesn't work for test case 4.
CREATE OR REPLACE PUMP "STREAM_PUMP_ALERT_DOSCONNECTION" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM"
SELECT STREAM
    "deviceID" as "device_key",
    count("deviceID") as "device_count",
    ROWTIME as "time"
FROM "INTERMEDIATE_SQL_STREAM_FOR_ROOM"
WINDOWED BY STAGGER (
    PARTITION BY "deviceID", ROWTIME RANGE INTERVAL '1' MINUTE);
The left join and group by queries below don't work for all the test cases.
Query 1:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key",
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND Input1.ROWTIME <> Input2.ROWTIME;
Query 2:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND TSDIFF(Input1, Input2) > 0;
Query 3:
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS
INSERT INTO "INTERMEDIATE_SQL_STREAM_FOR_ROOM2"
SELECT STREAM
ROWTIME as "resultrowtime",
Input2."device_key" as "device_key"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM
OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS Input1
LEFT JOIN INTERMEDIATE_SQL_STREAM_FOR_ROOM AS Input2
ON
Input1."device_key" = Input2."device_key"
AND Input1.ROWTIME = Input2.ROWTIME;
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP2" AS
INSERT INTO "DIS_CONN_DEST_SQL_STREAM_ALERT"
SELECT STREAM "device_key", "count"
FROM (
SELECT STREAM
"device_key",
COUNT(*) as "count"
FROM INTERMEDIATE_SQL_STREAM_FOR_ROOM2
GROUP BY FLOOR(INTERMEDIATE_SQL_STREAM_FOR_ROOM2.ROWTIME TO MINUTE), "device_key"
)
WHERE "count" = 1;
Here are test cases for your reference:
Test case 1:
Device 1 and Device 2 send data continuously to Kinesis Analytics.
After X minutes, Device 2 continues to send data, but Device 1 stops sending data.
Output for test case 1:
We want Kinesis Analytics to flag Device 1, so that we know which device is not sending data.
Test case 2 (Interval - 10 minutes)
Device 1 sends data at 09:00
Device 2 sends data at 09:02
Device 2 sends data again at 09:11, but Device 1 doesn't send any data.
Output for test case 2:
We want Kinesis Analytics to flag Device 1, so that we know which device is not sending data.
Test case 3 (Interval - 10 minutes)
Device 1 and Device 2 send data continuously to Kinesis Analytics.
Both devices (1 & 2) then send no data for the next 15 minutes.
Output for test case 3:
We want Kinesis Analytics to flag Device 1 & Device 2, so that we know which devices are not sending data.
Test case 4: (Interval - 10 mins)
Device 1 sends data at 09:00
Device 2 sends data at 09:02
Device 1 again sends data at 09:04
Device 2 again sends data at 09:06
Then no more data arrives.
Output for test case 4:
We want the analytics to flag Device 1 at 09:14 and Device 2 at 09:16, so that we can identify the disconnected devices (i.e., devices not sending data) after the exact interval.
Note: AWS Support directed us to simple queries that don't answer the question. It looks like they can help with the exact query only if we upgrade our support plan.
I'm not familiar with all of the ways in which AWS has extended or modified Apache Flink, but open source Flink doesn't provide a simple way to detect that all sources have ceased to send data. One solution is to use something like a process function with processing-time timers to detect the absence of data.
The documentation has an example of something along these lines: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/#example
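For illustration, a rough sketch of that idea in plain open-source Flink (Java API) could look like the following. The HeartbeatEvent type, its fields, and the timeout value are made up for this example; the function is applied to a stream keyed by device id, e.g. stream.keyBy(e -> e.deviceId).process(new InactivityDetector(10 * 60 * 1000L)):

// Sketch only: emits a device id when no event has arrived for that key within
// `timeoutMs` of processing time. HeartbeatEvent is a hypothetical record type.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class InactivityDetector
        extends KeyedProcessFunction<String, InactivityDetector.HeartbeatEvent, String> {

    // Hypothetical event type; in practice this is your sensor record.
    public static class HeartbeatEvent {
        public String deviceId;
        public double value;
    }

    private final long timeoutMs;                 // e.g. 10 * 60 * 1000
    private transient ValueState<Long> lastTimer; // currently armed timer, per key

    public InactivityDetector(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    @Override
    public void open(Configuration parameters) {
        lastTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimer", Long.class));
    }

    @Override
    public void processElement(HeartbeatEvent event, Context ctx, Collector<String> out)
            throws Exception {
        // Cancel the previously armed timer (if any) and arm a new one.
        Long previous = lastTimer.value();
        if (previous != null) {
            ctx.timerService().deleteProcessingTimeTimer(previous);
        }
        long nextTimeout = ctx.timerService().currentProcessingTime() + timeoutMs;
        ctx.timerService().registerProcessingTimeTimer(nextTimeout);
        lastTimer.update(nextTimeout);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // No event arrived for this key within the timeout: report the device.
        out.collect(ctx.getCurrentKey());
    }
}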
I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character]("input",
  new ByteArraySerializer(), new CharacterSerializer())

var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input", Character(123, 12), 15 * 10000L))
  i += 1
}
val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age)) // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2)) // case class
  .count()
  .toStream
  .print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you get is not defined (i.e., non-deterministic); compare Michael Noll's answer.
For your unit test, it should actually not really matter: you can either test for all output records (i.e., all intermediate results), or put all output records into a key-value map and only test for the last emitted record per key (if you don't care about the intermediate results).
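With the (pre-2.4) TopologyTestDriver API used above, collecting only the last record per key could look roughly like this; the output topic name and the String/Long deserializers are placeholders for your actual serdes:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Drain all output records and keep only the last value seen per key,
// so the assertion ignores intermediate aggregation results.
Map<String, Long> lastValuePerKey = new HashMap<>();
ProducerRecord<String, Long> record;
while ((record = testDriver.readOutput(
        "output-topic", new StringDeserializer(), new LongDeserializer())) != null) {
    lastValuePerKey.put(record.key(), record.value());
}
// assert on lastValuePerKey here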
Furthermore, you could use the suppress() operator to get fine-grained control over which output messages you get. suppress() -- in contrast to caching -- is fully deterministic, and thus writing a unit test works well. However, note that suppress() is event-time driven, so if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing, this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
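As a rough illustration (your code above uses a non-windowed count, so this is not a drop-in replacement), a windowed count that emits a single final result per window could look like this in the Java DSL. It assumes Kafka Streams 2.1+, and the topic name and default serdes are placeholders:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("input-topic");

input
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count()
    // Hold back updates until the window closes (event time + grace period),
    // then emit exactly one record per key and window.
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .foreach((Windowed<String> key, Long count) ->
        System.out.println(key + " -> " + count));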
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for that aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms parameter (see Memory Management). However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: it could be 90% in the first hour of data, 76% in the second hour, etc.). That is, the reduction process is deterministic, but the resulting amount of reduction is difficult to predict from the outside.
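In code these are just two properties in the Streams configuration, for example (the application id, bootstrap servers, and values are illustrative only):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Example values only: a 10 MB record cache and a 1 s commit interval.
// A larger cache / longer interval generally means fewer intermediate updates downstream.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "character-counter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000L);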
Note: When doing windowed aggregations (like windowed counts) you can also use the suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the suppress() API.
To help you understand why the setup is this way: you must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record is received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- it's the correct result to the best of its knowledge at that point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where suppress() is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the suppress() API lets you trade off between better latency with multiple outputs per window (default behavior, suppress disabled) and longer latency with only a single output per window (suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
I am using CEP to check whether an event has arrived within a specified amount of time (let's say 1 min). If not, I want to publish an alert.
More specifically, a (server) machine generates a heartbeat data stream and sends it to CEP. The heartbeat stream contains the server id and a timestamp. An alert should be generated if no heartbeat data arrives within the 1 min period.
Is it possible to do something like that with CEP? I have seen other questions regarding the detection of non-occurrences, but I am still not sure how to approach the scenario described above.
You can try this:
define stream heartbeats (serverId string, timestamp long);
from heartbeats#window.time(1 minute) insert expired events into delayedStream;
from every e = heartbeats -> e2 = heartbeats[serverId == e.serverId]
or expired = delayedStream[serverId == e.serverId]
within 1 minute
select e.serverId, e2.serverId as id2, expired.serverId as id3
insert into tmpStream;
// every event on tmpStream with an 'expired' match has timed out
from tmpStream[id3 is not null]
select serverId
insert into expiredHeartbeats;
This is our setup: we have one topic, and this topic has two subscriptions, v1 and v2, with identical settings; both are pull subscriptions with a 10 sec ack deadline.
Both subscriptions v1 and v2 go to separate, dedicated Dataflow pipelines, where v2's pipeline is more optimized but does pretty much the same thing.
The problem is that every now and then we see the warning message below, and a backlog starts to build up in the v2 subscription only, while v1 shows little to no backlog.
08:53:56.000 ${MESSAGE_ID} Pubsub processing delay was high at 72 sec.
The Dataflow log for v2 shows nothing obvious except the messages above. In fact, v2's Dataflow CPU usage is lower than v1's, so I can't make any sense of this.
Questions:
What causes the processing delay and how can I fix it?
Why isn't the v1 subscription getting the same warnings?
Updated at 2017/01/17
As suggested by Ben, it seems that the ParDo filtering operation we do right after the Pub/Sub read is hitting unexpectedly high latency. But considering that getClassroomIds is a simple Java list, I'm not sure how to tackle this problem. One question: is the coder we have applied to Pub/Sub lazy? Is the unzipping and deserializing we have defined in the coder applied when ProcessContext#element() is called?
def processElement(c: DoFn[Entity, Entity]#ProcessContext) = {
  val startTime = System.currentTimeMillis()
  val entity = c.element()
  if (!entity.getClassroomIds.isEmpty) {
    c.output(entity)
  }
  val latencyMs = System.currentTimeMillis() - startTime
  if (latencyMs > 1000) {
    // We see these warning messages during the load spike
    log.warn(s"latency breached 1 second threshold: $latencyMs ms")
  }
}
The timing you mention doesn't accurately account for the time spent in that step. In particular, due to the fusion optimization it reflects all of the ParDos after the filtering operation.
If you have a pipeline which looks like:
ReadFromPubSub -> ParDo(Filter) -> ParDo(Expensive) -> ParDo(Write)
Both Expensive and Write execute on each element coming out of the Filter before the call to c.output returns. This is a further problem because that whole chain is fused onto the elements coming out of Pub/Sub.
The easiest solution is probably to perform a Reshuffle:
pipeline
.apply(PubSubIO.read(...))
.apply(keyEachElement by their value and Void)
.apply(new Reshuffle<E, Void>())
.apply(move the key back to the element)
// rest of the pipeline
Note that using the Reshuffle instead of a regular GroupByKey has nice properties, since it triggers faster than any normal trigger would fire.
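For illustration, with the current Apache Beam Java SDK (the pseudocode above is in the style of the older Dataflow SDK, whose Reshuffle required keyed input) the fusion break could look roughly like this. Reshuffle.viaRandomKey() does the key/group/un-key dance for you; the subscription name and the FilterFn, ExpensiveFn and WriteFn DoFns are placeholders for the steps in your pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create();

PCollection<PubsubMessage> filtered = pipeline
    .apply(PubsubIO.readMessages()
        .fromSubscription("projects/my-project/subscriptions/v2"))  // placeholder name
    .apply("Filter", ParDo.of(new FilterFn()));                     // cheap filter from the question

filtered
    // Breaks fusion: the expensive steps below no longer run inline inside the
    // filter's c.output() call, so the per-step timings become meaningful again.
    .apply(Reshuffle.<PubsubMessage>viaRandomKey())
    .apply("Expensive", ParDo.of(new ExpensiveFn()))                // placeholder downstream steps
    .apply("Write", ParDo.of(new WriteFn()));

pipeline.run();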