Can anybody give me an example of how to use window.frequent?
For example, I wrote this test:
"define stream cseEventStream (symbol string, price float, time long);" +
"" +
"#info(name = 'query1') " +
"from cseEventStream[700 > price]#window.frequent(3, symbol) " +
"select symbol, price, time " +
"insert expired events into outputStream;";
But from the outputStream, I can't figure out the rule.
Thanks.
In this particular query, 'window.frequent(3, symbol)' makes the query find the 3 most frequent symbols (the 3 symbols with the highest number of occurrences). But you have inserted only expired events into outputStream, so as the end result this query will output events that are expired from the frequent window.
In a frequent window, expired events are events that no longer belong to a frequent group; in this case, events whose symbol is not among the 3 symbols with the highest number of occurrences.
For example, if you send the following sequence of events:
{"symbolA", 71.36f, 100}
{"symbolB", 72.36f, 100}
{"symbolB", 74.36f, 100}
{"symbolC", 73.36f, 100}
{"symbolC", 76.36f, 100}
{"symbolD", 76.36f, 100}
{"symbolD", 76.36f, 100}
The query will output {"symbolA", 71.36f, 100}.
When you send the events with 'symbolD', symbolA is no longer among the top 3 symbols with the highest number of occurrences, so the event with symbolA is expired and {"symbolA", 71.36f, 100} is emitted.
For every event that passes the [700 > price] filter, this window retains the 3 most frequent items based on symbol, and since the output type is 'expired events' you will only receive output once an event loses its position as a frequent event.
Ex: for a frequent window of size 2
Input
WSO2 1000 1
WSO2 1000 2
ABC 700 3
XYZ 800 4
Output
ABC 700 3
The ABC event was in the frequent window and was expired upon receiving the XYZ event. If you use the default output type, which is 'current events', it will output all incoming events that are selected as frequent events and put into the window.
The implementation is based on the Misra–Gries counting algorithm.
Documentation: https://docs.wso2.com/display/CEP400/Inbuilt+Windows#InbuiltWindows-frequent
Test cases: https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/test/java/org/wso2/siddhi/core/query/window/FrequentWindowTestCase.java
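For intuition, here is a minimal Python sketch of the Misra–Gries counting idea (an illustration of the algorithm only, not the actual Siddhi window implementation):

# Misra-Gries with k counters: frequent candidates keep a counter; when a new symbol
# arrives and no counter slot is free, every counter is decremented and zeroed ones
# are evicted (the eviction corresponds to the "expired" events described above).
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# The symbol sequence from the example above: symbolA is evicted when the first symbolD arrives.
symbols = ["symbolA", "symbolB", "symbolB", "symbolC", "symbolC", "symbolD", "symbolD"]
print(misra_gries(symbols, 3))   # {'symbolB': 1, 'symbolC': 1, 'symbolD': 1}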
Related
I'm trying to calculate some stats in a sliding window interval
from CsvInputFileStreamWithConvertedTimestamp#window.externalTime(time, 250 milliseconds)
select time as timeslice, time:dateFormat(time, 'yyyy-MM-dd') as date, time:dateFormat(time, 'HH:mm:ss') as time, instrument,
    sum(fin) / sum(quantity) as vwap, max(price) * min(price)-1 as prange, max(price) as prangemax, min(price) as prangemin,
    sum(quantity) as totalquant, avg(quantity) as avgquant, 0 as medquant, sum(fin) as totalfin, avg(fin) as avgfin,
    count() as trades, distinctCount(buyer) as nofbuy, distinctCount(seller) as nofsell,
    cast(distinctCount(seller), 'double') / distinctCount(buyer) as bsratio, count(buyer) as buyaggr, count(seller) as sellaggr,
    sum(quantity) as totalblockquant
insert expired events into OutputStream;
But in the output most values are null.
This is my input data.
Any idea what I'm doing wrong here?
Solved.
I was using 'insert expired events into' instead of 'insert all events into'.
I have a requirement to collect multiple events into a single one and consider it as a single event. The output stream should include the input events as a list.
Basically, when the following events are passed:
{InvoiceNumber:1, LineItem: 1, Value: 100}
{InvoiceNumber:1, LineItem: 2, Value: 200}
{InvoiceNumber:1, LineItem: 3, Value: 300}
I need the output to be like the following
{InvoiceNumber: 1, lineItems: [{LineItem: 1, Value: 100}, {LineItem: 2, Value: 200}, {LineItem: 3, Value: 300}]}
How do I achieve this with WSO2 Streaming Integrator or Siddhi.io?
I attempted the following, but it still inserts each event into the output stream separately:
partition with (InvoiceNo of CSVInputStream)
begin
from every e1=CSVInputStream
within 1 min
select e1.InvoiceNo, list:collect(e1.LineItem) as lineItems
group by e1.InvoiceNo
insert into AggregatedStream;
end;
Do not use partitions; as this is a simple use case, try windows instead. A time batch window fits your case: https://siddhi.io/en/v5.1/docs/api/latest/#timebatch-window
from CSVInputStream#window.timeBatch(1 min)
select InvoiceNo, list:collect(LineItem) as lineItems
group by InvoiceNo
insert into AggregatedStream;
After reading data from an unbounded source like Pub/Sub, I'm applying windowing. I need to write all the records belonging to a window to a separate file. I found this in Java but couldn't find anything in Python.
There are no details about your use case in the question, so you might need to adapt some parts of the following example. One way to do it is to group elements using the window they belong to as the key. Then, we leverage filesystems.FileSystems.create to control how we want to write the files.
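The snippets below assume the usual Beam imports and an already-created pipeline object p, roughly along these lines (the pipeline options here are placeholders):

import time

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.io import filesystems
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline object referenced as `p` in the snippets below; runner/options are placeholders.
p = beam.Pipeline(options=PipelineOptions())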
Here I will be using 10-second windows and some dummy data where events are separated by 4 seconds each, generated with:
data = [{'event': '{}'.format(event), 'timestamp': time.time() + 4*event} for event in range(10)]
We use the timestamp field to assign element timestamps (this is just to emulate Pub/Sub events in a controlled way). We window the events, use the windowing info as the key, group by key and write the results to the output folder:
events = (p
          | 'Create Events' >> beam.Create(data)
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
          | 'Add Windows' >> beam.WindowInto(window.FixedWindows(10))
          | 'Add Window Info' >> beam.ParDo(AddWindowingInfoFn())
          | 'Group By Window' >> beam.GroupByKey()
          | 'Windowed Writes' >> beam.ParDo(WindowedWritesFn('output/')))
Where AddWindowingInfoFn is pretty straightforward:
class AddWindowingInfoFn(beam.DoFn):
    """output tuple of window(key) + element(value)"""
    def process(self, element, window=beam.DoFn.WindowParam):
        yield (window, element)
and WindowedWritesFn writes to the path that we specified in the pipeline (the output/ folder in my case). Then, I use the window info for the name of the file. For convenience, I convert the epoch timestamps to human-readable dates. Finally, we iterate over all the elements and write them to the corresponding file. Of course, this behavior can be tuned at will in this function:
class WindowedWritesFn(beam.DoFn):
    """write one file per window/key"""
    def __init__(self, outdir):
        self.outdir = outdir
    def process(self, element):
        (window, elements) = element
        window_start = str(window.start.to_utc_datetime()).replace(" ", "_")
        window_end = str(window.end.to_utc_datetime()).replace(" ", "_")
        writer = filesystems.FileSystems.create(self.outdir + window_start + ',' + window_end + '.txt')
        for row in elements:
            writer.write(str(row) + "\n")
        writer.close()
This will write elements belonging to each window to a different file. In my case there are 5 different ones:
$ ls output/
2019-05-21_19:01:20,2019-05-21_19:01:30.txt
2019-05-21_19:01:30,2019-05-21_19:01:40.txt
2019-05-21_19:01:40,2019-05-21_19:01:50.txt
2019-05-21_19:01:50,2019-05-21_19:02:00.txt
2019-05-21_19:02:00,2019-05-21_19:02:10.txt
The first one contains only element 0 (this will vary between executions):
$ cat output/2019-05-21_19\:01\:20\,2019-05-21_19\:01\:30.txt
{'timestamp': 1558465286.933727, 'event': '0'}
The second one contains elements 1 to 3 and so on:
$ cat output/2019-05-21_19\:01\:30\,2019-05-21_19\:01\:40.txt
{'timestamp': 1558465290.933728, 'event': '1'}
{'timestamp': 1558465294.933728, 'event': '2'}
{'timestamp': 1558465298.933729, 'event': '3'}
A caveat of this approach is that all elements from the same window are grouped onto the same worker. This would happen anyway if writing to a single shard or output file as in your case, but for higher loads you might need to consider larger machine types.
Full code here
I have data coming in from 3 different servers (3 data streams). Is it possible to merge these data (a kind of update/upsert) in a Kinesis consumer application and get the updated data as output?
The data I have from streams 2 and 3 is dependent on stream 1. For example:
Stream 1 (ItemDetails) - {itemId, itemName, itemdescription}
Stream 2 (ItemtoCart) - {ItemId}
Stream 3 (Itemordered) - {ItemId}
The final stream output I am expecting is:
OutputStream - {itemId, itemName, itemdescription, itemtoCart_flag, itemOrdered_flag}
Stream 1 is receiving the data at the rate of 10K records/sec.
Say there are three streams as below (stream - event in stream):
stream1 (ItemPurchased) - {"item" : 1, "totalQuantity": 100}
stream2 (ItemOrdered) - {"item" : 1, "sold": 1}
stream3 (ItemCancelled) - {"item" : 1, "orderCancelled": 1}
The streams represent an item being purchased, then sold and/or the order being cancelled.
Say I want to build the final state of the available item quantity from these events.
What I would do is:
consume each event in the stream (Kinesis has Lambda functionality, but I am not sure how easily it talks to non-AWS datastores like MongoDB or Cassandra),
and have logic that builds the final state based on each event.
State transition table (stream event -> consumer action on event -> state, which could live in MongoDB or Cassandra):
stream1 (ItemPurchased) - {"item" : 1, "totalQuantity": 100} -> create new state -> {"item" : 1, "availableQuantity": 100}
stream2 (ItemOrdered) - {"item" : 1, "sold": 1} -> decrease the quantity -> {"item" : 1, "availableQuantity": 100 - 1}
stream3 (ItemCancelled) - {"item" : 1, "orderCancelled": 1} -> increase the quantity -> {"item" : 1, "availableQuantity": 99 + 1}
Hope that answers your question, but unlike what you asked for, this gives you a final state table rather than a stream.
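For illustration, here is a minimal Python sketch of that per-event logic; the datastore is mocked with a plain dict and all names are hypothetical:

# In-memory stand-in for the state store (MongoDB/Cassandra in the answer above).
state = {}

def on_event(stream_name, event):
    item = event["item"]
    if stream_name == "ItemPurchased":
        # Create the initial state for the item.
        state[item] = {"item": item, "availableQuantity": event["totalQuantity"]}
    elif stream_name == "ItemOrdered":
        # An order decreases the available quantity.
        state[item]["availableQuantity"] -= event["sold"]
    elif stream_name == "ItemCancelled":
        # A cancelled order puts the quantity back.
        state[item]["availableQuantity"] += event["orderCancelled"]
    return state[item]

# Replaying the three example events ends with {"item": 1, "availableQuantity": 100}.
on_event("ItemPurchased", {"item": 1, "totalQuantity": 100})
on_event("ItemOrdered", {"item": 1, "sold": 1})
print(on_event("ItemCancelled", {"item": 1, "orderCancelled": 1}))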
I am using the optimizer in PyAlgoTrade to run my strategy and find the best parameters. The message I get is this:
2015-04-09 19:33:35,545 broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1681] for 888 share/s
2015-04-09 19:33:35,546 broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1684] for 998 share/s
2015-04-09 19:33:35,547 server [INFO] Partial result 7160083.45 with parameters: ('600800', 4, 19) from worker-16216
2015-04-09 19:33:36,049 server [INFO] Best final result 7160083.45 with parameters: ('600800', 4, 19) from client worker-16216
This is just part of the message. You can see that only for the parameters ('600800', 4, 19) do we get a result; for other combinations of parameters, I get the message: broker.backtesting [DEBUG] Not enough cash to fill 600800 order [1684] for 998 share/s.
I think this message means that I have created a buy order but I do not have enough cash to buy it. However, from my script below:
shares = self.getBroker().getShares(self.__instrument)
# Enter: price broke above 'up' and we currently hold no shares.
if bars[self.__instrument].getPrice() > up and shares == 0:
    sharesToBuy = int(self.getBroker().getCash() / bars[self.__instrument].getPrice())
    self.marketOrder(self.__instrument, sharesToBuy)
# Exit: price rose above 'up_stop' or fell back below 'up'.
if shares != 0 and bars[self.__instrument].getPrice() > up_stop:
    self.marketOrder(self.__instrument, -1 * shares)
if shares != 0 and bars[self.__instrument].getPrice() < up:
    self.marketOrder(self.__instrument, -1 * shares)
The logic of my strategy is: if the current price is larger than up, we buy; and if, after we buy, the current price is larger than up_stop or smaller than up, we sell. So from the code there is no way I would generate an order that I do not have enough cash to pay for, because the order size is calculated from my current cash.
So where did I go wrong?
You calculate the order size based on the current price, but the price for the next bar may have gone up. The order is not filled in the current bar, but starting from the next bar.
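A hypothetical numeric illustration of that timing gap (the prices here are made up):

cash = 10000.0
price_now = 10.00                       # price of the bar where the order is placed
shares_to_buy = int(cash / price_now)   # 1000 shares, sized against the current bar
price_next_bar = 10.05                  # the order is only filled on the next bar
cost = shares_to_buy * price_next_bar   # 10050.0 > cash -> "Not enough cash to fill ..."
print(shares_to_buy, cost, cost > cash)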
With respect to the 'Partial result' and 'Best final result' messages: how many combinations of parameters are you trying? Note that if you are using 10 different combinations, you won't get 10 different 'Partial result' messages, because they are evaluated in batches of 200 combinations and only the best partial result for each batch of 200 combinations gets printed.