WSO2 Streaming Integrator - Siddhi.io: Aggregate multiple events into one event

I have a requirement to collect multiple events and treat them as a single event. The output stream should include the input events as a list.
Basically, when the following events are passed
{InvoiceNumber:1, LineItem: 1, Value: 100}
{InvoiceNumber:1, LineItem: 2, Value: 200}
{InvoiceNumber:1, LineItem: 3, Value: 300}
I need the output to be like the following
{InvoiceNumber: 1, lineItems: [{LineItem: 1, Value: 100}, {LineItem: 2, Value: 200}, {LineItem: 3, Value: 300}]}
How do I achieve this with WSO2 Streaming Integrator or Siddhi.io?
I attempted the following, but it still inserts each event into the output stream individually:
partition with (InvoiceNo of CSVInputStream)
begin
from every e1=CSVInputStream
within 1 min
select e1.InvoiceNo, list:collect(e1.LineItem) as lineItems
group by e1.InvoiceNo
insert into AggregatedStream;
end;

You do not need partitions for this simple use case; use windows instead. A time batch window fits your case: https://siddhi.io/en/v5.1/docs/api/latest/#timebatch-window
from CSVInputStream#window.timeBatch(1 min)
select InvoiceNo, list:collect(LineItem) as lineItems
group by InvoiceNo
insert into AggregatedStream;

Related

How to fix error "404" in Power BI M query

I am getting error code 404 in PowerBI.
Sample Data:
ID, text
1, #VirginAmerica What #dhepburn said.
2, #VirginAmerica plus you've added commercials to the experience..
3, #VirginAmerica I didn't today... Must mean I need to take another
4, "#VirginAmerica it's really aggressive to blast obnoxious ""entert
5, #VirginAmerica and it's a really big bad thing about it
I am trying to compute a sentiment score in Power BI using an M query that calls Microsoft Cognitive Services.
Below is my query.
Query: (Source as table) as any =>
let
    JsonRecords = Text.FromBinary(Json.FromValue(Source)),
    JsonRequest = "{""documents"": " & JsonRecords & "}",
    JsonContent = Text.ToBinary(JsonRequest, TextEncoding.Ascii),
    Response =
        Web.Contents("https://westcentralus.api.cognitive.microsoft.com/text/analytics",
        [
            Headers = [#"Ocp-Apim-Subscription-Key" = APIkey,
                       #"Content-Type" = "application/json", Accept = "application/json"],
            Content = JsonContent
        ]),
    JsonResponse = Json.Document(Response, 1252)
in
    JsonResponse
(Query end)
Error:
An error occurred in the ‘’ query. DataSource.Error: Web.Contents failed to get contents from 'https://vizguru.cognitiveservices.azure.com/text/analytics/v2.1/sentiment/keyPhrases' (404): Resource Not Found
Details:
DataSourceKind=Web
DataSourcePath=https://vizguru.cognitiveservices.azure.com/text/analytics/v2.1/sentiment/keyPhrases
Url=https://vizguru.cognitiveservices.azure.com/text/analytics/v2.1/sentiment/keyPhrases
Expected Output:
ID, text, Score,
1, #VirginAmerica What #dhepburn said., 2
2, #VirginAmerica plus you've added commercials to the experience..., 3
3, #VirginAmerica I didn't today... Must mean I need to take another !,4
4, #VirginAmerica it's really aggressive to blast, 5
5, #VirginAmerica and it's a really big bad thing about it, 6
I followed the steps from this link.
Please suggest a solution.
Thanks,
Shiva
Yes, I can reproduce your 404 error on my side. The reason is that the Text Analytics endpoint
https://westcentralus.api.cognitive.microsoft.com/text/analytics
no longer works.
If you want to use the Azure Cognitive Services Text Analytics API, please refer to this guide to find the correct endpoint and the ways to call the service.
If you have any further concerns, please feel free to let me know.
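As a quick sanity check outside of Power BI, a small script along the lines of the sketch below can confirm that a Text Analytics resource endpoint and key respond before wiring them back into M. This is only a sketch, assuming the v2.1 sentiment operation; <your-resource> and YOUR_KEY are placeholders, not values from the question.

# Minimal sketch to verify a Text Analytics endpoint and key
# (assumptions: v2.1 sentiment operation, placeholder resource name and key).
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v2.1/sentiment"
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY"}
body = {
    "documents": [
        {"id": "1", "language": "en", "text": "#VirginAmerica What #dhepburn said."}
    ]
}

response = requests.post(endpoint, headers=headers, json=body)
print(response.status_code)  # 200 means the endpoint and key are correct
print(response.json())       # per-document sentiment scores

If this returns 404, the path or resource name is wrong; if it returns 401, the key does not match the resource.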

Sampling rate for data returned with boto3 get_metric_statistics()

The documentation is here...
http://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.get_metric_statistics
Here is our call
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',  # reported every 5 minutes
    Dimensions=[
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'Celery-AutoScalingGroup'
        },
    ],
    StartTime=now - datetime.timedelta(minutes=12),
    EndTime=now,
    Period=60,  # I can't figure out what exactly changing this is doing
    Statistics=['Average', 'SampleCount', 'Sum', 'Minimum', 'Maximum'],
)
Here is our response
>>> response['Datapoints']
[ {u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 46, tzinfo=tzutc()), u'Average': 0.05, u'Maximum': 0.17, u'Minimum': 0.0, u'Sum': 0.25, u'Unit': 'Percent'},
{u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 51, tzinfo=tzutc()), u'Average': 0.034, u'Maximum': 0.08, u'Minimum': 0.0, u'Sum': 0.17, u'Unit': 'Percent'}
]
Here is my question
Look at first dictionary in the returned list. SampleCount of 5 makes sense, I guess, because our Period is 60 (seconds) and CloudWatch supplies 'CPUUtilization' metric every 5 minutes.
But if I change Period, to say 3 minutes (180), I am still getting a SampleCount of 5 (I'd expect 1 or 2).
This is a problem because I want the Average, but I think it is averaging 5 datapoints, only 2 of which are valid (the beginning and end, which correspond to the Min and Max, that is, the CloudWatch metric at some time t and the next reporting of that metric at time t + 5 minutes).
It is averaging this with 3 intermediate 0-value datapoints, so that the Average is (Minimum + Maximum + 0 + 0 + 0) / 5.
I could just take the Minimum and Maximum, add them, and divide by 2 for a better reading, but I was hoping somebody could explain exactly what that 'Period' parameter is doing.
Like I said, changing it to 360 didn't change SampleCount, but when I changed it to 600, suddenly my SampleCount was 10.0 for one datapoint (that does make sense).
Data can be published to CloudWatch in two different ways:
You can publish your observations one by one and let CloudWatch do the aggregation.
You can aggregate the data yourself and publish the statistic set (SampleCount, Sum, Minimum, Maximum).
If data is published using method 1, you would get the behaviour you were expecting. But if the data is published using method 2, you are limited by the granularity of the published data.
If EC2 is aggregating the data for 5 minutes and then publishing a statistic set, there is no point in requesting data at the 3-minute level. However, if you request data with a period that is a multiple of the period the data was published with (e.g. 10 minutes), the stats can be calculated, which is what CloudWatch does.
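To make the difference concrete, below is a minimal sketch of method 2 using boto3's put_metric_data with a StatisticValues block; the namespace and metric name are made up for illustration, and this mirrors what the answer describes EC2 as doing for CPUUtilization.

# Minimal sketch of "method 2": publishing a pre-aggregated statistic set.
# The namespace and metric name here are hypothetical.
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp/Test',
    MetricData=[
        {
            'MetricName': 'RequestLatency',
            'Timestamp': datetime.datetime.utcnow(),
            'StatisticValues': {
                'SampleCount': 5,   # five observations aggregated before publishing
                'Sum': 0.25,
                'Minimum': 0.0,
                'Maximum': 0.17,
            },
            'Unit': 'Seconds',
        },
    ],
)

CloudWatch only ever stores that single 5-sample set, so a later get_metric_statistics call can aggregate it upwards (for example into 10-minute periods) but can never split it into finer periods, which is the behaviour observed in the question.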

How do I combine multiple kinesis data streams into 1 data stream?

I have data coming in from 3 different servers (3 data streams). Is it possible to merge these data (a kind of update/upsert) in a Kinesis consumer application and get the updated data as output?
The data I have in streams 2 and 3 depends on stream 1. For example:
Stream 1 (ItemDetails) - {itemId, itemName, itemdescription}
Stream 2 (ItemtoCart) - {ItemId}
Stream 3 (Itemordered) - {ItemId}
The final stream output I am expecting is
OutputStream - {itemId, itemName, itemdescription, itemtoCart_flag, itemOrdered_flag}
Stream 1 is receiving the data at the rate of 10K records/sec.
Say there are three streams as below:
stream                   event in stream
stream1(ItemPurchased)   {"item": 1, "totalQuantity": 100}
stream2(ItemOrdered)     {"item": 1, "sold": 1}
stream3(ItemCancelled)   {"item": 1, "orderCancelled": 1}
The streams represent an item being purchased, then sold and/or cancelled.
Say I want to build the final state of an item's available quantity from these events.
What I would do is:
consume each event in the stream (Kinesis has Lambda functionality, but I am not sure how easily it talks to non-AWS datastores like MongoDB or Cassandra)
and have logic that builds the final state based on each event.
State transition table
stream event                                                 consumer/onEvent          state (could be MongoDB, Cassandra)
stream1(ItemPurchased) - {"item": 1, "totalQuantity": 100}   -> create new state       -> {"item": 1, "availableQuantity": 100}
stream2(ItemOrdered) - {"item": 1, "sold": 1}                -> decrease the quantity  -> {"item": 1, "availableQuantity": 100 - 1}
stream3(ItemCancelled) - {"item": 1, "orderCancelled": 1}    -> increase the quantity  -> {"item": 1, "availableQuantity": 99 + 1}
Hope that answers your question, although unlike what you asked for, the result is a final state table rather than a stream.
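For illustration, here is a minimal sketch of the per-event handler described above, keeping the state in an in-memory dict instead of MongoDB or Cassandra; the stream names and record shapes are taken from the hypothetical example, not from a real Kinesis setup.

# Minimal sketch of the consumer/onEvent logic from the state transition table.
# The dict stands in for a real store such as MongoDB or Cassandra.
state = {}  # item id -> {"item": ..., "availableQuantity": ...}

def on_event(stream_name, event):
    item_id = event["item"]
    if stream_name == "ItemPurchased":
        # create new state with the purchased quantity
        state[item_id] = {"item": item_id, "availableQuantity": event["totalQuantity"]}
    elif stream_name == "ItemOrdered":
        # decrease the available quantity by the amount sold
        state[item_id]["availableQuantity"] -= event["sold"]
    elif stream_name == "ItemCancelled":
        # a cancelled order puts the quantity back
        state[item_id]["availableQuantity"] += event["orderCancelled"]
    return state[item_id]

# Replaying the example events:
on_event("ItemPurchased", {"item": 1, "totalQuantity": 100})
on_event("ItemOrdered", {"item": 1, "sold": 1})
print(on_event("ItemCancelled", {"item": 1, "orderCancelled": 1}))
# -> {'item': 1, 'availableQuantity': 100}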

How to use window.frequent?

Can anybody give me an example of how to use window.frequent?
For example, I wrote the following test:
"define stream cseEventStream (symbol string, price float, time long);" +
"" +
"#info(name = 'query1') " +
"from cseEventStream[700 > price]#window.frequent(3, symbol) " +
"select symbol, price, time " +
"insert expired events into outputStream;";
But from the outputStream, I can't figure out the rule.
Thanks.
In this particular query, 'window.frequent(3, symbol)' makes the query track the 3 most frequent symbols (that is, the 3 symbols with the highest number of occurrences). But when inserting into outputStream you have inserted only expired events, so as the end result this query outputs events that are expired from the frequent window.
In a frequent window, expired events are events that no longer belong to a frequent group; in this case, events whose symbol is no longer among the 3 symbols with the highest number of occurrences.
For example, if you send the following sequence of events,
{"symbolA", 71.36f, 100}
{"symbolB", 72.36f, 100}
{"symbolB", 74.36f, 100}
{"symbolC", 73.36f, 100}
{"symbolC", 76.36f, 100}
{"symbolD", 76.36f, 100}
{"symbolD", 76.36f, 100}
The query will output {"symbolA", 71.36f, 100}.
When you send the events with 'symbolD', symbolA is no longer among the top 3 symbols with the highest number of occurrences, so the event with symbolA expires and {"symbolA", 71.36f, 100} is emitted.
For every event that matches the filter [700 > price] (i.e. price below 700), this window will retain the 3 most frequent items based on symbol, and since the output type is 'expired events' you will only receive output once an event loses its position as a frequent event.
Ex: for frequent window of size 2
Input
WSO2 1000 1
WSO2 1000 2
ABC 700 3
XYZ 800 4
Output
ABC 700 3
The ABC event was in the frequent window and expired upon receiving the XYZ event. If you use the default output, which is 'current events', it will output all incoming events that are selected as frequent events and put into the window.
The implementation is based on the Misra-Gries counting algorithm.
Documentation : https://docs.wso2.com/display/CEP400/Inbuilt+Windows#InbuiltWindows-frequent
Test cases : https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/test/java/org/wso2/siddhi/core/query/window/FrequentWindowTestCase.java
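To illustrate why events "expire", here is a rough sketch of the Misra-Gries counting idea in Python; it is not Siddhi's actual implementation, but evictions from the counter set play the same role as the expired events the frequent window emits.

# Rough sketch of Misra-Gries counting with k counters (not Siddhi's code).
# When a new item arrives and all k slots are taken, every counter is
# decremented and items that reach zero are evicted -- those evictions
# correspond to the window's "expired events".
def misra_gries(events, k):
    counters = {}
    expired = []
    for item in events:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
                    expired.append(key)  # dropped out of the frequent set
    return counters, expired

counters, expired = misra_gries(["WSO2", "WSO2", "ABC", "XYZ"], k=2)
print(counters)  # {'WSO2': 1} -- approximate counts, not exact
print(expired)   # ['ABC'] -- matches the expired-event output in the size-2 example above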

Store data group by datetime column using pig

Say that I have the dataset like this
1, 3, 2015-03-25 11-15-13
1, 4, 2015-03-26 11-16-14
1, 4, 2015-03-25 11-16-15
1, 5, 2015-03-27 11-17-11
...
I want to store the data by datetime
so I will have the following output folders
2015-03-25/
2015-03-26/
2015-03-27/
...
How can I do that with Pig?
Thank you
You can use MultiStorage for this (it ships with the Piggybank, so register piggybank.jar first).
Use a FOREACH ... GENERATE to create a column that contains the date part you are interested in (e.g. 2015-03-25), and then something like
REGISTER piggybank.jar;
STORE X INTO '/my/home/output' USING org.apache.pig.piggybank.storage.MultiStorage('/my/home/output', '2');
where '2' is the zero-based index of that date column in X; each distinct value of that column becomes its own output subdirectory.