I have data coming in from 3 different servers (3 data streams). Is it possible to merge these data (a kind of update/upsert) in a Kinesis consumer application and get the updated data as output?
The data in streams 2 and 3 depends on stream 1. For example:
Stream 1 (ItemDetails) - {itemId, itemName, itemdescription}
Stream 2 (ItemtoCart) - {ItemId}
Stream 3 (Itemordered) - {ItemId}
The final stream output I am expecting is:
OutputStream - {itemId, itemName, itemdescription, itemtoCart_flag, itemOrdered_flag}
Stream 1 is receiving the data at the rate of 10K records/sec.
Say there are three streams as below,
stream - event in the stream
stream1(ItemPurchased) - {"item" : 1, "totalQuantity": 100}
stream2(ItemOrdered) - {"item" : 1, "sold": 1}
stream3(ItemCancelled) - {"item" : 1, "orderCancelled": 1}
The streams represent an item being purchased, then sold and/or cancelled.
Say I want to build a final state of the item's available quantity from these events.
What I would do is:
consume each event in the stream (Kinesis has Lambda functionality, but I am not sure how easily it talks to non-AWS datastores like MongoDB or Cassandra),
and have logic to build a final state based on each event.
State transition table
stream event -> consumer/onEvent action -> state (could be MongoDB, Cassandra)
stream1(ItemPurchased) - {"item" : 1, "totalQuantity": 100} -> create new state -> {"item" : 1, "availableQuantity": 100}
stream2(ItemOrdered) - {"item" : 1, "sold": 1} -> decrease the quantity -> {"item" : 1, "availableQuantity": 100 - 1}
stream3(ItemCancelled) - {"item" : 1, "orderCancelled": 1} -> increase the quantity -> {"item" : 1, "availableQuantity": 99 + 1}
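As a rough illustration of that onEvent logic, here is a minimal Python sketch that uses an in-memory dict as a stand-in for MongoDB/Cassandra (the names are illustrative, not an actual Kinesis/Lambda handler):

state = {}  # itemId -> current state document (would live in MongoDB/Cassandra)

def on_event(stream, event):
    item = event["item"]
    if stream == "ItemPurchased":
        # create new state
        state[item] = {"item": item, "availableQuantity": event["totalQuantity"]}
    elif stream == "ItemOrdered":
        # decrease the quantity
        state[item]["availableQuantity"] -= event["sold"]
    elif stream == "ItemCancelled":
        # increase the quantity
        state[item]["availableQuantity"] += event["orderCancelled"]
    return state[item]

on_event("ItemPurchased", {"item": 1, "totalQuantity": 100})
on_event("ItemOrdered", {"item": 1, "sold": 1})
print(on_event("ItemCancelled", {"item": 1, "orderCancelled": 1}))
# -> {'item': 1, 'availableQuantity': 100}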
Hope that answers your question, although unlike what you asked for, this gives you a final state table rather than a stream.
I have searched all over StackOverflow / Reddit / etc, but can't seem to find anything like this.
Assuming we have the following model (purposely simplified):
from django.db import models

class X(models.Model):
    related_object = models.ForeignKey(Y, on_delete=models.CASCADE)
    start_date = models.DateField(unique=True)
    is_rented = models.BooleanField()
Where model X is tracking state changes of an instance of model Y. It could be tracking something like the state of cars at a car rental agency that may be rented out. A new instance of the model is created each time the state of each car changes.
The resulting objects might look something like this:
{"id" : 1, "related_object" : 2, "start_date" : "2021-03-03", "is_rented" : False},
{"id" : 2, "related_object" : 2, "start_date" : "2021-03-06", "is_rented" : False},
{"id" : 3, "related_object" : 2, "start_date" : "2021-03-10", "is_rented" : True},
{"id" : 4, "related_object" : 2, "start_date" : "2021-03-15", "is_rented" : False},
{"id" : 5, "related_object" : 2, "start_date" : "2021-03-16", "is_rented" : True},
{"id" : 6, "related_object" : 4, "start_date" : "2021-03-16", "is_rented" : False},
{"id" : 7, "related_object" : 2, "start_date" : "2021-03-17", "is_rented" : False},
{"id" : 8, "related_object" : 4, "start_date" : "2021-03-22", "is_rented" : True},
I want to be able to perform the following queries:
For a provided date, return only the X instance that is active for that date
For a provided date range, return all X instances that overlap with the range
Everything I find on SO seems to involve models with both start and end date fields. In this case, we are assuming that a model instance continues in its state until a new instance is created with a date further into the future than the previous instance. For instance, the instance with id=1 above would essentially have an end date of 2021-03-06. Both instances id=7 and id=8 essentially have no end date (forever into the future) until a new instance is created with a date after theirs.
It seems like this should be a really common use case anytime someone wants to track state changes and the end date isn't necessarily known. Or I may want to be able to insert a new record between, say, id=3 and id=4, with a date of 2021-03-11, and I shouldn't have to recalculate the value of id=3 to adjust the end date manually.
Examples:
If I query for instances of X where related_object=2 and the date I'm interested in is 2021-03-07, I should get back the X instance where id=2.
If I query for instances of X where related_object=2 and the range of dates I am interested in is 2021-03-04 through 2021-03-11, I should get back the X instances with id=1, 2, and 3; the catch is that a plain filter on start_date would exclude id=1 because its date starts before the range, even though its status was still active on 2021-03-04.
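To make the intent concrete, this is roughly the kind of lookup I imagine for the two cases (an untested sketch; active_for_date and overlapping_range are just illustrative helper names):

def active_for_date(related_object_id, date):
    # The latest state row on or before `date` is the one still in effect.
    return (
        X.objects.filter(related_object_id=related_object_id, start_date__lte=date)
        .order_by("-start_date")
        .first()
    )

def overlapping_range(related_object_id, range_start, range_end):
    qs = X.objects.filter(related_object_id=related_object_id)
    # Rows whose state begins inside the range...
    inside = list(
        qs.filter(start_date__gte=range_start, start_date__lte=range_end)
        .order_by("start_date")
    )
    # ...plus the row that was already active when the range starts (id=1 above),
    # unless a row starts exactly on range_start and supersedes it.
    active_at_start = (
        qs.filter(start_date__lt=range_start).order_by("-start_date").first()
    )
    if active_at_start is not None and not (inside and inside[0].start_date == range_start):
        inside.insert(0, active_at_start)
    return inside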
Any recommendations for how to approach this?
I have a requirement to collect multiple events into a single one and treat that as one event. The output stream should include the input events as a list.
Basically, when the following events are passed
{InvoiceNumber:1, LineItem: 1, Value: 100}
{InvoiceNumber:1, LineItem: 2, Value: 200}
{InvoiceNumber:1, LineItem: 3, Value: 300}
I need the output to be like the following
{InvoiceNumber:1, lineItems: [{LineItem: 1, Value: 100}, {LineItem: 2, Value: 200}, {LineItem: 3, Value: 300}]}
How do I achieve this with WSO2 Streaming Integrator or Siddhi.io?
I attempted the following, but it is still inserting each event into the output stream separately:
partition with (InvoiceNo of CSVInputStream)
begin
from every e1=CSVInputStream
within 1 min
select e1.InvoiceNo, list:collect(e1.LineItem) as lineItems
group by e1.InvoiceNo
insert into AggregatedStream;
end;
Do not use partitions; since this is a simple use case, try windows instead. A time batch window fits your case: https://siddhi.io/en/v5.1/docs/api/latest/#timebatch-window
from CSVInputStream#window.timeBatch(1 min)
select InvoiceNo, list:collect(LineItem) as lineItems
group by InvoiceNo
insert into AggregatedStream;
The documentation is here...
http://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.get_metric_statistics
Here is our call
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',  # reported every 5 minutes
    Dimensions=[
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'Celery-AutoScalingGroup'
        },
    ],
    StartTime=now - datetime.timedelta(minutes=12),
    EndTime=now,
    Period=60,  # I can't figure out what exactly changing this is doing
    Statistics=['Average', 'SampleCount', 'Sum', 'Minimum', 'Maximum'],
)
Here is our response
>>> response['Datapoints']
[ {u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 46, tzinfo=tzutc()), u'Average': 0.05, u'Maximum': 0.17, u'Minimum': 0.0, u'Sum': 0.25, u'Unit': 'Percent'},
{u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 51, tzinfo=tzutc()), u'Average': 0.034, u'Maximum': 0.08, u'Minimum': 0.0, u'Sum': 0.17, u'Unit': 'Percent'}
]
Here is my question
Look at first dictionary in the returned list. SampleCount of 5 makes sense, I guess, because our Period is 60 (seconds) and CloudWatch supplies 'CPUUtilization' metric every 5 minutes.
But if I change Period to, say, 3 minutes (180), I am still getting a SampleCount of 5 (I'd expect 1 or 2).
This is a problem because I want the Average, but I think it is averaging 5 datapoints, only 2 of which are valid (the beginning and end, which correspond to the Min and Max, that is, the CloudWatch metric at some time t and the next report of that metric at time t+5 minutes).
It is averaging these with 3 intermediate zero-value datapoints, so the Average is (Minimum + Maximum + 0 + 0 + 0)/5.
I can just take the Minimum and Maximum, add them, and divide by 2 for a better reading, but I was hoping somebody could explain exactly what that 'Period' parameter is doing.
Like I said, changing it to 360 didn't change SampleCount, but when I changed it to 600, suddenly my SampleCount was 10.0 for one datapoint (that does make sense).
Data can be published to CloudWatch in two different ways:
You can publish your observations one by one and let CloudWatch do the aggregation.
You can aggregate the data yourself and publish the statistic set (SampleCount, Sum, Minimum, Maximum).
If data is published using method 1, you would get the behaviour you were expecting. But if the data is published using method 2, you are limited by the granularity of published data.
If EC2 aggregates the data over 5 minutes and then publishes the statistic set, there is no point in requesting data at a 3-minute level. However, if you request data with a period that is a multiple of the period the data was published with (e.g. 10 minutes), the stats can be calculated, which is what CloudWatch does.
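For illustration, here is roughly what the two publishing methods look like with boto3's put_metric_data (a sketch; 'MyApp' and 'QueueDepth' are made-up names, not the EC2 metric above):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Method 1: publish raw observations one by one and let CloudWatch aggregate them.
# Statistics can later be requested at any supported Period.
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{'MetricName': 'QueueDepth', 'Value': 42.0, 'Unit': 'Count'}],
)

# Method 2: aggregate yourself and publish a statistic set (this is effectively
# what EC2 does every 5 minutes). CloudWatch cannot split such a set into finer
# periods, which is why a 60-second Period still returns SampleCount=5.
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'QueueDepth',
        'StatisticValues': {'SampleCount': 5.0, 'Sum': 30.0, 'Minimum': 2.0, 'Maximum': 10.0},
        'Unit': 'Count',
    }],
)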
Can anybody give me an example of how to use window.frequent?
For example, I wrote a test:
"define stream cseEventStream (symbol string, price float, time long);" +
"" +
"#info(name = 'query1') " +
"from cseEventStream[700 > price]#window.frequent(3, symbol) " +
"select symbol, price, time " +
"insert expired events into outputStream;";
But from the outputStream, I can't figure out the rule.
Thanks.
In this particular query, 'window.frequent(3, symbol)' makes the query track the 3 most frequent symbols (the 3 symbols with the highest number of occurrences). But you have inserted only expired events into outputStream, so the end result is that this query outputs events that are expired from the frequent window.
In a frequent window, expired events are events that no longer belong to a frequent group; in this case, events whose symbol is no longer among the 3 symbols with the highest number of occurrences.
For example, if you send the following sequence of events:
{"symbolA", 71.36f, 100}
{"symbolB", 72.36f, 100}
{"symbolB", 74.36f, 100}
{"symbolC", 73.36f, 100}
{"symbolC", 76.36f, 100}
{"symbolD", 76.36f, 100}
{"symbolD", 76.36f, 100}
The query will output {"symbolA", 71.36f, 100}.
When you send the events with 'symbolD', symbolA is no longer among the top 3 symbols with the highest number of occurrences, so the event with symbolA expires and {"symbolA", 71.36f, 100} is emitted.
For every event with a price below 700 (the [700 > price] filter), this window will retain the most frequent 3 items based on symbol, and since the output type is 'expired events' you will only receive output once an event loses its position as a frequent event.
Ex: for frequent window of size 2
Input
WSO2 1000 1
WSO2 1000 2
ABC 700 3
XYZ 800 4
Output
ABC 700 3
The ABC event was in the frequent window and was expired upon receiving the XYZ event. If you use the default output, which is 'current events', it will output all incoming events that are selected as frequent events and put into the window.
The implementation is based on the Misra-Gries counting algorithm.
Documentation : https://docs.wso2.com/display/CEP400/Inbuilt+Windows#InbuiltWindows-frequent
Test cases : https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/test/java/org/wso2/siddhi/core/query/window/FrequentWindowTestCase.java
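For intuition, here is a minimal Python sketch of the Misra-Gries counting idea mentioned above (illustrative only, not Siddhi's actual implementation):

def misra_gries(items, k):
    # Track at most k counters; an item that loses its counter is analogous to
    # an event "expiring" from the frequent window.
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "A" is evicted when the first "D" arrives, mirroring the symbolA example above.
print(misra_gries(["A", "B", "B", "C", "C", "D", "D"], 3))  # {'B': 1, 'C': 1, 'D': 1}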
I want to create an rrd file with two data sources included. One stores the original value of the data; name it 'dc'. The other stores the accumulated result of 'dc'; name it 'total'. The expected formula is current(total) = previous(total) + current(dc). For example, if I update the data sequence (2, 3, 5, 4, 9) to the rrd file, I want 'dc' to be (2, 3, 5, 4, 9) and 'total' to be (2, 5, 15, 19, 28).
I tried to create the rrd file with the command line below. The command fails and says that PREV is not supported with a COMPUTE DS.
rrdtool create test.rrd --start 920804700 --step 300 \
DS:dc:GAUGE:600:0:U \
DS:total:COMPUTE:PREV,dc,ADDNAN \
RRA:AVERAGE:0.5:1:1200 \
RRA:MIN:0.5:12:2400 \
RRA:MAX:0.5:12:2400 \
RRA:AVERAGE:0.5:12:2400
Is there an alternative way to define the DS 'total' (DS:total:COMPUTE:PREV,dc,ADDNAN)?
rrdtool does not store 'original' values ... rather, it samples the signal you provide via the update command at the rate you defined when you set up the database ... in your case 1/300 Hz.
That said, a running total does not make much sense for such a signal ...
What you can do with a single DS, though, is build the average value over a time range and multiply the result by the number of seconds in that range, and thus arrive at the 'total'.
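As a rough sketch of that approach (assuming the python-rrdtool bindings, and the test.rrd / 'dc' names from the question):

import rrdtool

# Fetch the AVERAGE-consolidated values of 'dc' over a time range, then multiply
# the mean rate by the number of seconds in the range to approximate a 'total'.
(start, end, step), ds_names, rows = rrdtool.fetch(
    'test.rrd', 'AVERAGE', '--start', '920804700', '--end', '920806200')

dc = ds_names.index('dc')
values = [row[dc] for row in rows if row[dc] is not None]

if values:
    mean_rate = sum(values) / len(values)
    total = mean_rate * (end - start)  # average value * seconds in the range
    print(total)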
Sorry, a bit late, but this may be helpful for someone else.
It is better to use the 'mrtg-traffic-sum' package; using it against an rrd with a GAUGE DS and LAST as the RRA lets me collect monthly traffic volumes and quota limits.
E.g., here is a basic traffic chart with no traffic quota:
root@server:~# /usr/bin/mrtg-traffic-sum --range=current --units=MB /etc/mrtg/R4.cfg
Subject: Traffic total for '/etc/mrtg/R4.cfg' (1.9) 2022/02
Start: Tue Feb 1 01:00:00 2022
End: Tue Mar 1 00:59:59 2022
Interface In+Out in MB
------------------------------------------------------------------------------
eth0 0
eth1 14026
eth2 5441
eth3 0
eth4 15374
switch0.5 12024
switch0.19 151
switch0.49 1
switch0.51 0
switch0.92 2116
root@server:~#
From this you can then write a script that generates a new rrd storing these values, and presto, you have a traffic volume / quota graph.
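A rough sketch of such a script (assuming the python-rrdtool bindings; the output format is the one shown above, and the rrd file name, step and DS are made up):

import os
import re
import subprocess
import rrdtool

# Grab the per-interface monthly totals from mrtg-traffic-sum's output.
out = subprocess.run(
    ['/usr/bin/mrtg-traffic-sum', '--range=current', '--units=MB', '/etc/mrtg/R4.cfg'],
    capture_output=True, text=True, check=True).stdout
totals = dict(re.findall(r'^(\S+)\s+(\d+)\s*$', out, re.MULTILINE))

# Store one interface's volume in its own GAUGE-based rrd (one update per day).
if not os.path.exists('eth1-volume.rrd'):
    rrdtool.create('eth1-volume.rrd', '--step', '86400',
                   'DS:volume:GAUGE:172800:0:U',
                   'RRA:LAST:0.5:1:3650')
rrdtool.update('eth1-volume.rrd', 'N:%s' % totals.get('eth1', '0'))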
Example fixed traffic volume chart using GAUGE
This thread reminded me to fix this collector, which had stopped, and I just got around to posting ;)