match_id for scan operator - state

I’m trying to use the “scan” operator to analyze state transitions in Azure fleet telemetry. Here is a trimmed-down version of the data, where I’m trying to establish sessions (each session is basically the node states between “Ready” and “HumanInvestigate”). I couldn’t understand the m_id values for the rows highlighted in my output below.
datatable(Ts: timespan, nodeState:string)
[
0m, "Ready",
1m, "Raw",
2m, "Ready",
3m, "HumanInvestigate",
4m, "Ready",
5m, "Raw",
6m, "HumanInvestigate",
7m, "Ready",
8m, "Raw",
9m, "HumanInvestigate"
]
| sort by Ts asc
| scan with_match_id=m_id with
(
step s1: nodeState == "Ready";
step s2: nodeState != "Ready" and nodeState != "HumanInvestigate";
step s3: nodeState == "HumanInvestigate";
)
Here is my expected output. Can you please help on how to achieve this.

‘scan’ implements a linear state machine. It is optimized for speed and memory, so it does a single pass over the data and limits the memory used for storing the states to the number of states x the size of a single record. This limitation means that state transitions are forward only, and when sequences overlap the behavior is more complex and sometimes not intuitive. In your case the state transitions are from ‘Ready’ to ‘Raw’ to ‘HumanInvestigate’. At 0m, ‘Ready’ matches ‘s1’ and starts the first match (m_id == 0). At 1m the first match (m_id == 0) moves to step ‘s2’, and step ‘s1’ is empty, ready for a new match. At 2m, ‘Ready’ matches ‘s1’, starting a new match (m_id == 1). At this point we have two active sequences: m_id == 0 at ‘s2’ and m_id == 1 at ‘s1’. At 3m there is a match for m_id == 0, moving it from ‘s2’ to ‘s3’.
To achieve your expected output, you should use 2 states:
datatable(Ts: timespan, nodeState:string)
[
0m, "Ready",
1m, "Raw",
2m, "Ready",
3m, "HumanInvestigate",
4m, "Ready",
5m, "Raw",
6m, "HumanInvestigate",
7m, "Ready",
8m, "Raw",
9m, "HumanInvestigate"
]
| sort by Ts asc
| scan with_match_id=m_id with
(
step s1: iff(s1.nodeState == '', nodeState == "Ready", nodeState != "HumanInvestigate"); // a new sequence must start with 'Ready' (s1.nodeState is empty while step s1 holds no active sequence)
step s2: nodeState == "HumanInvestigate"; // a sequence ends with 'HumanInvestigate'
)
Ts nodeState m_id
00:00:00 Ready 0
00:01:00 Raw 0
00:02:00 Ready 0
00:03:00 HumanInvestigate 0
00:04:00 Ready 1
00:05:00 Raw 1
00:06:00 HumanInvestigate 1
00:07:00 Ready 2
00:08:00 Raw 2
00:09:00 HumanInvestigate 2
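
For readers who want to check the two-step logic outside Kusto, here is a minimal Python sketch of the same idea: a session opens on 'Ready', stays open for anything that is not 'HumanInvestigate', and closes on 'HumanInvestigate'. It only mirrors the intent of the two steps and is not how the scan operator is implemented; the function name assign_sessions is made up for illustration.

```python
def assign_sessions(states):
    """Give each state a session id; sessions open on 'Ready' and close on 'HumanInvestigate'."""
    ids = []
    current = 0           # id of the session being built (plays the role of m_id)
    session_open = False
    for state in states:
        if not session_open:
            if state != "Ready":
                ids.append(None)   # rows outside any session get no id
                continue
            session_open = True    # a new sequence must start with 'Ready'
        ids.append(current)
        if state == "HumanInvestigate":
            session_open = False   # a sequence ends with 'HumanInvestigate'
            current += 1
    return ids


states = ["Ready", "Raw", "Ready", "HumanInvestigate", "Ready", "Raw",
          "HumanInvestigate", "Ready", "Raw", "HumanInvestigate"]
print(assign_sessions(states))
```

On the sample data this yields the same grouping as the m_id column above.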

Filter data using IF Statement in Tableau

I have a data source in tableau that looks something similar to this:
SKU Backup_Storage
A 5
A 1
B 2
B 3
C 1
D 0
I'd like to create a calculated field in tableau that performs a SUM calculation IF the SKU column contains the string 'A' or 'D' , and to perform an AVERAGE calculation if the SKU column contains the letters 'C' or 'B'
This is what I am doing:
IF CONTAINS(ATTR([SKU]),'A') or
CONTAINS(ATTR([SKU]),'D')
THEN SUM([Backup_Storage])
ELSEIF CONTAINS(ATTR([SKU]),'B') or
CONTAINS(ATTR([SKU]),'C')
THEN AVG([Backup_Storage])
END
UPDATE - desired output would be:
SKU BACKUP
A, D 6 (This is the SUM OF A and D)
B, C 2 (This is the AVG of B and C)
The calculation above shows as valid, however, I see NULLS in my data source table.
Any suggestion is appreciated.
I have named the calculated field:
SKU_FILTER_CALCULATION
Basically, an IF THEN ELSE condition works when each test is either TRUE or FALSE. Your condition is not a proper use case for IF THEN ELSE because SKU can take any of its possible values in each row. Look at it like this:
your data
SKU Backup_Storage
A 5
A 1
B 2
B 3
C 1
D 0
Let's name your calc field CF. In the first row SKU is A, so CF outputs SUM(5) = 5. For the second row it outputs SUM(1) = 1; for the third and later rows it outputs AVG(2) = 2, AVG(3) = 3, AVG(1) = 1 and SUM(0) = 0 respectively. All these values simply equal [Backup_Storage], and I'm sure that is not what you are trying to get.
If instead you are trying to get sum(5,1,0) + avg(2,3,1) (I have assumed + here), which equals 8, i.e. one single value for the whole dataset, please proceed with this calculated field:
SUM(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Backup_Storage] END)
+
AVG(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Backup_Storage] END)
This will return 8 when placed in the view.
Needless to say, if you want any operator other than +, you have to change it in CF accordingly.
As per your edited post, I suggest a different approach. Create groups for the SKUs on which you want to perform different aggregations.
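
As a quick cross-check of that arithmetic outside Tableau, here is a small Python sketch using the sample rows from the question. The function name combined_total is hypothetical, and the substring test only imitates CONTAINS.

```python
rows = [("A", 5), ("A", 1), ("B", 2), ("B", 3), ("C", 1), ("D", 0)]


def combined_total(rows):
    """SUM over SKUs containing A or D, plus AVG over SKUs containing B or C."""
    sum_part = [value for sku, value in rows if "A" in sku or "D" in sku]
    avg_part = [value for sku, value in rows if "B" in sku or "C" in sku]
    # Rows that fail a test are simply left out, like the missing ELSE branch.
    return sum(sum_part) + sum(avg_part) / len(avg_part)


print(combined_total(rows))
```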
Step-1 Create groups on SKU field. I have named this group as SKUG
Step-2 create a calculated field CF as
SUM(ZN(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Backup_Storage] END))
+
AVG(ZN(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Backup_Storage] END))
Step-3 get your desired view
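
To see why ZN matters once the view is split by group, here is a rough Python sketch of what CF evaluates to per group. It assumes a group labelled "A, D" and another labelled "B, C", mirroring the desired output in the question; the names grouped_cf and group_of are made up for illustration. Within each group, rows that fail a test contribute 0 instead of NULL, so the unused half of the formula adds 0 rather than turning the whole result into NULL.

```python
rows = [("A", 5), ("A", 1), ("B", 2), ("B", 3), ("C", 1), ("D", 0)]


def group_of(sku):
    """Hypothetical grouping: A and D together, everything else with B and C."""
    return "A, D" if sku in ("A", "D") else "B, C"


def grouped_cf(rows):
    """Evaluate SUM(ZN(...)) + AVG(ZN(...)) separately inside each group."""
    groups = {}
    for sku, value in rows:
        groups.setdefault(group_of(sku), []).append((sku, value))
    result = {}
    for name, members in groups.items():
        # ZN: a row that fails the IF test counts as 0 instead of NULL.
        sum_terms = [v if s in ("A", "D") else 0 for s, v in members]
        avg_terms = [v if s in ("B", "C") else 0 for s, v in members]
        result[name] = sum(sum_terms) + sum(avg_terms) / len(avg_terms)
    return result


print(grouped_cf(rows))
```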
Good Luck

A list containing NOBODY as one of its entities

I have a sub-routine in my code where each patch is asked to pick its closest & farthest turtle based on certain conditions. I keep getting this error after a couple of ticks:
OF expected input to be a turtle agentset or turtle but got NOBODY instead.
error while patch 0 30 running OF
called by procedure UPDATE-SUPPORT
called by procedure GO
called by Button 'Go'
There are two other routines where a turtle dies or is born depending on a few other metrics that are measured. I am not able to debug the code, but what I have figured out so far is that it happens after a turtle dies or is born.
Below is the code based on which the closest & farthest turtles are assigned at each tick.
to update-support
ask patches [
let old-total sum [my-old-size] of parties
set f-party []
set h-party []
set party-list (sort parties)
set voteshare-list n-values length(party-list) [ (([my-old-size] of party ? ) + 1 ) / ( old-total + 1 ) ]
set party-citizen-dist n-values length(party-list) [ ( distance party ? ) ^ 2 ]
set f-list n-values length(party-list) [ ( ( 1 / ( item ? voteshare-list ) ) * ( item ? party-citizen-dist ) ) ]
set f-index position (min f-list) f-list
set h-list n-values length(party-list) [ ( ( item ? voteshare-list ) * ( item ? party-citizen-dist ) ) ]
set h-index position (max h-list) h-list
set f ((-1) * (min f-list))
set h max h-list
set f-party lput item f-index party-list f-party
set h-party lput item h-index party-list h-party
set closest-party first f-party
set farthest-party first h-party
]
end
After a turtle dies, when I inspected the patch which was throwing the error, I found nobody as an element in the list. The error is highlighted in the party ? part of the line that creates voteshare-list in the above code.
When I inspected the patch throwing the error, party-list, which is the sorted list of all the current parties, was showing this:
Party-list: [(party 0) nobody (party 2)]
and my f-party list just had [(nobody)]
Has anyone faced such a situation?
Below is the death & birth routine:
to party-death
ask parties [if (fitness < survival-threshold and count parties > 2)
[ die
] update-support
]
end
to party-birth
ifelse (endogenous-birth? = true)
[ ask one-of patches with [distancexy 0 0 < 30]
[ if (random-float 1 < (kpi * 1000)) [sprout-parties 1 [initialize-party] ]]
]
[ create-parties 1 [set heading random-float 360 jump random-float 30 initialize-party] ]
update-support
end

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half =0
Number of sequences of 2 in the second half =1
Number of sequences of 3 in the first half =0
Number of sequences of 3 in the second half =0
Number of sequences of >3 in the first half =1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if a sequence started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
gen temprow`i' =.
egen rowsum = rownonmiss(seq1-seq`i') //count number of occurrences
replace temprow`i'=rowsum
mvdecode seq1-seq`i',mv(1) if rowsum==2
drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable all that concatenates all moves (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". ".11." is allowed, so either includes both. We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but not much worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above cannot be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
replace length1 = length(work)
local pattern : di _dup(`j') "1"
replace work = subinstr(work, "`pattern'", "", .)
replace length2 = length(work)
if `j' >= 4 {
replace count_4 = count_4 + (length1 - length2) / `j'
}
else if `j' == 3 {
replace count_3 = count_3 + (length1 - length2) / 3
}
else if `j' == 2 {
replace count_2 = count_2 + (length1 - length2) / 2
}
}
The important details here are:
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and found that the length decreased by 4, I just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.
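
The same delete-and-measure idea can be sketched in Python for anyone who wants to test it on a few histories first. This is only an illustration of the technique, not a translation of the Stata code; the function name count_runs and the bucket labels are made up here, and the history is assumed to be a string of "1" and "." characters as produced by concatenating the move variables.

```python
def count_runs(history):
    """Count runs of '1' of length 2, 3 and 4 or more by deleting patterns, longest first."""
    counts = {"2": 0, "3": 0, "4+": 0}
    work = history                      # work on a copy; the original string is left alone
    for j in range(len(history), 1, -1):
        before = len(work)
        work = work.replace("1" * j, "")
        found = (before - len(work)) // j   # change in length / pattern length
        if j >= 4:
            counts["4+"] += found
        elif j == 3:
            counts["3"] += found
        else:
            counts["2"] += found
    return counts


moves = "1111..11"
half = len(moves) // 2
print(count_runs(moves[:half]), count_runs(moves[half:]))
```

Splitting the string before counting handles the first and second halves separately, as suggested above; a run that straddles the midpoint is then counted as two shorter pieces.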

Couchdb: relational database capabilities

Let's assume that I have a list of 239800 documents like the following:
{
name: somename,
data:{age:someage, income:somevalue, height:someheight, dumplings_consumed:somenumber}
}
I know that I can index the docs by doc.data.age, doc.data.income, height and dumplings_consumed, and get a list of docs after giving a range for each parameter, but how can I get a result for a query like the following:
List of the docs where age is between 25 and 30, income is less than $10 and height is more than 7ft?
Is there a way to get multiple indexes working?
Assuming all three of your example query parameters need to remain dynamic, you would not be able to do such a join with a single CouchDB query. The simplest strategy would be to emit an index that lets you narrow down the "biggest" aspect/dimension of your data, and then filter the rest out in your app's code or a _list function.
Now, for filtering on two aspects of numeric data, GeoCouch could potentially be used — it provides a generic 2-dimensional index, not just limited to latitude and longitude! So you would emit points that contain (say) "age" and "income" mapped to x and y. You'd then query a bbox with first two "between" parameters, and then you'd only have to filter out height on the app side.
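
The "narrow on one dimension, then filter the rest in your app" strategy can be sketched in plain Python. Nothing here talks to CouchDB: the sorted list stands in for a view keyed on age, query_age_range stands in for a startkey/endkey range query, and all names are hypothetical. The two documents are the sample ones used further down in this thread.

```python
from bisect import bisect_left, bisect_right

docs = [
    {"name": "joe", "data": {"age": 6, "income": 10, "height": 20, "dumplings_consumed": 5}},
    {"name": "mike", "data": {"age": 27, "income": 9, "height": 78, "dumplings_consumed": 256}},
]

# Stand-in for a view that emits doc.data.age as the key, kept sorted by key.
index = sorted((doc["data"]["age"], pos) for pos, doc in enumerate(docs))
keys = [key for key, _ in index]


def query_age_range(low, high):
    """Return docs whose age lies in [low, high], like a key-range query on the view."""
    start, end = bisect_left(keys, low), bisect_right(keys, high)
    return [docs[pos] for _, pos in index[start:end]]


def find(low_age, high_age, max_income, min_height):
    """Narrow by age with the index, then filter income and height in app code."""
    return [
        doc for doc in query_age_range(low_age, high_age)
        if doc["data"]["income"] < max_income and doc["data"]["height"] > min_height
    ]


print([doc["name"] for doc in find(25, 30, 10, 7)])
```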
Let's have a look at:
http://guide.couchdb.org/draft/views.html
You can search with any expression you want (javascript code) and index documents with it.
For example, by means of Futon, you can create a test database and add the two following documents based on your question:
{ "_id": "36fef0472fb7eec035c87e4f4b0381bf", "_rev": "12-4ef9014a3670a7e6acd58ad92d26fc1e", "data": { "age": 6, "income": 10, "height": 20, "dumplings_consumed": 5 }, "name": "joe" }
{ "_id": "36fef0472fb7eec035c87e4f4b038ffa", "_rev": "8-f0a0a51b830bf3d4bc3ec5697440792f", "name": "mike", "data": { "age": 27, "income": 9, "height": 78, "dumplings_consumed": 256 } }
You just have to go to your database still with Futon and create a temporary view with the following Map function:
function(doc) {
  if (doc.name && doc.data && doc.data.age && doc.data.income && doc.data.height) {
    if (doc.data.age > 25 && doc.data.age < 30 && doc.data.income < 10 && doc.data.height > 7) {
      emit(doc.name, doc.data);
    }
  }
}
Just run and you get the result.
With a permanent view, the first time the request is executed the internal B-tree is built, which takes time. Further executions should be very fast even if documents are added to the database (as long as their number is a small fraction of the total).

My whole parsing logic suffers because of null character how to resolve this

This is DATA 1
RE00002200050046\00 0.00 0.1 0.125.9\0#####- 14 0##### \0 0##### 141.0\004.00 0: 00.000.0\00 4: 011:27 0: 015:27#\0###########2.00.0\0
Another data that i have is
This is DATA 2
RE000022601\0500460 0.00 0.1\0 0.236.8####\0# 57- 2#####- 3#####\0- 601.004.0\00 4: 00.000.\000 4: 013:37 0\0: 017:37#####\0#######2.00.\00
The above data is the response I get from a hospital machine. I have to parse the above values and fill them in according to the given format:
BYTEs 2 2 4 128 2 2
+---------+--------+------------+-----------------+--------+-------+
| RE | 00 | machine no| Data part | Check | CRC |
| | | | | sum | |
+---------+--------+------------+-----------------+--------+-------+
As you can see, in DATA 1 my data part begins at "000500.."
and in DATA 2 my data part begins at "601\0500..."
While parsing I ran into a problem: there is a field named "Blood pump flow" whose length is 3 bytes. From DATA 1 we get its value as "46", while from
DATA 2 I get its value as "460".
Its actual value should be "460".
If I get data like DATA 1, my whole parsing logic suffers: because "Blood pump flow" is 3 bytes, I get the value "46\0" and the "0" is added to the next field, while "Blood pump flow" should be "460".
The above is just one case; it happens many times for other fields too.
How can I resolve this problem?
DATA 1 and DATA 2 are the binary data that I get from the machine.
It seems from your example, and it is confirmed by your own comment, that you know the field sizes in your format. So you must treat this input as binary input. Use std::istream::read function.
char header[14]; // std::istream::read takes a char*, so the buffer is char rather than unsigned char
is.read(header, 14);
if (is.gcount() == 14)
{
// decide from the header contents whether you read DATA1 or DATA2
if (header is for DATA1)
// read rest of input as DATA1
else if (header is for DATA2)
// read rest of input as DATA2
else
// report error
}
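
To illustrate what treating the input as binary with known field sizes means, here is a small Python sketch that slices a frame using the byte counts from the layout table in the question (2, 2, 4, 128, 2, 2). The frame built below is synthetic and only stands in for real machine output; parse_frame and the field names are made up for illustration. The point is that a NUL byte is just another byte when the buffer is read by length rather than as a C string.

```python
import struct

# Field sizes taken from the layout table: marker, code, machine no, data, checksum, CRC.
FRAME = struct.Struct("=2s2s4s128s2s2s")


def parse_frame(raw):
    """Split one fixed-size frame into its fields; embedded NUL bytes are kept as-is."""
    if len(raw) != FRAME.size:
        raise ValueError(f"expected {FRAME.size} bytes, got {len(raw)}")
    marker, code, machine_no, data, checksum, crc = FRAME.unpack(raw)
    return {"marker": marker, "code": code, "machine_no": machine_no,
            "data": data, "checksum": checksum, "crc": crc}


# Synthetic frame (not real machine data) with a NUL inside the data part.
payload = b"00050046\x000".ljust(128, b" ")
frame = b"RE" + b"00" + b"0022" + payload + b"AB" + b"CD"

fields = parse_frame(frame)
print(fields["machine_no"], fields["data"][:10])
```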