Categorize variable with zero values

I am trying to categorize a variable Y, which has 82 values equal to 0, 118 values between 1 and 6, and no values between 7 and 12, 13 and 18, or 19 and 24.
I tried the following code:
gen X = .
replace X = 1 if Y >= 1 & Y <= 6
replace X = 2 if Y >= 7 & Y <= 12
replace X = 3 if Y >= 13 & Y <= 18
replace X = 4 if Y >= 19 & Y <= 24
I wish to see X categorized as 0, 1-6, 7-12, 13-18, 19-24, instead of just 0 and 1.
Current Results:
tab X
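          X |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         82       41.00       41.00
          1 |        118       59.00      100.00
------------+-----------------------------------
      Total |        200      100.00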
* Example generated by -dataex-. To install: ssc install dataex
clear
input int FID byte Y float X
150 0 0
17 0 0
95 1 1
0 0 0
18 0 0
1 0 0
96 0 0
54 0 0
172 3 1
97 0 0
57 1 1
19 0 0
98 1 1
151 0 0
99 1 1
2 3 1
197 1 1
55 2 1
58 1 1
100 0 0
end

Your code does serve your purpose: variable X holds the right set of categories for variable Y, as you intended.
The fact that you only see X taking the values 0 and 1 simply means that the data have no observations with Y falling in the other categories. If any Y belonged to those categories, the corresponding values of X would show up.
A direct way to achieve this output is shown below. Just give it a try.
egen YCat = cut(Y), at(0,1,7,13,19,25)

Your code looks fine, except crucially that nothing in your code yields 0 as a result.
However, I disagree with @Romalpa Akzo on recommending egen, cut(). Even an experienced Stata user is unlikely to remember the exact rules used by that function of that command.
Are lower bounds >= or >, in particular? What happens above and below the extreme values mentioned? What if you don't want results 1 up?
I prefer explicit code.
Here's another way to do it. With a programmer's understanding that cond(A, B, C) yields B if A is true (non-zero) and C if A is false (zero), then we can go
clear
set obs 26
generate Y = _n - 1
generate X = cond(Y > 24, ., ///
cond(Y >= 19, 4, ///
cond(Y >= 13, 3, ///
cond(Y >= 7, 2, ///
cond(Y >= 1, 1, 0 )))))
tabulate Y X , missing
| X
Y | 0 1 2 3 4 . | Total
-----------+------------------------------------------------------------------+----------
0 | 1 0 0 0 0 0 | 1
1 | 0 1 0 0 0 0 | 1
2 | 0 1 0 0 0 0 | 1
3 | 0 1 0 0 0 0 | 1
4 | 0 1 0 0 0 0 | 1
5 | 0 1 0 0 0 0 | 1
6 | 0 1 0 0 0 0 | 1
7 | 0 0 1 0 0 0 | 1
8 | 0 0 1 0 0 0 | 1
9 | 0 0 1 0 0 0 | 1
10 | 0 0 1 0 0 0 | 1
11 | 0 0 1 0 0 0 | 1
12 | 0 0 1 0 0 0 | 1
13 | 0 0 0 1 0 0 | 1
14 | 0 0 0 1 0 0 | 1
15 | 0 0 0 1 0 0 | 1
16 | 0 0 0 1 0 0 | 1
17 | 0 0 0 1 0 0 | 1
18 | 0 0 0 1 0 0 | 1
19 | 0 0 0 0 1 0 | 1
20 | 0 0 0 0 1 0 | 1
21 | 0 0 0 0 1 0 | 1
22 | 0 0 0 0 1 0 | 1
23 | 0 0 0 0 1 0 | 1
24 | 0 0 0 0 1 0 | 1
25 | 0 0 0 0 0 1 | 1
-----------+------------------------------------------------------------------+----------
Total | 1 6 6 6 6 1 | 26
Naturally, you could write the whole command on one line, but many will find the multiline layout easier to understand and to debug. With nested function calls, each new condition implies a promise to close all the parentheses at the end.
Multiple commands like those in the question are preferred by many Stata users too, so taste is behind many choices.
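For readers more at home in Python than Stata, the nested cond() logic above can be sketched as a chain of explicit tests. This is only an illustration of the same cut points; the function name categorize is hypothetical, and None stands in for Stata's missing value.

```python
def categorize(y):
    """Map y to the category codes used above (hypothetical helper).

    Mirrors the nested cond() call: the tests run from the highest
    cut point down, so each lower bound is inclusive (>=).
    """
    if y > 24:
        return None  # stands in for Stata's missing value (.)
    if y >= 19:
        return 4
    if y >= 13:
        return 3
    if y >= 7:
        return 2
    if y >= 1:
        return 1
    return 0


# Same check as the tabulate call above: one code per value 0..25.
codes = [categorize(y) for y in range(26)]
```

As in the Stata version, the order of the tests matters: each branch relies on the earlier ones having already excluded the higher ranges.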

Related

Retrieve multiple ArrayFire subarrays from min/max data points

I have an array with sections of touching values in it. For example:
0 0 1 0 0 0 0 0 0 0
0 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2 2 2 0 0
0 0 0 0 0 0 0 2 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
From this, I created a set of af::arrays: minX, maxX, minY, maxY. These define the box that encloses each group.
So for this example:
minX would be: [1,5,2] // 1 for label(1), 5 for label(2) and 2 for label(3)
maxX would be: [3,7,2] // 3 for label(1), 7 for label(2) and 2 for label(3)
minY would be: [0,3,7] // 0 for label(1), 3 for label(2) and 7 for label(3)
maxY would be: [1,4,9] // 1 for label(1), 4 for label(2) and 9 for label(3)
So if you take the i-th element from each of those arrays, you get the upper-left/lower-right bounds of a box that encloses the corresponding label.
I would like to use these values to pull out subarrays from this larger array. My goal is to put the values enclosed in the boxes into a flat list. I have also calculated, in GPU memory, how many entries I would need for each box using the max/min X/Y values. So in this example, the flat list should be:
result=[0 1 0 1 1 1 2 2 2 0 0 2 3 3 3]
where the first 6 entries are from the box
______
|0 1 0 |
|1 1 1 |
------
the second 6 entries are from the box
______
|2 2 2 |
|0 0 2 |
------
and the final three entries are from the box
___
| 3 |
| 3 |
| 3 |
---
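To pin down the expected output, here is a plain Python reference that runs on the CPU. It is not an ArrayFire solution and does not address the GPU indexing problem; it only shows the row-major box extraction described above. The function name flatten_boxes is hypothetical.

```python
# The 10x10 labelled array from the example, as a list of rows.
grid = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2, 2, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 0, 0, 0],
]
min_x, max_x = [1, 5, 2], [3, 7, 2]
min_y, max_y = [0, 3, 7], [1, 4, 9]


def flatten_boxes(grid, min_x, max_x, min_y, max_y):
    """Concatenate each bounding box, read row by row, into one flat list."""
    flat = []
    for x0, x1, y0, y1 in zip(min_x, max_x, min_y, max_y):
        for y in range(y0, y1 + 1):          # bounds are inclusive
            flat.extend(grid[y][x0:x1 + 1])
    return flat


result = flatten_boxes(grid, min_x, max_x, min_y, max_y)
```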
I cannot figure out how to index into this af::array with min/max values that reside in GPU memory (and I do not want to transfer them to the CPU). I was trying to see if gfor/seq would work for me, but it appears that af::seq cannot use array data, and I could not get anything I tried with af::index to work either.
I am able to change how I represent min/max (I could store indices for upper left/lower right) but my main goal is to do this efficiently on the GPU without moving data back and forth between the GPU and CPU.
How can this be achieved efficiently with ArrayFire?
Thank you for your help

How did you get this far? Which language are you using?
I guess you could tile the results along a third dimension to handle each region separately and end up with min/max vectors in GPU memory.

Printing an std::array gives random values

I am trying to print out an std::array as seen below. The output is supposed to consist of only booleans, but there seem to be other numbers in the output as well (also below). I've tried printing the offending elements on their own, but then I get their actual value, which is weird.
My main function:
float f(float x, float y)
{
    return x * x + y * y - 1;
}
int main()
{
    std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> temp = ConvertToBinaryImage(&f);
    for(int i = 0; i < (int)temp.size(); ++i)
    {
        for(int j = 0; j < (int)temp[0].size(); ++j)
        {
            std::cout << temp[i][j] << " ";
        }
        std::cout << std::endl;
    }
}
The function that sets the array:
std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> ConvertToBinaryImage(float(*func)(float, float))
{
    std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> result;
    for(float x = X_MIN; x <= X_MAX; x += STEP_SIZE)
    {
        for(float y = Y_MIN; y <= Y_MAX; y += STEP_SIZE)
        {
            int indx = ARRAY_SIZE_X - (x - X_MIN) * STEP_SIZE_INV;
            int indy = ARRAY_SIZE_Y - (y - Y_MIN) * STEP_SIZE_INV;
            result[indx][indy] = func(x, y) < 0;
        }
    }
    return result;
}
The constants
#define X_MIN -1
#define Y_MIN -1
#define X_MAX 1
#define Y_MAX 1
#define STEP_SIZE_INV 10
#define STEP_SIZE (float)1 / STEP_SIZE_INV
#define ARRAY_SIZE_X (X_MAX - X_MIN) * STEP_SIZE_INV
#define ARRAY_SIZE_Y (Y_MAX - Y_MIN) * STEP_SIZE_INV
My output:
184 225 213 111 0 0 0 0 230 40 212 111 0 0 0 0 64 253 98 0
0 0 0 0 1 0 1 0 1 1 1 1 6 1 0 0 168 0 0 0
0 183 213 111 0 0 0 0 0 0 0 0 0 0 0 0 9 242 236 108
0 0 0 1 64 1 1 0 1 1 1 1 240 1 1 1 249 1 0 0
0 21 255 0 0 0 0 0 98 242 236 108 0 0 0 0 0 0 0 0
0 0 0 1 128 1 1 0 1 1 1 1 128 1 1 1 0 1 1 0
0 1 255 1 0 1 1 0 1 1 1 1 0 1 1 1 31 1 1 1
0 0 0 0 184 225 213 111 0 0 0 0 2 0 0 0 0 0 0 0
9 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 64 1 1 1
0 1 0 1 64 1 1 0 1 1 1 1 96 1 1 1 249 1 1 1
0 1 213 1 0 1 1 0 1 1 1 1 0 1 1 1 32 1 1 1
0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1
0 21 255 0 0 0 0 0 80 59 117 0 0 0 0 0 32 112 64 0
0 1 0 1 17 1 1 16 1 1 1 1 104 1 1 1 0 1 1 1
0 0 144 1 249 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0
0 0 0 1 80 1 1 0 1 1 1 1 24 1 1 1 0 1 1 0
0 0 0 0 0 0 0 0 17 0 1 16 0 0 0 0 112 7 255 0
0 0 0 1 134 1 1 30 1 1 1 1 8 1 1 1 0 1 0 0
0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 32 0 0 0
0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 0

Floating point maths will often not produce exact results; see "Is floating point math broken?".
If we print out the values of indx and indy:
20, 20
20, 19
20, 18
20, 17
20, 15
20, 14
20, 13
20, 13
20, 11
20, 10
20, 9
20, 9
20, 8
20, 6
20, 5
20, 5
20, 3
20, 3
20, 1
20, 1
19, 20
19, 19
19, 18
19, 17
...
You can see that you are writing to indexes with the value 20, which is out of bounds of the array, and that you aren't writing to every index, which leaves some of the array elements uninitialised. Although booleans are normally only true or false, they are usually stored as a byte, which can hold values between 0 and 255; reading the uninitialised values is undefined behaviour.
We can fix your code in this particular instance by calculating the indexes a little more carefully:
int indx = std::clamp(int(round(ARRAY_SIZE_X - (x - X_MIN) * STEP_SIZE_INV)), 1, ARRAY_SIZE_X)-1;
int indy = std::clamp(int(round(ARRAY_SIZE_Y - (y - Y_MIN) * STEP_SIZE_INV)), 1, ARRAY_SIZE_Y)-1;
There are two fixes here. You were generating values between 1 and 20; the -1 reduces this to 0 to 19. The round solves the issue of not using all the indexes (assigning to an int simply truncates). The clamp ensures the values are always in range (though in this case the calculations already work out to be in range). Note that std::clamp requires C++17 and the <algorithm> header.
As you always want to write to every pixel, a better solution is to iterate over the values of indx and indy and calculate the values of x and y from the indices:
for (int indx = 0; indx < ARRAY_SIZE_X; indx++)
{
    float x = X_MIN - (indx - ARRAY_SIZE_X) * STEP_SIZE;
    for (int indy = 0; indy < ARRAY_SIZE_Y; indy++)
    {
        float y = Y_MIN - (indy - ARRAY_SIZE_Y) * STEP_SIZE;
        result[indx][indy] = func(x, y) < 0;
    }
}
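The index-driven loop is language independent. As a rough illustration, here is the same idea in Python, with constants mirroring the macros from the question; the names convert_to_binary_image and STEPS_PER_UNIT are hypothetical. Because the loops run over integer indices, every cell is written exactly once and no index can fall out of range.

```python
X_MIN, X_MAX = -1.0, 1.0
Y_MIN, Y_MAX = -1.0, 1.0
STEPS_PER_UNIT = 10                                # plays the role of STEP_SIZE_INV
SIZE_X = int((X_MAX - X_MIN) * STEPS_PER_UNIT)     # 20 cells
SIZE_Y = int((Y_MAX - Y_MIN) * STEPS_PER_UNIT)     # 20 cells


def f(x, y):
    return x * x + y * y - 1


def convert_to_binary_image(func):
    """Fill a SIZE_X by SIZE_Y grid by iterating over indices, not coordinates."""
    step = 1.0 / STEPS_PER_UNIT
    result = [[False] * SIZE_Y for _ in range(SIZE_X)]
    for ix in range(SIZE_X):
        # Coordinate derived from the index, as in the C++ loop above.
        x = X_MIN - (ix - SIZE_X) * step
        for iy in range(SIZE_Y):
            y = Y_MIN - (iy - SIZE_Y) * step
            result[ix][iy] = func(x, y) < 0
    return result


image = convert_to_binary_image(f)
```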

time series sliding window with occurrence counts

I am trying to get a count between two timestamped values:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time range into windows: (1 + 30) / 30.
Then I want to know how many A, B and C occur in each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrence counts.
The problem is that breaking down the data takes too long, because the code iterates through the whole master table for every window to slice the data, even though the data is already sorted.
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while ( wstart <= maximum ):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            #print (row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use the map function, and I want to prevent Python from looping through the whole table every time.
This is part of a big data set and it is taking days to finish.
Thank you.

There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
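or using sklearn.feature_extraction.text.CountVectorizer (note that pd.SparseDataFrame was removed in pandas 1.0, so this snippet needs an older pandas version):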
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
index=df['time'].unique(),
columns=df['letter'].unique(),
default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using a pure pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
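If pandas is not an option, the same table can be built in a single pass with the standard library. The sketch below is an alternative to the approaches above, not a restatement of them, and it assumes integer timestamps and a window width of 1 as in the example. The names events and count_per_window are hypothetical.

```python
from collections import Counter

# The sample data from the question as (time, letter) pairs.
events = [(1, "A"), (4, "B"), (5, "C"), (9, "C"), (18, "B"), (30, "A"), (30, "B")]
LETTERS = ("A", "B", "C")


def count_per_window(events, start, stop, width=1):
    """Count letters per window in one pass over the events.

    Each event is placed directly into its window by integer division,
    so the full table is never rescanned for every window.
    """
    n_windows = (stop - start) // width + 1
    counts = [Counter() for _ in range(n_windows)]
    for time, letter in events:
        counts[(time - start) // width][letter] += 1
    # One row per window, one column per letter (missing letters count as 0).
    return [[c[letter] for letter in LETTERS] for c in counts]


table = count_per_window(events, start=1, stop=30)
```

The work is proportional to the number of events plus the number of windows, rather than their product as in the nested loops of the question.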

How to find TP,TN, FP and FN values from 8x8 Confusion Matrix

I have a confusion matrix as follows:
a b c d e f g h <-- classified as
1086 7 1 0 2 4 0 0 | a
7 1064 8 6 0 2 2 0 | b
0 2 1091 2 3 0 1 1 | c
0 8 0 1090 1 1 0 0 | d
1 1 1 1 597 2 2 0 | e
4 2 1 0 3 1089 0 1 | f
0 2 1 3 0 0 219 0 | g
0 0 1 0 1 4 1 443 | h
How do I find the true positive, true negative, false positive and false negative values from this confusion matrix?
Weka gave me a TP Rate; is that the same as the true positive value?

You have 8 classes in total: a, b, c, d, e, f, g, h. You will thus get 8 different TP, FP, FN, and TN numbers. For instance, for class a:
TP (instance belongs to a, classified as a) = 1086
FP (instance belongs to others, classified as a) = 7 + 0 + 0 + 1 + 4 + 0 + 0 = 12
FN (instance belongs to a, classified as others) = 7 + 1 + 0 + 2 + 4 + 0 + 0 = 14
TN (instance belongs to others, classified as others) = Total instance - (TP + FP + FN)
TP rate is not TP. It is the recall, TP / (TP + FN).
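The per-class counts can be computed for every class at once. The following is a minimal sketch, assuming (as in the Weka output) that rows are actual classes and columns are predicted classes; the function name per_class_counts is hypothetical.

```python
LABELS = "abcdefgh"
MATRIX = [
    [1086, 7, 1, 0, 2, 4, 0, 0],
    [7, 1064, 8, 6, 0, 2, 2, 0],
    [0, 2, 1091, 2, 3, 0, 1, 1],
    [0, 8, 0, 1090, 1, 1, 0, 0],
    [1, 1, 1, 1, 597, 2, 2, 0],
    [4, 2, 1, 0, 3, 1089, 0, 1],
    [0, 2, 1, 3, 0, 0, 219, 0],
    [0, 0, 1, 0, 1, 4, 1, 443],
]


def per_class_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class index k.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # column k without the diagonal
    fn = sum(matrix[k]) - tp                  # row k without the diagonal
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


for k, label in enumerate(LABELS):
    tp, fp, fn, tn = per_class_counts(MATRIX, k)
    tp_rate = tp / (tp + fn)                  # recall, i.e. Weka's "TP Rate"
    print(label, tp, fp, fn, tn, round(tp_rate, 3))
```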

a b c d e f g h <-- classified as
1086 7 1 0 2 4 0 0 | a
7 1064 8 6 0 2 2 0 | b
0 2 1091 2 3 0 1 1 | c
0 8 0 1090 1 1 0 0 | d
1 1 1 1 597 2 2 0 | e
4 2 1 0 3 1089 0 1 | f
0 2 1 3 0 0 219 0 | g
0 0 1 0 1 4 1 0 | h

generating combinations of combinations

I'm trying to generate code which will take the components (i.e, a-f) of various combination permutations (combo) one, two, three, or four units long using these six components and provide various non duplicating combinations of combinations (combo.combo) which contain all of the components (i.e., [ab + cdef and ac + bde + f] but not [ae + bc + df and aef + bc + d]).
It would be nice if this code could allow me to 1) input the number of components, 2) input the min and max unit length per combo, 3) input the min and max number of combos per combo.combo, and 4) randomize the output list of combo.combos.
Maybe start with some kind of iteration loop to generate each version of the 720 possible component combinations (a-f) and then start pruning that list based on the set limiting parameters? I've got some working knowledge of python and will get started, but any tips or suggestions are most welcome.
combo.combo a b c d e f
a.bcdef 1 1 1 1 1 1
ab.cdef 1 1 1 1 1 1
abc.def 1 1 1 1 1 1
abcd.ef 1 1 1 1 1 1
abcde.f 1 1 1 1 1 1
a.b.cdef 1 1 1 1 1 1
a.bc.def 1 1 1 1 1 1
a.bcd.ef 1 1 1 1 1 1
a.bcde.f 1 1 1 1 1 1
ab.c.def 1 1 1 1 1 1
I've found a lot of code which will generate combinations and permutations, but not combinations of combinations. I've included a binary matrix for the combination components, but I am stuck on how to proceed from here, or whether this matrix is a false start (although a helpful visual aid).
combo a b c d e f
a 1 0 0 0 0 0
b 0 1 0 0 0 0
c 0 0 1 0 0 0
d 0 0 0 1 0 0
e 0 0 0 0 1 0
f 0 0 0 0 0 1
ab 1 1 0 0 0 0
ac 1 0 1 0 0 0
ad 1 0 0 1 0 0
ae 1 0 0 0 1 0
af 1 0 0 0 0 1
bc 0 1 1 0 0 0
bd 0 1 0 1 0 0
be 0 1 0 0 1 0
bf 0 1 0 0 0 1
cd 0 0 1 1 0 0
ce 0 0 1 0 1 0
cf 0 0 1 0 0 1
de 0 0 0 1 1 0
df 0 0 0 1 0 1
ef 0 0 0 0 1 1
abc 1 1 1 0 0 0
abd 1 1 0 1 0 0
abe 1 1 0 0 1 0
abf 1 1 0 0 0 1
acd 1 0 1 1 0 0
ace 1 0 1 0 1 0
acf 1 0 1 0 0 1
ade 1 0 0 1 1 0
adf 1 0 0 1 0 1
aef 1 0 0 0 1 1
bcd 0 1 1 1 0 0
bce 0 1 1 0 1 0
bcf 0 1 1 0 0 1
bde 0 1 0 1 1 0
bdf 0 1 0 1 0 1
bef 0 1 0 0 1 1
cde 0 0 1 1 1 0
cdf 0 0 1 1 0 1
cef 0 0 1 0 1 1
def 0 0 0 1 1 1
abcd 1 1 1 1 0 0
abce 1 1 1 0 1 0
abcf 1 1 1 0 0 1
abde 1 1 0 1 1 0
abdf 1 1 0 1 0 1
abef 1 1 0 0 1 1
acde 1 0 1 1 1 0
acdf 1 0 1 1 0 1
acef 1 0 1 0 1 1
adef 1 0 0 1 1 1
bcde 0 1 1 1 1 0
bcdf 0 1 1 1 0 1
bcef 0 1 1 0 1 1
bdef 0 1 0 1 1 1
cdef 0 0 1 1 1 1

The approach which first comes to mind is this:
1. Generate all the combinations using the given components (which you already did :) ).
2. Treat the resulting combinations as a new set of components (so instead of a, b,...,f your set will contain a, ab, abc, ...).
3. Generate all the combinations from the second set.
4. From the new set of combinations, only keep those which satisfy your condition (it's not very clear from your example what the constraint is).
This, of course, has sky-high exponential complexity, since you'll have to backtrack twice and step 3 has way more possibilities.
It's very possible that there's a more efficient algorithm, starting from the constraint ("non duplicating combinations of combinations which contain all of the components").
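One way to start from the constraint is to enumerate set partitions directly, so that every result already contains each component exactly once. The sketch below assumes that this is what "non duplicating combinations of combinations which contain all of the components" means; the question does not make the rule fully explicit. The function name partitions and its parameters are hypothetical.

```python
import random


def partitions(items, min_size=1, max_size=4, min_blocks=2, max_blocks=3):
    """Yield every way to split items into blocks within the given limits.

    Each item either joins an existing block or opens a new one, so each
    partition is produced exactly once and no pruning pass is needed.
    """
    items = list(items)

    def extend(remaining, blocks):
        if not remaining:
            if (min_blocks <= len(blocks) <= max_blocks
                    and all(len(b) >= min_size for b in blocks)):
                yield ["".join(b) for b in blocks]
            return
        head, rest = remaining[0], remaining[1:]
        # Option 1: add the next item to an existing block that has room.
        for i, block in enumerate(blocks):
            if len(block) < max_size:
                yield from extend(rest, blocks[:i] + [block + [head]] + blocks[i + 1:])
        # Option 2: start a new block, if another block is allowed.
        if len(blocks) < max_blocks:
            yield from extend(rest, blocks + [[head]])

    yield from extend(items, [])


combos = [".".join(p) for p in partitions("abcdef")]
random.shuffle(combos)  # randomize the output order, as asked in the question
```

With the size and count limits lifted (max_size=6, min_blocks=1, max_blocks=6), this enumerates all set partitions of the six components.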
