One-hot encoding lists in a pandas DataFrame

I have a pandas dataframe:
import pandas as pd
d={'col1':[[1,2,3],[4,5,6]],'col2':[[7,8,9],[10,11,12]]}
df=pd.DataFrame(d)
which results in a DataFrame whose cells are lists:
        col1          col2
0  [1, 2, 3]     [7, 8, 9]
1  [4, 5, 6]  [10, 11, 12]
However, I want to apply a one-hot encoder over the lists in the cells of the DataFrame, so that each value inside a list is treated independently rather than the whole list being treated as one string.
How would I implement this? My actual DataFrame contains lists of 500 items and has 4,000 unique values.

I think you can use stack to create a Series, then cast the lists to strings with astype, remove the [] with str.strip, and finally call str.get_dummies:
df = df.stack().astype(str).str.strip('[]').str.get_dummies(sep=', ')
print (df)
1 10 11 12 2 3 4 5 6 7 8 9
0 col1 1 0 0 0 1 1 0 0 0 0 0 0
col2 0 0 0 0 0 0 0 0 0 1 1 1
1 col1 0 0 0 0 0 0 1 1 1 0 0 0
col2 0 1 1 1 0 0 0 0 0 0 0 0
One column only:
df = df['col1'].astype(str).str.strip('[]').str.get_dummies(sep=', ')
print (df)
1 2 3 4 5 6
0 1 1 1 0 0 0
1 0 0 0 1 1 1
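For the larger case mentioned in the question (lists of 500 items and 4,000 unique values), an alternative worth sketching is scikit-learn's MultiLabelBinarizer, which encodes a column of lists directly without the string round-trip. A minimal sketch, assuming scikit-learn is available and shown for one column only:

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

d = {'col1': [[1, 2, 3], [4, 5, 6]], 'col2': [[7, 8, 9], [10, 11, 12]]}
df = pd.DataFrame(d)

mlb = MultiLabelBinarizer()
# one indicator column per unique value found in the lists of 'col1'
encoded = pd.DataFrame(mlb.fit_transform(df['col1']),
                       columns=mlb.classes_,
                       index=df.index)
print(encoded)

MultiLabelBinarizer(sparse_output=True) returns a SciPy sparse matrix instead, which may be preferable with 4,000 indicator columns.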

How can I find the shortest path between specific items in a matrix?

I have to find the shortest path between a '1' element of the matrix and a '2' element, crossing only through the '0' elements. I first thought of using the Lee algorithm, but it would take too much space given that the matrix can have up to 101 elements.
This is an example of an input (I already know the dimensions of the matrix):
1 0 0 0 2 2 0
0 1 1 0 3 1 3
3 3 3 3 0 0 0
2 0 3 3 0 0 0
2 2 0 3 0 1 1
2 0 0 0 0 1 0
The output is 4, the shortest path being:
1 0 0 0 2 2 0
0 1 1 0 3 1 3
3 3 3 3 0 0 0
2 0 3 3 0 0 0
2 2 0 3 0 1 1
2 * * * * 1 0
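The usual technique for this kind of grid problem is a breadth-first search, which only needs a visited set and a queue, both proportional to the matrix size. Below is a minimal Python sketch; the specific start and target cells, and counting the crossed '0' cells as the path length, are assumptions based on the starred example (it prints 4 for the grid above):

from collections import deque

def shortest_path(grid, start, target):
    """BFS from `start` (a '1' cell) to `target` (a '2' cell), moving only
    through '0' cells; returns how many '0' cells the shortest path crosses,
    or None if the target cannot be reached."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), crossed = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                if (nr, nc) == target:
                    return crossed              # '0' cells stepped on so far
                if grid[nr][nc] == 0:
                    seen.add((nr, nc))
                    queue.append(((nr, nc), crossed + 1))
    return None

grid = [[1, 0, 0, 0, 2, 2, 0],
        [0, 1, 1, 0, 3, 1, 3],
        [3, 3, 3, 3, 0, 0, 0],
        [2, 0, 3, 3, 0, 0, 0],
        [2, 2, 0, 3, 0, 1, 1],
        [2, 0, 0, 0, 0, 1, 0]]
print(shortest_path(grid, start=(5, 5), target=(5, 0)))   # prints 4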

time series sliding window with occurrence counts

I am trying to get occurrence counts between two timestamp values. For example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time range into time windows of size (1 + 30) / 30.
Then I want to know how many A, B, and C occurrences fall in each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrence counts.
The problem is that this breakdown takes too long, because the code iterates through the entire master table for every window in order to slice the data, even though the data is already sorted:
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while (wstart <= maximum):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            #print (row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use a map-style function and prevent Python from looping through the whole table on every iteration.
This is part of a big dataset and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using a pure pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
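The grouping above is keyed on the raw timestamps. If the times must first be bucketed into fixed-width windows, as in the original while loop, one possible sketch (assuming integer timestamps and a window width of 1, as described in the question) bins with pd.cut and then counts with pd.crosstab; the final reindex keeps empty windows as all-zero rows:

import numpy as np
import pandas as pd

# the same toy data as in the question
df = pd.DataFrame({'time':   [1, 4, 5, 9, 18, 30, 30],
                   'letter': ['A', 'B', 'C', 'C', 'B', 'A', 'B']})

width = 1                                       # window size from the question
edges = np.arange(df['time'].min(), df['time'].max() + 2 * width, width)
starts = edges[:-1]                             # label each window by its start time
df['window'] = pd.cut(df['time'], bins=edges, labels=starts, right=False)

# one row per window, one column per letter; windows with no events become all zeros
counts = pd.crosstab(df['window'], df['letter']).reindex(starts, fill_value=0)
print(counts)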

SAS sgplot: different symbols and colours by group

The following code produces a scatter plot in which, as you can see, the group= option results in different colours for the data points.
Question: How can I also have different symbols for the two groups?
proc sgplot data=test;
scatter x=time y=Y / group=group;
run;
group time Y
0 0 10085.472039
0 0 10085.472039
0 0 10085.472039
0 1 9950.3642122
0 2 9817.0663279
0 4 9555.8037259
0 6 9301.4941325
0 8 9053.9525066
0 8 9053.9525066
0 8 9053.9525066
1 0 2954.7558871
1 0 2954.7558871
1 0 2954.7558871
1 1 2987.6191302
1 2 3020.8478832
1 4 3088.4182255
1 6 3157.4999815
1 8 3228.1269586
1 8 3228.1269586
1 8 3228.1269586
0 0 3929.2678194
0 0 3929.2678194
0 0 3929.2678194
0 1 3903.7639936
0 2 3878.4257063
0 4 3828.2414563
0 6 3778.7065572
0 8 3729.8126068
0 8 3729.8126068
0 8 3729.8126068
1 0 2694.5952697
1 0 2694.5952697
1 0 2694.5952697
1 1 2580.159876
1 2 2470.5843807
1 4 2265.1962804
1 6 2076.8827929
1 8 1904.2244475
1 8 1904.2244475
1 8 1904.2244475
Using http://www.ats.ucla.edu/stat/sas/faq/gr2grps_new.htm:
symbol1 v=star c=red h=1;
symbol2 v=triangle c=blue h=1;
proc gplot data=temp;
plot y*time=group;
run;
quit;

generating combinations of combinations

I'm trying to write code that takes the components (i.e., a-f), builds combinations (combos) one, two, three, or four units long from these six components, and then produces non-duplicating combinations of combinations (combo.combos) which contain all of the components (i.e., [ab + cdef and ac + bde + f] but not [ae + bc + df and aef + bc + d]).
It would be nice if this code could allow me to 1) input the number of components, 2) input the min and max unit length per combo, 3) input the min and max number of combos per combo.combo, and 4) randomize the output list of combo.combos.
Maybe start with some kind of iteration loop to generate each version of the 720 possible component combinations (a-f) and then start pruning that list based on the set limiting parameters? I've got some working knowledge of python and will get started, but any tips or suggestions are most welcome.
combo.combo a b c d e f
a.bcdef 1 1 1 1 1 1
ab.cdef 1 1 1 1 1 1
abc.def 1 1 1 1 1 1
abcd.ef 1 1 1 1 1 1
abcde.f 1 1 1 1 1 1
a.b.cdef 1 1 1 1 1 1
a.bc.def 1 1 1 1 1 1
a.bcd.ef 1 1 1 1 1 1
a.bcde.f 1 1 1 1 1 1
ab.c.def 1 1 1 1 1 1
I've found a lot of code which will generate combinations and permutations, but not combinations of combinations. I've included a binary matrix for the combination components, but I am stuck on how to proceed from here, or on whether this matrix is a false start (although it is a helpful visual aid).
combo a b c d e f
a 1 0 0 0 0 0
b 0 1 0 0 0 0
c 0 0 1 0 0 0
d 0 0 0 1 0 0
e 0 0 0 0 1 0
f 0 0 0 0 0 1
ab 1 1 0 0 0 0
ac 1 0 1 0 0 0
ad 1 0 0 1 0 0
ae 1 0 0 0 1 0
af 1 0 0 0 0 1
bc 0 1 1 0 0 0
bd 0 1 0 1 0 0
be 0 1 0 0 1 0
bf 0 1 0 0 0 1
cd 0 0 1 1 0 0
ce 0 0 1 0 1 0
cf 0 0 1 0 0 1
de 0 0 0 1 1 0
df 0 0 0 1 0 1
ef 0 0 0 0 1 1
abc 1 1 1 0 0 0
abd 1 1 0 1 0 0
abe 1 1 0 0 1 0
abf 1 1 0 0 0 1
acd 1 0 1 1 0 0
ace 1 0 1 0 1 0
acf 1 0 1 0 0 1
ade 1 0 0 1 1 0
adf 1 0 0 1 0 1
aef 1 0 0 0 1 1
bcd 0 1 1 1 0 0
bce 0 1 1 0 1 0
bcf 0 1 1 0 0 1
bde 0 1 0 1 1 0
bdf 0 1 0 1 0 1
bef 0 1 0 0 1 1
cde 0 0 1 1 1 0
cdf 0 0 1 1 0 1
cef 0 0 1 0 1 1
def 0 0 0 1 1 1
abcd 1 1 1 1 0 0
abce 1 1 1 0 1 0
abcf 1 1 1 0 0 1
abde 1 1 0 1 1 0
abdf 1 1 0 1 0 1
abef 1 1 0 0 1 1
acde 1 0 1 1 1 0
acdf 1 0 1 1 0 1
acef 1 0 1 0 1 1
adef 1 0 0 1 1 1
bcde 0 1 1 1 1 0
bcdf 0 1 1 1 0 1
bcef 0 1 1 0 1 1
bdef 0 1 0 1 1 1
cdef 0 0 1 1 1 1
The approach which first comes to mind is this:
1. generate all the combinations using the given components (which you already did :) )
2. treat the resulting combinations as a new set of components (so instead of a, b, ..., f your set will contain a, ab, abc, ...)
3. generate all the combinations from the second set
4. from the new set of combinations, keep only those which satisfy your condition (it's not very clear from your example what the constraint is)
This, of course, has sky-high exponential complexity, since you'll have to backtrack twice and step 3 has way more possibilities.
It's very possible that there's a more efficient algorithm, starting from the constraint ("non duplicating combinations of combinations which contain all of the components").
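Reading "non-duplicating combinations of combinations which contain all of the components" as partitions of the component set, a small Python sketch could enumerate them directly instead of generating everything and filtering. The min/max group sizes below follow the question's "one to four units" limit, and a min/max number of combos per combo.combo could be added as a filter on the number of groups; the function and parameter names are only illustrative:

from itertools import combinations

def partitions(items, min_size=1, max_size=4):
    """Yield every split of `items` into non-overlapping groups of
    min_size..max_size elements that together use every item exactly once."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # fixing which group contains `first` guarantees each split is produced once
    for size in range(min_size, max_size + 1):
        for others in combinations(rest, size - 1):
            remaining = [x for x in rest if x not in others]
            for tail in partitions(remaining, min_size, max_size):
                yield [(first,) + others] + tail

for combo_combo in partitions(list('abcdef')):
    print('.'.join(''.join(group) for group in combo_combo))

For six components this list is small; as noted above, the count grows extremely quickly as more components are added.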

How to rearrange vector to be cols not rows?

I am solving systems of equations using Armadillo. I make a matrix from one array of doubles, specifying the rows and columns. The problem is that Armadillo doesn't read it in the order I build the array (it's a vector that is then converted to an array), so I need to manipulate the vector.
To be clear, it takes a vector with these values:
2 0 0 0 2 1 1 1 0 1 1 0 3 0 0 1 1 1 1 0 0 1 0 1 2
And it makes this matrix:
2 1 1 1 0
0 1 0 1 1
0 1 3 1 0
0 0 0 1 1
2 1 0 0 2
But I want this matrix:
2 0 0 0 2
1 1 1 0 1
1 0 3 0 0
1 1 1 1 0
0 1 0 1 2
How do I manipulate my vector to make it like this?
I feel as if you are looking for a transposition of the matrix. Armadillo fills a matrix from an auxiliary array in column-major order, so build the matrix as you do now and then transpose it with .t(); see the relevant Armadillo documentation.