Ranking each row to create mask dataframe - python-2.7

Sample dataframe df is defined as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(500, 8)), columns=list('ABCDEFGH'))
Within each row, I want to find the 5 largest values, mark those columns as 1, and set the rest to NaN.
df looks like this:
df.head()
A B C D E F G H
0 6.598436 44.318800 18.064752 13.418329 17.145434 6.696975 14.757765 8.797826
1 3.593140 14.571717 16.292330 28.390669 35.289606 -4.273124 20.519388 25.137833
2 36.777253 34.360523 28.020462 15.356690 22.038938 14.960303 15.225555 34.691981
3 18.623122 27.184421 -5.320215 31.694895 21.156375 9.947077 20.257575 21.035659
4 11.864725 30.458160 13.509029 27.037195 20.581043 25.371691 1.094735 28.703618
Desired output is:
df_output.head()
A B C D E F G H
0 nan 1 1 1 1 nan 1 nan
1 nan nan 1 1 1 nan 1 1
2 1 1 1 nan 1 nan nan 1
3 nan 1 nan 1 1 nan 1 1
4 nan 1 nan 1 1 1 nan 1

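# Rank values within each row (axis=1), largest first; method='first' breaks ties by column order
df_output = df.rank(axis=1, ascending=False, method='first')
# Ranks above 5 fall outside the top 5
df_output[df_output > 5] = np.nan
# The remaining ranks (1 to 5) become the marker value
df_output[df_output <= 5] = 1.0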
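The same mask can also be built without overwriting the ranks in place. The following is a minimal sketch, not the original answer's method: it assumes a boolean intermediate (named top5 here, a hypothetical name) is acceptable, and it uses only the first two sample rows shown above so the result can be compared with the desired output.

```python
import numpy as np
import pandas as pd

# First two rows of the sample frame shown in the question.
df = pd.DataFrame(
    [[6.598436, 44.318800, 18.064752, 13.418329, 17.145434, 6.696975, 14.757765, 8.797826],
     [3.593140, 14.571717, 16.292330, 28.390669, 35.289606, -4.273124, 20.519388, 25.137833]],
    columns=list('ABCDEFGH'))

# True where a value is among the 5 largest of its row.
top5 = df.rank(axis=1, ascending=False, method='first').le(5)

# 1.0 where True, NaN everywhere else.
df_output = top5.astype(float).where(top5)
print(df_output)
```

Keeping the boolean frame around can be handy if the mask is later used for selection rather than display.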

I have a table in a CSV file and want to make a pivot table

user item tag weight
1 1 1 1
1 1 2 1
1 1 3 1
1 2 5 1
1 3 2 1
2 1 2 1
2 1 4 1
3 1 2 1
1 1 4 0
When I create the pivot table:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
index = pd.MultiIndex.from_product([df.user,df.tag])
ratings_df = pd.pivot_table(df, index=['user','tag'] ,columns=['item'], aggfunc=np.max)
print(ratings_df)
Output:
         weight
item          1    2    3
user tag
1    1        1  NaN  NaN
     2        1  NaN    1
     3        1  NaN  NaN
     5      NaN    1  NaN
2    2        1  NaN  NaN
     4        1  NaN  NaN
3    2        1  NaN  NaN
But I want the pivot table to display every tag for each user.
In the table, tag 4 does not exist for user 1. If a tag does not exist for a user, it should still be displayed (here tag 4) with 0 entries.
Please help me.
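One possible approach is to reindex the pivot table against the full user/tag product. This is a sketch under assumptions: the data is typed inline instead of read from test.csv, only the eight rows that match the printed output are used, and the names ratings, full_index and ratings_full are hypothetical.

```python
import pandas as pd

# Inline stand-in for test.csv (assumption: the eight rows behind the output above).
df = pd.DataFrame({
    'user':   [1, 1, 1, 1, 1, 2, 2, 3],
    'item':   [1, 1, 1, 2, 3, 1, 1, 1],
    'tag':    [1, 2, 3, 5, 2, 2, 4, 2],
    'weight': [1, 1, 1, 1, 1, 1, 1, 1],
})

ratings = pd.pivot_table(df, index=['user', 'tag'], columns='item',
                         values='weight', aggfunc='max')

# Every (user, tag) combination, whether or not it occurs in the data.
full_index = pd.MultiIndex.from_product(
    [sorted(df['user'].unique()), sorted(df['tag'].unique())],
    names=['user', 'tag'])

# Rows that were missing are added and filled with 0;
# NaN cells in rows that already existed are left untouched.
ratings_full = ratings.reindex(full_index, fill_value=0)
print(ratings_full)
```

If the existing NaN cells should also read 0, chaining .fillna(0) after the reindex would do that.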

pandas.DataFrame: How to divide row by row [python]

I want to divide row[i] by row[i+1] in a pandas.DataFrame:
row[i] = row[i] / row[i+1]
for example:
1 2 3 4
4 2 6 2
8 5 3 1
the result is
0.25 1 0.5 2
0.5 0.4 2 2
You can use div to divide the DataFrame by a copy of itself shifted up one row with shift(-1), then remove the trailing NaN row with dropna:
print (df)
a b c d
0 1 2 3 4
1 4 2 6 2
2 8 5 3 1
print (df.div(df.shift(-1), axis=1))
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
2 NaN NaN NaN NaN
df = df.div(df.shift(-1), axis=1).dropna(how='all')
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
Another way to remove the last row is to select with iloc (starting again from the original df):
df = df.div(df.shift(-1), axis=1).iloc[:-1]
print (df)
a b c d
0 0.25 1.0 0.5 2.0
1 0.50 0.4 2.0 2.0
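Both variants above reassign df, so each one is meant to start from the original frame. A small sketch that keeps the input under a separate name (orig, a hypothetical name) makes it easy to compare the two ways of dropping the last row:

```python
import pandas as pd

# The example frame from the answer, kept under its own name so it is never overwritten.
orig = pd.DataFrame({'a': [1, 4, 8], 'b': [2, 2, 5], 'c': [3, 6, 3], 'd': [4, 2, 1]})

# Each row divided by the row below it; the last row has no successor and becomes NaN.
ratio = orig.div(orig.shift(-1))

by_dropna = ratio.dropna(how='all')  # drops rows that are entirely NaN
by_iloc = ratio.iloc[:-1]            # drops the last row by position

print(by_dropna.equals(by_iloc))
```

The two results agree for this input. They could differ if the data itself produced an all-NaN row somewhere other than the end, since dropna would remove that row as well.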

Reshape pandas dataframe which has lists as values

I have a pandas dataframe which has lists as values. I would like to transform this dataframe into the format shown in the expected result. The dataframe is large (1 million rows).
import pandas as pd
import numpy as np
df = pd.DataFrame(
[[['A', 'Second'], [], 'N/A', [6]],
[[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
columns=list('ABCD')
)
df.replace('N/A', np.nan, inplace=True)
df
A B C D
0 [A,Second] [] NaN [6]
1 [2,3] [3,4,6] [3,4,5,7] [2,6,3,4]
Expected result
0 A A
0 A Second
0 D 6
1 A 2
1 A 3
1 B 3
1 B 4
1 B 6
1 C 3
1 C 4
1 C 5
1 C 7
1 D 2
1 D 6
1 D 3
1 D 4
You can use a double stack:
df1 = df.stack()
df = pd.DataFrame(df1.values.tolist(), index=df1.index).stack() \
       .reset_index(level=2, drop=True).reset_index()
df.columns = list('abc')
print (df)
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
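Alternatively, starting again from the original df, apply pd.Series to each stacked list:
df.stack().apply(pd.Series).stack().reset_index(level=2, drop=True) \
  .rename_axis(['a', 'b']).reset_index(name='c')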
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
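On pandas 0.25 or later, Series.explode offers another route to the same shape. This is a sketch rather than part of the original answers; it assumes the frame is built with NaN already in place of 'N/A', and the name out is hypothetical.

```python
import numpy as np
import pandas as pd

# Same data as the question, with NaN already in place of 'N/A'.
df = pd.DataFrame({
    'A': [['A', 'Second'], [2, 3]],
    'B': [[], [3, 4, 6]],
    'C': [np.nan, [3, 4, 5, 7]],
    'D': [[6], [2, 6, 3, 4]],
})

out = (df.stack()                 # one list per (row, column) pair
         .explode()               # one element per line; empty lists become NaN
         .dropna()                # discard the NaN left by empty or missing cells
         .rename_axis(['a', 'b'])
         .reset_index(name='c'))
print(out)
```

This avoids building an intermediate wide frame per list, which may matter at a million rows, though that has not been measured here.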

How to filter rows with zero in some columns from the dataframe?

In pandas, how do I filter out rows with value zero in certain columns?
I need to remove the rows in which all values (except the one in the first column) are zero.
In [70]:
# construct some dummy data
df = pd.DataFrame({'a':np.random.randn(5), 'b':[1,2,1,0,0], 'c':[0,0,0,0,0], 'd':[0,0,0,0,1]})
df
Out[70]:
a b c d
0 -1.125360 1 0 0
1 -0.485210 2 0 0
2 -1.461206 1 0 0
3 -0.121767 0 0 0
4 0.168165 0 0 1
In [82]:
# mask where values are not 0
mask = df[df.drop('a', axis=1) != 0]
mask
Out[82]:
a b c d
0 NaN 1 NaN NaN
1 NaN 2 NaN NaN
2 NaN 1 NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN 1
In [94]:
# drop NaN values with a threshold of 1 valid value, and use the index to select those rows
df.loc[mask.dropna(thresh=1).index]
Out[94]:
a b c d
0 -1.125360 1 0 0
1 -0.485210 2 0 0
2 -1.461206 1 0 0
4 0.168165 0 0 1
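The same filter can be written as a single boolean row mask, without going through NaN. A minimal sketch, assuming the first column is named 'a' as in the dummy data (keep and result are hypothetical names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5), 'b': [1, 2, 1, 0, 0],
                   'c': [0, 0, 0, 0, 0], 'd': [0, 0, 0, 0, 1]})

# True for rows with at least one non-zero value outside column 'a'.
keep = (df.drop('a', axis=1) != 0).any(axis=1)

result = df[keep]
print(result)
```

Row 3 is the only one dropped, since b, c and d are all zero there.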

pandas pivot table using index data of dataframe

I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot uses the index as values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
  index
x     1   2   3
y
0     0 NaN NaN
1   NaN  10 NaN
2   NaN  20 NaN
3   NaN NaN  30
4    40 NaN  50
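Passing values explicitly gives a frame with plain column labels instead of the extra 'index' level shown above. A short sketch, assuming the index is unnamed so that reset_index calls the new column 'index' (pivoted is a hypothetical name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]},
                  index=np.arange(6) * 10)

# reset_index() turns the unnamed index into a column called 'index',
# which can then be named as the values of the pivot.
pivoted = df.reset_index().pivot(index='y', columns='x', values='index')
print(pivoted)
```

If the index has a name, that name would be the one to pass as values instead.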