Fill missing values of one column from another column in pandas - python-2.7

I have two columns in my pandas dataframe.
I want to fill the missing values of Credit_History column (dtype : int64) with values of Loan_Status column (dtype : int64).

You can try fillna or combine_first:
df.Credit_History = df.Credit_History.fillna(df.Loan_Status)
Or:
df.Credit_History = df.Credit_History.combine_first(df.Loan_Status)
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Credit_History':[1,2,np.nan, np.nan],
                   'Loan_Status':[4,5,6,8]})
print (df)
   Credit_History  Loan_Status
0             1.0            4
1             2.0            5
2             NaN            6
3             NaN            8
df.Credit_History = df.Credit_History.combine_first(df.Loan_Status)
print (df)
   Credit_History  Loan_Status
0             1.0            4
1             2.0            5
2             6.0            6
3             8.0            8
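A side note on the dtypes: a column that contains NaN cannot stay int64 — pandas upcasts it to float64, which is why the filled sample prints 1.0 rather than 1. A minimal sketch of casting back to integers once every missing value has been filled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Credit_History': [1, 2, np.nan, np.nan],
                   'Loan_Status': [4, 5, 6, 8]})

# NaN forces Credit_History to float64; int64 cannot hold missing values.
# After the fill there are no NaNs left, so astype back to int64 is safe.
df.Credit_History = df.Credit_History.fillna(df.Loan_Status).astype('int64')
```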

Related

Python Pandas Data frame creation

I tried to create a data frame df using the below code :
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print df
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
While trying to create the same data frame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print df
   MUL1  MUL2
0   NaN   NaN
1   NaN   NaN
Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed rather than all six.
Also, please show the correct way to create the data frame above using the columns argument of the pandas DataFrame method.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1,  2],
       [ 2,  4],
       [ 3,  6],
       [ 4,  8],
       [ 5, 10],
       [ 6, 12]])
If the columns argument is removed, each Series becomes one row:
df = pd.DataFrame([s,t])
print (df)
   0  1  2  3   4   5
0  1  2  3  4   5   6
1  2  4  6  8  10  12
If columns are then specified but a name does not exist in the data, that column is filled with NaNs:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
     0  MUL2
0  1.0   NaN
1  2.0   NaN
A better approach is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
   MUL2  MUL1
0     2     1
1     4     2
2     6     3
3     8     4
4    10     5
5    12     6
More information is in the DataFrame documentation.
Another solution uses concat - the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
   Col1  Col2  Col3
0     1     2     3
1     2     4     6
You are getting NaN because a list of Series is treated as rows, and the Series' index labels (0 through 5) become the column labels of the resulting frame; columns = ["MUL1","MUL2"] then selects columns that do not exist, so every value is NaN. Only two rows are displayed because each of the two Series becomes a single row.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
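Completing that sketch — after the transpose, each original list becomes a column, so the column names line up (Col1/Col2 are just illustrative names):

```python
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]

# stacking gives shape (2, 3); transposing makes each list a column, shape (3, 2)
data = np.array([a, b]).transpose()
df = pd.DataFrame(data, columns=['Col1', 'Col2'])
```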
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
   Col1  Col2
0     1     2
1     2     4
2     3     6

In pandas, I have 2 columns (ID and Name). If an ID is assigned to more than one name, how do I replace duplicates with the first occurrence?

[Image of the csv file with the two columns]
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1':['a','a','b'],
                   'Salesperson_1_ID':[4,5,6]})
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 5
2             b                 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
  Salesperson_1  Salesperson_1_ID
0             a                 4
2             b                 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 4
2             b                 6
Pandas.groupby.first()
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1_ID').first()
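Note that groupby(...).first() moves the grouping key into the index. If the key should remain an ordinary column, a reset_index() afterwards restores it — a sketch using the sample data from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})

# first() keeps the first row per group; reset_index turns the
# grouping key back into a regular column
out = df.groupby('Salesperson_1').first().reset_index()
```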

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you also need to set a different value where the condition is false:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
                   'Z':['xyz',5,6],
                   'C':[7,8,9]})
print (df)
   C  X    Z
0  7  1  xyz
1  8  2    5
2  9  3    6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
   C  X    Z    Y
0  7  1  xyz  abc
1  8  2    5  NaN
2  9  3    6  NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
   C  X    Z    Y   Y1
0  7  1  xyz  abc  abc
1  8  2    5  NaN  klm
2  9  3    6  NaN  klm
You can use existing column values as the replacement too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
   C  X    Z    Y   Y2
0  7  1  xyz  abc  abc
1  8  2    5  NaN    8
2  9  3    6  NaN    9
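For completeness, Series.where gives the same result as the np.where call with column values: it keeps each value where the condition holds and substitutes the replacement elsewhere. A sketch on the same sample:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Z': ['xyz', 5, 6],
                   'C': [7, 8, 9]})

# keep C wherever Z != 'xyz'; substitute 'abc' in the matching rows
df['Y2'] = df.C.where(df.Z != 'xyz', 'abc')
```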

How to fill missing data in a data frame based on grouped objects?

I have a dataset with some columns which I am using for grouping the database. I have some other numerical columns in the same dataset with some missing values. I want to fill the missing values of a column with the mean of the group in which the missing entry lies.
Name of Pandas dataset=data
Col on which groups would be based=['A','B']
Col that needs to be imputed with group based means: ['C']
I think you can use groupby with transform:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,1,3],
                   [1,1,9],
                   [1,1,np.nan],
                   [2,2,8],
                   [2,1,4],
                   [2,2,np.nan],
                   [2,2,5]]
                  , columns=list('ABC'))
print df
   A  B    C
0  1  1  3.0
1  1  1  9.0
2  1  1  NaN
3  2  2  8.0
4  2  1  4.0
5  2  2  NaN
6  2  2  5.0
df['C'] = df.groupby(['A', 'B'])['C'].transform(lambda x: x.fillna( x.mean() ))
print df
   A  B    C
0  1  1  3.0
1  1  1  9.0
2  1  1  6.0
3  2  2  8.0
4  2  1  4.0
5  2  2  6.5
6  2  2  5.0
[df[i].fillna(df[i].mean(),inplace=True) for i in df.columns]
This fills the NaN in column C with 5.8, which is the mean of column C.
Output
print df
   A  B    C
0  1  1  3.0
1  1  1  9.0
2  1  1  5.8
3  2  2  8.0
4  2  1  4.0
5  2  2  5.8
6  2  2  5.0
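The list comprehension above is run only for its side effects; the same whole-column-mean fill can be written as one vectorized call (a sketch — note that, unlike the accepted answer, this uses the overall mean of column C, not the per-group mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3], [1, 1, 9], [1, 1, np.nan],
                   [2, 2, 8], [2, 1, 4], [2, 2, np.nan], [2, 2, 5]],
                  columns=list('ABC'))

# fillna accepts a Series of per-column fill values; df.mean() skips NaN,
# so column C's NaNs are replaced by mean(3, 9, 8, 4, 5) = 5.8
filled = df.fillna(df.mean())
```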

Get row-index of the last non-NaN value in each column of a pandas data frame

How can I return the row index location of the last non-nan value for each column of the pandas data frame and return the locations as a pandas dataframe?
Use notnull with idxmax to get the index of the last non-NaN value. (Be aware that idxmax returns the index of the maximum value, so this only coincides with the last non-NaN row when that row happens to hold the column's maximum, as it does in this sample.)
In [22]:
df = pd.DataFrame({'a':[0,1,2,np.nan], 'b':[np.nan, 1, np.nan, 3]})
df
Out[22]:
    a   b
0   0 NaN
1   1   1
2   2 NaN
3 NaN   3
In [29]:
df[pd.notnull(df)].idxmax()
Out[29]:
a 2
b 3
dtype: int64
EDIT
Actually as correctly pointed out by #Caleb you can use last_valid_index which is designed for this:
In [3]:
df = pd.DataFrame({'a':[3,1,2,np.NaN], 'b':[np.NaN, 1,np.NaN, -1]})
df
Out[3]:
    a   b
0   3 NaN
1   1   1
2   2 NaN
3 NaN  -1
In [6]:
df.apply(pd.Series.last_valid_index)
Out[6]:
a 2
b 3
dtype: int64
If you want the row index of the last non-nan (and non-none) value, here is a one-liner:
>>> df = pd.DataFrame({
...     'a':[5,1,2,np.nan],
...     'b':[np.nan, 6,np.nan, 3]})
>>> df
    a   b
0   5 NaN
1   1   6
2   2 NaN
3 NaN   3
>>> df.apply(lambda column: column.dropna().index[-1])
a 2
b 3
dtype: int64
Explanation:
df.apply in this context applies a function to each column of the dataframe. I am passing it a function that takes as its argument a column, and returns the column's last non-null index.
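One caveat with the dropna().index[-1] one-liner: a column that is entirely NaN has an empty index after dropna, so index[-1] raises IndexError, whereas last_valid_index simply returns a missing value for that column. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 1, 2, np.nan],
                   'c': [np.nan, np.nan, np.nan, np.nan]})

# last_valid_index yields None for the all-NaN column instead of raising;
# the dropna().index[-1] version would fail on column 'c'
result = df.apply(pd.Series.last_valid_index)
```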