Create new columns for the duplicate records - Python (python-2.7)

I have an input file that is being generated at runtime of this form:
Case 1:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1
The generated file can also be of this form:
Case 2:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
3,0123459876,A3,1000,1
Expected Output:
Case 1:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
Case 2:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 NaN None NaN NaN
In the input file there may be 0, 1, or 2 rows (but never more than 2) with the same Number (1234567890). I'm trying to summarize these 2 rows into a single row (as shown in the output file).
I would like to convert my input file into the above structure. How can I do this? I'm really new to pandas. Please be so kind as to help me with this. Thanks in advance.
In Case 2:
The structure of the output file must remain the same, i.e., the column names should be the same.

I think you need to:
first create a new column with cumcount to number the rows within each Numbers group,
then reshape by set_index + unstack,
and finally convert the MultiIndex in the columns to a flat Index with a list comprehension:
df['g'] = df.groupby('Numbers').cumcount()  # 0 for the first row of each Number, 1 for the second
df = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df.columns]  # e.g. ('ID', 0) -> 'ID_1'
df = df.reset_index()
print (df)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3.0 A3 1000.0 1.0 NaN None NaN NaN
1 1234567890 1.0 A1 200.0 3.0 2.0 A2 150.0 3.0
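For completeness, a sketch of how df could be loaded from the generated file before running the snippet above (input.csv is a hypothetical name; with default dtypes the Numbers column is parsed as int, which is why 0123459876 shows up as 123459876 in the output):
import pandas as pd

# hypothetical file name; the file has the header shown in the question
df = pd.read_csv('input.csv')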
EDIT:
To convert to int you can use a custom function that converts only if there is no error, so columns containing NaNs are left unchanged:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
df1 = df1.apply(f).reset_index()
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
EDIT1:
Since there can be only 1 or 2 rows per group, the _2 columns may not exist at all after unstack (Case 2), so you can use reindex_axis to make sure all columns are always present:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
cols = ['ID_1','P_ID_1','Cores_1','Count_1','ID_2','P_ID_2','Cores_2','Count_2']
df1 = df1.apply(f).reindex_axis(cols, axis=1).reset_index()
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN NaN NaN NaN
1 1234567890 1 A1 200 3 NaN NaN NaN NaN
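Note: reindex_axis was deprecated in later pandas versions; a sketch of the equivalent call using reindex with the same cols list as above:
df1 = df1.apply(f).reindex(columns=cols).reset_index()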

Related

Counting specific values in rows in CSV

I have a csv file:
A D1 B D2 C D3 E
1 action 0.5 action 0.35 null a
2 0 0.75 0 0.45 action b
3 action 1 action 0.85 action c
I want to count the number of occurrences of the 'action' keyword in each row and add a new column holding that count. So the output file would be something like this:
A D1 B D2 C D3 E TotalAction
1 action 0.5 action 0.35 null a 2
2 0 0.75 0 0.45 action b 1
3 action 1 action 0.85 action c 3
What is the best way to go forward using Pandas? Thanks
You could use apply across the rows with str.contains for that keyword:
In [21]: df.apply(lambda x: x.str.contains('action').sum(), axis=1)
Out[21]:
0 2
1 1
2 3
dtype: int64
df['TotalAction'] = df.apply(lambda x: x.str.contains('action').sum(), axis=1)
In [23]: df
Out[23]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
EDIT
However, you can do it more easily and much faster with isin and then a sum across the rows:
In [45]: df.isin(['action']).sum(axis=1)
Out[45]:
0 2
1 1
2 3
dtype: int64
Note: You need to wrap your string keyword in a list.
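A quick sketch of why the list matters (the exact error text may differ between pandas versions):
df.isin(['action'])   # OK: a list of values to test membership against
df.isin('action')     # TypeError: only list-like or dict-like objects are allowed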
You can also use select_dtypes (to select only the string columns) in conjunction with .sum(axis=1):
In [95]: df['TotalAction'] = (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
In [96]: df
Out[96]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
Timing against a 30K-row DF:
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [6]: df.shape
Out[6]: (30000, 7)
In [4]: %timeit df.apply(lambda x: x.str.contains('action').sum(), axis=1)
1 loop, best of 3: 7.89 s per loop
In [5]: %timeit (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
100 loops, best of 3: 7.08 ms per loop
In [7]: %timeit df.isin(['action']).sum(axis=1)
10 loops, best of 3: 22.8 ms per loop
Conclusion: apply(...) is about 1114 times slower than the select_dtypes() method.
Explanation:
In [92]: df.select_dtypes(include=[object])
Out[92]:
D1 D2 D3 E
0 action action null a
1 0 0 action b
2 action action action c
In [93]: df.select_dtypes(include=[object]) == 'action'
Out[93]:
D1 D2 D3 E
0 True True False False
1 False False True False
2 True True True False
In [94]: (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
Out[94]:
0 2
1 1
2 3
dtype: int64

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to Python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you also need to set a different value for the non-matching rows:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
'Z':['xyz',5,6],
'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can use another column's values as the replacement too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y2
0 7 1 xyz abc abc
1 8 2 5 NaN 8
2 9 3 6 NaN 9
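As a side note (my own addition, not part of the original answer), the same result as Y2 can be produced without numpy by using Series.mask, which replaces values where the condition is True:
df['Y3'] = df['C'].mask(df.Z == 'xyz', 'abc')
print (df[['Z', 'Y2', 'Y3']])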

Conditional join Pandas.Dataframe

I am trying to partially join two dataframes:
import pandas
import numpy
entry1= pandas.datetime(2014,6,1)
entry2= pandas.datetime(2014,6,2)
df1=pandas.DataFrame(numpy.array([[1,1],[2,2],[3,3],[3,3]]), columns=['zick','zack'], index=[entry1, entry1, entry2, entry2])
df2=pandas.DataFrame(numpy.array([[2,3],[3,3]]), columns=['eins','zwei'], index=[entry1, entry2])
I tried
df1 = df1[(df1['zick']>= 2) & (df1['zick'] < 4)].join(df2['eins'])
but this doesn't work. After joining, the values of df1['eins'] are expected to be [NaN, 2, 3, 3].
How can I do it? I'd like to do it in place, without df copies.
I think this is what you actually meant to use:
df1 = df1.join(df2['eins'])
mask = (df1['zick'] >= 2) & (df1['zick'] < 4)
df1.loc[~mask, 'eins'] = numpy.nan
df1
yielding:
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
The issue you were having is that you were joining the filtered dataframe, not the original one, so there was no place for a NaN to appear (every remaining row satisfied your filter).
EDIT:
Considering new inputs in the comments below, here is another approach.
Create an empty column that will be updated with values from the second dataframe:
df1['eins'] = numpy.nan
print df1
print df2
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 NaN
2014-06-02 3 3 NaN
2014-06-02 3 3 NaN
eins zwei
2014-06-01 2 3
2014-06-02 3 3
Set the filter, and set the values in the column to be updated that satisfy the filter to 0:
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print df1
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 0
2014-06-02 3 3 0
2014-06-02 3 3 0
Update df1 in place with df2 values (only values equal to 0 will be updated):
df1.update(df2, filter_func=lambda x: x == 0)
print df1
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
Now, if you want to change the filter and do the update again, it will not change previously updated values:
mask = (df1['zick']>= 1) & (df1['zick'] == 1)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print df1
zick zack eins
2014-06-01 1 1 0
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
df1.update(df2, filter_func=lambda x: x == 0)
print df1
zick zack eins
2014-06-01 1 1 2
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3

Get row-index of the last non-NaN value in each column of a pandas data frame

How can I return the row index location of the last non-NaN value for each column of a pandas dataframe, and return the locations as a pandas dataframe?
Use notnull and specifically idxmax to get the index values of the non-NaN values:
In [22]:
df = pd.DataFrame({'a':[0,1,2,np.nan], 'b':[np.nan, 1, np.nan, 3]})
df
Out[22]:
a b
0 0 NaN
1 1 1
2 2 NaN
3 NaN 3
In [29]:
df[pd.notnull(df)].idxmax()
Out[29]:
a 2
b 3
dtype: int64
EDIT
Actually, as correctly pointed out by @Caleb, you can use last_valid_index, which is designed for this:
In [3]:
df = pd.DataFrame({'a':[3,1,2,np.NaN], 'b':[np.NaN, 1,np.NaN, -1]})
df
Out[3]:
a b
0 3 NaN
1 1 1
2 2 NaN
3 NaN -1
In [6]:
df.apply(pd.Series.last_valid_index)
Out[6]:
a 2
b 3
dtype: int64
If you want the row index of the last non-NaN (and non-None) value, here is a one-liner:
>>> df = pd.DataFrame({
...     'a': [5, 1, 2, np.nan],
...     'b': [np.nan, 6, np.nan, 3]})
>>> df
a b
0 5 NaN
1 1 6
2 2 NaN
3 NaN 3
>>> df.apply(lambda column: column.dropna().index[-1])
a 2
b 3
dtype: int64
Explanation:
df.apply in this context applies a function to each column of the dataframe. I am passing it a function that takes as its argument a column, and returns the column's last non-null index.
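One caveat (my own note, not from the original answer): column.dropna().index[-1] raises an IndexError for a column that is entirely NaN, whereas last_valid_index simply returns None for such a column. A guarded sketch of the one-liner:
df.apply(lambda column: column.dropna().index[-1] if column.notnull().any() else None)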

Pandas read_table using wrong column as index

I'm trying to make a dataframe from a URL whose contents are tab-delimited. However, pandas is using the industry_code column as the index.
dff = pd.read_table('http://download.bls.gov/pub/time.series/ce/ce.industry')
will output
industry_code naics_code publishing_status industry_name display_level selectable sort_sequence
0 - B Total nonfarm 0 T 1 NaN
5000000 - A Total private 1 T 2 NaN
6000000 - A Goods-producing 1 T 3 NaN
7000000 - B Service-providing 1 T 4 NaN
8000000 - A Private service-providing 1 T 5 NaN
Easy!
table_location = 'http://download.bls.gov/pub/time.series/ce/ce.industry'
dff = pd.read_table(table_location, index_col=False)
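A quick check (a sketch; the printed frame depends on the live file):
# industry_code should now be an ordinary column rather than the index
print (dff.columns.tolist())
print (dff.head())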