I have a CSV file:
A D1 B D2 C D3 E
1 action 0.5 action 0.35 null a
2 0 0.75 0 0.45 action b
3 action 1 action 0.85 action c
I want to count the number of occurrences of the 'action' keyword in each row and add a new column holding that count. The output file would then look something like this:
A D1 B D2 C D3 E TotalAction
1 action 0.5 action 0.35 null a 2
2 0 0.75 0 0.45 action b 1
3 action 1 action 0.85 action c 3
What is the best way to do this using pandas? Thanks.
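For a self-contained reproduction (a sketch; it assumes the file is whitespace-delimited as shown, and keeps the literal string null instead of parsing it as missing):

import io

import pandas as pd

# Inline copy of the sample; in practice, point read_csv at your actual file.
data = """A D1 B D2 C D3 E
1 action 0.5 action 0.35 null a
2 0 0.75 0 0.45 action b
3 action 1 action 0.85 action c"""

# sep=r"\s+" splits on runs of whitespace; keep_default_na=False keeps 'null' as text
df = pd.read_csv(io.StringIO(data), sep=r"\s+", keep_default_na=False)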
You could use apply across the rows with str.contains for that keyword:
In [21]: df.apply(lambda x: x.str.contains('action').sum(), axis=1)
Out[21]:
0 2
1 1
2 3
dtype: int64
df['TotalAction'] = df.apply(lambda x: x.str.contains('action').sum(), axis=1)
In [23]: df
Out[23]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
EDIT
Alternatively, you can do it more simply and much faster with isin and then sum across the rows:
In [45]: df.isin(['action']).sum(axis=1)
Out[45]:
0 2
1 1
2 3
dtype: int64
Note: you need to wrap the string keyword in a list.
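Putting it together, the new column can be assigned directly from the isin result (a one-liner using the df above):

df['TotalAction'] = df.isin(['action']).sum(axis=1)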
You can use select_dtypes (to select only the string columns) in conjunction with .sum(axis=1):
In [95]: df['TotalAction'] = (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
In [96]: df
Out[96]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
Timing against a 30K-row DataFrame:
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [6]: df.shape
Out[6]: (30000, 7)
In [4]: %timeit df.apply(lambda x: x.str.contains('action').sum(), axis=1)
1 loop, best of 3: 7.89 s per loop
In [5]: %timeit (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
100 loops, best of 3: 7.08 ms per loop
In [7]: %timeit df.isin(['action']).sum(axis=1)
10 loops, best of 3: 22.8 ms per loop
Conclusion: apply(...) is roughly 1,100 times slower than the select_dtypes() approach.
Explanation:
In [92]: df.select_dtypes(include=[object])
Out[92]:
D1 D2 D3 E
0 action action null a
1 0 0 action b
2 action action action c
In [93]: df.select_dtypes(include=[object]) == 'action'
Out[93]:
D1 D2 D3 E
0 True True False False
1 False False True False
2 True True True False
In [94]: (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
Out[94]:
0 2
1 1
2 3
dtype: int64
Related
I have an input file that is generated at runtime, in this form:
Case 1:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1
The generated file can also be of this form:
Case 2:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
3,0123459876,A3,1000,1
Expected Output:
Case 1:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
Case 2:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 NaN None NaN NaN
In the input file there may be 0, 1, or 2 rows (but never more than 2) with the same Numbers value (e.g. 1234567890). I am trying to summarize those rows into one single row (as shown in the output file).
I would like to convert my input file into the above structure. How can I do this? I'm really new to pandas. Please be so kind as to help me with this. Thanks in advance.
In Case 2, the structure of the output file must remain the same, i.e. the column names should stay the same.
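For reference, a minimal reproduction of the Case 1 input (a sketch; Numbers is parsed as an integer here, which is why the leading zero of 0123459876 disappears in the output):

import io

import pandas as pd

case1 = """ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1"""

df = pd.read_csv(io.StringIO(case1))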
I think you need to:
first create a new column with cumcount to number the rows within each Numbers group,
then reshape with set_index + unstack,
and finally convert the MultiIndex columns to a flat Index with a list comprehension.
df['g'] = df.groupby('Numbers').cumcount()
df = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df.columns]
df = df.reset_index()
print (df)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3.0 A3 1000.0 1.0 NaN None NaN NaN
1 1234567890 1.0 A1 200.0 3.0 2.0 A2 150.0 3.0
EDIT:
To convert to int, you can use a custom function that converts only when no error is raised, so columns with NaNs are left unchanged:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
df1 = df1.apply(f).reset_index()
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
EDIT1:
There may be only 1 or 2 rows per group, so you can use reindex_axis to make sure all expected columns are present:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
cols = ['ID_1','P_ID_1','Cores_1','Count_1','ID_2','P_ID_2','Cores_2','Count_2']
df1 = df1.apply(f).reindex_axis(cols, axis=1).reset_index()
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN NaN NaN NaN
1 1234567890 1 A1 200 3 NaN NaN NaN NaN
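Note that reindex_axis was deprecated and later removed in newer pandas versions; the same column alignment can be done with reindex, so the last step above can be written as (a sketch against the df1 and cols defined above):

df1 = df1.apply(f).reindex(columns=cols).reset_index()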
I am using a pandas dataframe and I want to delete observations with the same name after they meet the condition (cond = 1).
My dataset looks like:
person med cond
A a 0
A b 0
A a 1
A d 0
A e 0
B a 0
B c 1
C e 1
C f 0
D a 0
D f 0
I want to get this:
person med cond
A a 0
A b 0
A a 1
B a 0
B c 1
C e 1
D a 0
D f 0
I want the code to first check whether the next line has the same person name, then check whether the condition is met (cond = 1), and if so drop all following lines with the same name.
Can someone help me with this?
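For reference, the sample above can be reproduced with:

import pandas as pd

df = pd.DataFrame({
    'person': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'med':    ['a', 'b', 'a', 'd', 'e', 'a', 'c', 'e', 'f', 'a', 'f'],
    'cond':   [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
})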
You can groupby on the df, reference the column of interest in the lambda, and then call reset_index(drop=True) to remove the redundant index:
In [38]:
df.groupby('person').apply( lambda x: x.loc[:x['cond'].idxmax()] if len(x[x['cond']==0]) != len(x) else x).reset_index(drop=True)
Out[38]:
person med cond
0 A a 0
1 A b 0
2 A a 1
3 B a 0
4 B c 1
5 C e 1
6 D a 0
7 D f 0
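For large frames, a vectorised alternative (a sketch that gives the same result as the groupby/apply above, not the original answer's code): take the cumulative sum of cond within each person and keep every row up to and including the first cond == 1.

# running count of cond == 1 hits within each person
hits = df.groupby('person')['cond'].cumsum()

# keep rows before the first hit (hits == 0) and the first hit itself
keep = (hits == 0) | ((hits == 1) & (df['cond'] == 1))
result = df[keep].reset_index(drop=True)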
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to change the data frame into the shape the OrderedDict describes.
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think the logic is quite complex to express in pandas. How can I solve it? Thanks.
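For reference, the sample frame can be built like this:

import pandas as pd

df = pd.DataFrame({
    'pop': [1.8, 1.9, 3.9, 2.9, 2.0],
    'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'value1': [2000001, 2001001, 2002100, 2001003, 2002004],
    'value2': [2100345, 1000524, 1000242, 1234567, 1420000],
})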
First, your OrderedDict repeats the key 1, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k, v in d.items():
    for k1, v1 in v.items():
        if k == 1:
            df[k1] = df.value1.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1, dct_2 = (OrderedDict([('value1_1', [1, 2]), ('value1_2', [3, 4]), ('value1_3', [5, 7])]),
                OrderedDict([('value2_1', [1, 1]), ('value2_2', [2, 5]), ('value2_3', [6, 7])]))
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed elements of each list, since string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
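If the new columns should be integers (as in the expected output) rather than zero-padded strings, they can be cast afterwards, e.g.:

new_cols = ['value1_1', 'value1_2', 'value1_3', 'value2_1', 'value2_2', 'value2_3']
df[new_cols] = df[new_cols].astype(int)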
How can I return the row index location of the last non-NaN value for each column of a pandas DataFrame, and return the locations as a pandas DataFrame?
Use notnull and then idxmax to get the index values of the non-NaN values:
In [22]:
df = pd.DataFrame({'a': [0, 1, 2, np.nan], 'b': [np.nan, 1, np.nan, 3]})
df
Out[22]:
a b
0 0 NaN
1 1 1
2 2 NaN
3 NaN 3
In [29]:
df[pd.notnull(df)].idxmax()
Out[29]:
a 2
b 3
dtype: int64
EDIT
Actually, as correctly pointed out by @Caleb, you can use last_valid_index, which is designed for this:
In [3]:
df = pd.DataFrame({'a': [3, 1, 2, np.nan], 'b': [np.nan, 1, np.nan, -1]})
df
Out[3]:
a b
0 3 NaN
1 1 1
2 2 NaN
3 NaN -1
In [6]:
df.apply(pd.Series.last_valid_index)
Out[6]:
a 2
b 3
dtype: int64
If you want the row index of the last non-NaN (and non-None) value, here is a one-liner:
>>> df = pd.DataFrame({
...     'a': [5, 1, 2, np.nan],
...     'b': [np.nan, 6, np.nan, 3]})
>>> df
a b
0 5 NaN
1 1 6
2 2 NaN
3 NaN 3
>>> df.apply(lambda column: column.dropna().index[-1])
a 2
b 3
dtype: int64
Explanation:
df.apply in this context applies a function to each column of the dataframe. I am passing it a function that takes as its argument a column, and returns the column's last non-null index.
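One caveat worth noting: if a column is entirely NaN, column.dropna().index[-1] raises an IndexError, whereas last_valid_index returns None. A quick sketch:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.last_valid_index())  # None
# s.dropna().index[-1]       # would raise an IndexError on an all-NaN column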
Given the code below, I am trying to figure out whether the row order within each group will always remain the same as in the original dataframe.
It looks like the order within each group is preserved for my small example, but what if I have a dataframe with ~1 million records? Does pandas guarantee this, or do I need to take care of it myself?
Code:
import numpy as np
import pandas as pd
N = 10
df = pd.DataFrame(index=range(N))
df['A'] = [int(x) // 5 for x in np.random.randn(N) * 10.0]
df['B'] = [int(x) // 5 for x in np.random.randn(N) * 10.0]
df['v'] = np.random.randn(N)
def show_x(x):
    print(x)
    print("----------------")
df.groupby('A').apply(show_x)
print("===============")
print(df)
Output:
A B v
6 -4 -1 -2.047354
[1 rows x 3 columns]
----------------
A B v
6 -4 -1 -2.047354
[1 rows x 3 columns]
----------------
A B v
8 -3 0 -1.190831
[1 rows x 3 columns]
----------------
A B v
0 -1 -1 0.456397
9 -1 -2 -1.329169
[2 rows x 3 columns]
----------------
A B v
1 0 0 0.663928
2 0 2 0.626204
7 0 -3 -0.539166
[3 rows x 3 columns]
----------------
A B v
4 2 2 -1.115721
5 2 1 -1.905266
[2 rows x 3 columns]
----------------
A B v
3 4 -1 0.751016
[1 rows x 3 columns]
----------------
===============
A B v
0 -1 -1 0.456397
1 0 0 0.663928
2 0 2 0.626204
3 4 -1 0.751016
4 2 2 -1.115721
5 2 1 -1.905266
6 -4 -1 -2.047354
7 0 -3 -0.539166
8 -3 0 -1.190831
9 -1 -2 -1.329169
[10 rows x 3 columns]
If you are using apply, not only is the order not guaranteed, but as you've found it can call the function on the same group more than once (to decide which "path" to take / what type of result to return). So if your function has side effects, don't do this!
I recommend simply iterating through the groupby object!
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 5 6
In [13]: g = df.groupby('A')
In [14]: for key, sub_df in g:
             print("key =", key)
             print(sub_df)
             print('')  # apply whatever function you want
key = 1
A B
0 1 2
1 1 4
key = 5
A B
2 5 6
Note that this iteration is ordered (the same as the levels); see g.grouper._get_group_keys():
In [21]: g.grouper.levels
Out[21]: [Int64Index([1, 5], dtype='int64')]
It's sorted by default (there's a sort kwarg on groupby), though it's not clear what this actually means if it's not a numeric dtype.
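If you want the groups themselves to come out in order of first appearance rather than sorted key order, pass sort=False when grouping (a small sketch using the df above):

for key, sub_df in df.groupby('A', sort=False):
    # group keys now appear in order of first occurrence; the row order
    # within each group is still the original frame order either way
    print("key =", key)
    print(sub_df)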