python pandas replacing column values conditional on string patterns and using split() - regex

Long-time lurker here. I've finally committed to a project involving pandas, and I need your help more than ever.
I have a dataframe like the following. Each row describes one retirement formula, which may have more than one criterion (hence e1):
index e0 e1
1 62/10 NaN
2 age 55 NaN
3 67/10 age 70
I want to make a column age that describes the minimum age. I've defined patterns for how each criterion is described. For example,
pattern1=r'.*/.*'
pattern7=r'age.[0-9].*'
and I have pattern1-pattern7.
I used the following code to extract the age portion of e0 into a new column age:
df['age']=df['e0'][(df['e0'].str.match(pattern1)==1)].apply(lambda x: str(x).split('/')[0])
which gives me
index e0 e1 age
1 62/10 NaN 62
2 age 55 NaN NaN
3 67/10 age 70 67
I want to handle other formats such as "age 55" (to extract 55, in this case), but I'm not sure how to go about it. If I do
df['age']=df['e0'][(df['e0'].str.match(pattern7)==1)].apply(lambda x: str(x).split(' ')[1])
then it's clearly wrong because I'd overwrite what's already in age and I get
index e0 e1 age
1 62/10 NaN NaN
2 age 55 NaN 55
3 67/10 age 70 NaN
I've tried other variations as far as the syntax would allow me but to no avail.
I'm a Stata user, and in Stata I'd use the replace command conditional on regexm. I'm trying to learn Python and it's been a difficult journey! I'd appreciate any help on this.
I have another (hopefully) quick question in addition: I've used the following two lines to get rid of white space in both e0 and e1.
option['e0']=option['e0'].str.strip()
option['e1']=option['e1'].str.strip()
Is there a way to address them both in one line?
Thanks a lot in advance.

This is a response to your second question (you should stick to one question per post).
df[['e0', 'e1']] = df[['e0', 'e1']].apply(lambda x: x.str.strip())
I'm not sure why you are calling the DataFrame 'option' when it was previously referred to as 'df', so I stuck with the latter.

Interesting problem. Here I pass a function that drops the NaN values and then calls sum, which concatenates the strings in each row.
We can then call the vectorised str method findall with the regex \d+, which returns all the numbers in each row as a list.
We then apply another function that converts the string numbers to ints and returns the smallest value:
In [37]:
def func(x):
    return x.dropna().sum()

def lowest(x):
    return min(list(map(int,x)))

df['min'] = df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+').apply(lowest)
df
Out[37]:
index e0 e1 min
0 1 62/10 NaN 10
1 2 age 55 NaN 55
2 3 67/10 age 70 10
Breaking the above down so you can see what is happening:
In [38]:
df[['e0','e1']].apply(lambda x: func(x), axis=1)
Out[38]:
0 62/10
1 age 55
2 67/10age 70
dtype: object
In [39]:
df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+')
Out[39]:
0 [62, 10]
1 [55]
2 [67, 10, 70]
dtype: object
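
Note that taking the minimum over every number also picks up the second number in entries such as 62/10, which is why rows 1 and 3 come out as 10 above. If the goal is to fill age one pattern at a time without overwriting earlier matches (roughly the analogue of Stata's replace ... if), one option is a boolean mask with df.loc. The following is only a minimal sketch, assuming the frame looks like the sample in the question; the mask names m1 and m7 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "e0": ["62/10", "age 55", "67/10"],
    "e1": [np.nan, np.nan, "age 70"],
})

pattern1 = r".*/.*"
pattern7 = r"age.[0-9].*"

# Start with an all-NaN float column, then fill it one pattern at a time.
df["age"] = np.nan

# Rows like "62/10": take the part before the slash.
m1 = df["e0"].str.match(pattern1, na=False)
df.loc[m1, "age"] = df.loc[m1, "e0"].str.split("/").str[0].astype(float)

# Rows like "age 55": take the token after the space.
m7 = df["e0"].str.match(pattern7, na=False)
df.loc[m7, "age"] = df.loc[m7, "e0"].str.split(" ").str[1].astype(float)

print(df)
```

Because each assignment only touches the rows selected by its mask, values written by an earlier pattern are left alone. The same steps could be repeated for e1 and the remaining patterns.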

Related

Data format and pandas

I am using Pandas to format things nicely in a tabular format
data = []
for i in range(start, end_value):
    data.append([i, value])
    # modify value in some way
print(pd.DataFrame(data))
gives me
0 1
0 38 2.500000e+05
1 39 2.700000e+05
2 40 2.916000e+05
3 41 3.149280e+05
How can I modify this to remove scientific notation and for extra points add thousands separator?
data['column_name'] = data['column_name'].apply('{0:,.2f}'.format)
thanks to John Galt's previous SO answer
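
As a small self-contained illustration of that formatting call, here is a sketch that assumes the list has already been turned into a DataFrame; the column names n and value are invented for the example. Keep in mind that the first option produces strings, so the column can no longer be used for arithmetic.

```python
import pandas as pd

# Values taken from the first rows of the output in the question.
df = pd.DataFrame({"n": [38, 39, 40],
                   "value": [250000.0, 270000.0, 291600.0]})

# Option 1: build a formatted string column (fixed notation, thousands separator).
df["value_fmt"] = df["value"].apply("{0:,.2f}".format)

# Option 2: keep the floats and only format them when rendering the table.
text = df[["n", "value"]].to_string(float_format="{:,.2f}".format)
print(text)
```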

Reshaping Pandas data frame (a complex case!)

I want to reshape the following data frame:
index id numbers
1111 5 58.99
2222 5 75.65
1000 4 66.54
11 4 60.33
143 4 62.31
145 51 30.2
1 7 61.28
The reshaped data frame should be like the following:
id 1 2 3
5 58.99 75.65 nan
4 66.54 60.33 62.31
51 30.2 nan nan
7 61.28 nan nan
I use the following code to do this.
import pandas as pd
dtFrame = pd.read_csv("data.csv")
ids = dtFrame['id'].unique()
temp = dtFrame.groupby(['id'])
temp2 = {}
for i in ids:
    temp2[i] = temp.get_group(i).reset_index()['numbers']
dtFrame = pd.DataFrame.from_dict(temp2)
dtFrame = dtFrame.T
Although the above code solves my problem, is there a simpler way to achieve this? I tried a pivot table, but it does not solve the problem, perhaps because it requires the same number of elements in each group. Or maybe there is another way that I am not aware of; please share your thoughts.
In [69]: df.groupby(df['id'])['numbers'].apply(lambda x: pd.Series(x.values)).unstack()
Out[69]:
0 1 2
id
4 66.54 60.33 62.31
5 58.99 75.65 NaN
7 61.28 NaN NaN
51 30.20 NaN NaN
This is really quite similar to what you are doing except that the loop is replaced by apply. The pd.Series(x.values) has an index which by default ranges over integers starting at 0. The index values become the column names (above). It doesn't matter that the various groups may have different lengths. The apply method aligns the various indices for you (and fills missing values with NaN). What a convenience!
I learned this trick here.
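
A pivot can also be made to work if each row is first given its position within its group. This is an alternative sketch rather than the method above; it assumes the data from the question, and the helper column name pos is made up.

```python
import pandas as pd

# Data copied from the question (the original index column is not needed here).
df = pd.DataFrame({
    "id": [5, 5, 4, 4, 4, 51, 7],
    "numbers": [58.99, 75.65, 66.54, 60.33, 62.31, 30.2, 61.28],
})

# Number the rows within each id: 1, 2, 3, ...
df["pos"] = df.groupby("id").cumcount() + 1

# With a unique (id, pos) pair per row, pivot works even for unequal group sizes;
# missing combinations are filled with NaN.
wide = df.pivot(index="id", columns="pos", values="numbers")
print(wide)
```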

Cut off point in k-means clustering in sas

So I want to classify my data into clusters with cut-off points in SAS. The method I use is k-means clustering. (I don't mind which method is used, as long as it gives me 3 groups.)
My code for clustering:
proc fastclus data=maindat outseed=seeds1 maxcluster =3 maxiter=0;
var value resid;
run;
I have a problem with the output. I want the cut-off point for Value to be included in the output file. (I don't want the cut-off point for Resid.) Is there any way to do this in SAS?
Edit: As Joe points out, I can't achieve what I'm looking for by using k-means clustering. So is there another way? Basically, I want a cut-off point so that I can apply it to another data set.
What I have:
Cluster Value Resid
1 34 11.7668
2 38.9 0.5328
3 42.625 -13.2364
what I want:
Cluster Value Resid Cut-off Value (Integer)
1 34 11.7668 1-36
2 38.9 0.5328 36-40
3 42.625 -13.2364 40-44
My data:
data maindat;
input value Resid ;
datalines;
44 -4.300511714
44 -9.646920963
44 -15.86956805
43 -16.14857235
43 -13.05797186
43 -13.80941206
42 -3.521394503
42 -1.102526302
42 -0.137573583
42 2.669238665
42 -9.540489193
42 -19.27474303
42 -3.527077011
41 1.676464068
41 -2.238822314
41 4.663079037
41 -5.346920963
40 -8.543723186
40 0.507460641
40 0.995302284
40 0.464194011
39 4.728791571
39 5.578685423
38 2.771297564
38 7.109159247
37 15.96059456
37 2.985292226
36 -4.301136971
35 5.854674875
35 5.797294021
34 4.393329025
33 -6.622580905
32 0.268500302
27 12.23062252
;
run;
I don't think you could necessarily do this completely.
k-means clustering uses the Euclidean distance across all of the variables you provide. This means that it's not solely using value to cluster observations: it's using Resid as well.
As such, it's possible a row with a value that seems like it should go with cluster 2 should actually go with cluster 3, if the Resid value is much closer there.
In your example, if you request an out dataset, you will see this is true. A proc freq of that out dataset reveals that cluster 1 has three rows, with values 27, 37, and 38. Cluster 2 has almost all of the rows - all but 7 in total - ranging from 32 to 44. Cluster 3 ranges from 40 to 44.
As such, there's no reasonable way to define your clusters the way you ask with this method of clustering. Clusters are typically defined by their centroid, and that's what you get with the outstat dataset; you can determine which cluster a particular value should be assigned based on this.
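
To make the centroid-based assignment concrete, here is a rough sketch in Python rather than SAS (Python is used only to keep the examples on this page in one language). The centroids are the three rows from the "What I have" table in the question; the function name assign_cluster and the new points are invented for illustration.

```python
import numpy as np

# Cluster centroids (Value, Resid) from the "What I have" table.
centroids = np.array([
    [34.0, 11.7668],
    [38.9, 0.5328],
    [42.625, -13.2364],
])

def assign_cluster(points, centroids):
    """Return the 1-based index of the nearest centroid for each point."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    # Euclidean distance from every point to every centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1) + 1

# Hypothetical new observations (Value, Resid).
new_points = [[44, -15.87], [34, 10.0], [40, 0.5]]
labels = assign_cluster(new_points, centroids)
print(labels)
```

Because the distance uses both variables, two rows with the same Value can land in different clusters, which is why a single cut-off on Value alone does not fall out of this method.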

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame where the row values of this data frame match the value of the column 'steps' within this same row, minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in your original data, but the following does what you want. You will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id')['steps'].transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value. Transform applies a function and returns a series with an index aligned to the df.
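
For reference, a self-contained sketch of the same idea, assuming the example input from the question. It calls diff directly on the grouped steps column, which should be equivalent to passing 'diff' to transform, and fills only the new column rather than the whole frame.

```python
import pandas as pd

# Example input from the question.
df = pd.DataFrame({
    "user_id": [1015, 1015, 1015, 1016, 1016],
    "steps": [48, 23, 79, 10, 20],
})

# Difference to the previous row within each user; the first row of each
# user has no predecessor, so it comes back as NaN and is set to 0.
df["d_steps"] = df.groupby("user_id")["steps"].diff().fillna(0).astype(int)
print(df)
```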

Pandas quantile failing with NaN's present

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import numpy as np
import pandas as pd
index = pd.date_range('2014-01-01', periods=10, freq='D')
data = np.random.randint(0, 100, (10, 5))
data = pd.DataFrame(index=index, data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = float('nan')
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?
Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.
Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked
value in a sorted copy of V. A weighted average of the two nearest
neighbors is used if the normalized ranking does not match q exactly.
The same as the median if q=50, the same as the minimum if q=0
and the same as the maximum if q=100.
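
To see where 42 comes from, the interpolation described in the docstring can be reproduced by hand. The sketch below uses the first column from the question; the helper name linear_percentile is hypothetical and only mirrors the default linear interpolation, it is not NumPy's actual implementation.

```python
import numpy as np

def linear_percentile(values, q):
    """q-th percentile via linear interpolation between the nearest ranks."""
    v = np.sort(np.asarray(values, dtype=float))
    pos = (len(v) - 1) * q / 100.0   # fractional rank in the sorted copy
    lo = int(np.floor(pos))
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])

# First column of the example, with and without the last row.
ten = [33, 46, 71, 90, 71, 42, 80, 60, 33, 40]
nine = ten[:-1]

print(linear_percentile(ten, 25), np.percentile(ten, 25))
print(linear_percentile(nine, 25), np.percentile(nine, 25))

# np.nanpercentile ignores NaNs, matching the "non-null values" behaviour.
with_nan = np.array(nine + [np.nan], dtype=float)
print(np.nanpercentile(with_nan, 25))
```

With nine values the fractional rank is exactly 2, so the result is the third-smallest value, 42. The "median of the values below the median" rule used in the question is a different quartile convention (it gives 37.5 for the same nine values), so the two results are not expected to agree.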