Reshaping Pandas data frame (a complex case!) - python-2.7

I want to reshape the following data frame:
index  id  numbers
1111   5   58.99
2222   5   75.65
1000   4   66.54
11     4   60.33
143    4   62.31
145    51  30.2
1      7   61.28
The reshaped data frame should be like the following:
id  1      2      3
5   58.99  75.65  nan
4   66.54  60.33  62.31
51  30.2   nan    nan
7   61.28  nan    nan
I use the following code to do this.
import pandas as pd
dtFrame = pd.read_csv("data.csv")
ids = dtFrame['id'].unique()
temp = dtFrame.groupby(['id'])
temp2 = {}
for i in ids:
    temp2[i] = temp.get_group(i).reset_index()['numbers']
dtFrame = pd.DataFrame.from_dict(temp2)
dtFrame = dtFrame.T
Although the above code solves my problem, is there a simpler way to achieve this? I tried a pivot table, but it does not solve the problem, perhaps because it requires the same number of elements in each group. Or maybe there is another way that I am not aware of; please share your thoughts.

In [69]: df.groupby(df['id'])['numbers'].apply(lambda x: pd.Series(x.values)).unstack()
Out[69]:
        0      1      2
id
4   66.54  60.33  62.31
5   58.99  75.65    NaN
7   61.28    NaN    NaN
51  30.20    NaN    NaN
This is really quite similar to what you are doing except that the loop is replaced by apply. The pd.Series(x.values) has an index which by default ranges over integers starting at 0. The index values become the column names (above). It doesn't matter that the various groups may have different lengths. The apply method aligns the various indices for you (and fills missing values with NaN). What a convenience!
I learned this trick here.
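For reference, here is a self-contained sketch of that one-liner; it rebuilds the sample frame from the question instead of reading data.csv:

import pandas as pd

df = pd.DataFrame({'id': [5, 5, 4, 4, 4, 51, 7],
                   'numbers': [58.99, 75.65, 66.54, 60.33, 62.31, 30.2, 61.28]},
                  index=[1111, 2222, 1000, 11, 143, 145, 1])

# Within each id group, pd.Series(x.values) renumbers the values 0, 1, 2, ...;
# unstack() turns that inner index into columns and pads short groups with NaN.
print(df.groupby('id')['numbers'].apply(lambda x: pd.Series(x.values)).unstack())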

Related

how to create combinatorial combination of two files

I did some research but I am having difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs; let's say one holds the alphabet A-Z and the other the digits 0-100.
I want to merge the two files to have A0 to A100, up through Z.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with NumPy and build a matrix, but to no avail.
Here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
Length: 659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
Length: 1995
Here is the way I found to do it; there might be a more elegant way:
import pandas as pd

# Every (df8, df7) barcode pair as a MultiIndex, flattened into columns.
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names=["df8", "df7"])
df = pd.DataFrame(index=index).reset_index()

def concat_BC(x):
    # Concatenate the two sequences into one new column.
    return str(x["df8"]) + str(x["df7"])

df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron
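A follow-up sketch (mine, not from the comment above): the row-wise apply can be replaced with vectorised string concatenation, which is usually much faster on large frames. The tiny df7/df8 frames below stand in for the real barcode files:

import pandas as pd

df8 = pd.DataFrame({'barcode': ['GGAAGAA', 'CCAAGAA']})
df7 = pd.DataFrame({'barcode': ['CGGAAGAA', 'GCGAAGAA']})

# Build every (df8, df7) pair, then concatenate the two barcode
# columns directly instead of applying a function row by row.
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode],
                                   names=['df8', 'df7'])
pairs = pd.DataFrame(index=index).reset_index()
pairs['BC'] = pairs['df8'] + pairs['df7']
print(pairs)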

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20-year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans more than 20 years. The method I used works for the first 20-year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20-year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now that that's out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., 364 (since as_index=False makes the groupby operation create a fresh index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning the first time (the first 20 years) it has indices 0, ..., ~20*365, and the second time it has indices starting from around 20*365 and counting up.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why this behaviour is so useful.
I'll explain what happens with an example.
Assume we have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df
   0  1  2
0  1  1  2
1  2  1  1
2  0  1  2
3  0  2  0
4  2  1  0
5  0  1  2
6  2  2  1
7  1  0  2
8  0  1  0
9  1  2  0
Note that the column names are 0, 1, 2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain:
df.groupby(0, as_index=False).mean()
   0         1         2
0  0  1.250000  1.000000
1  1  1.000000  1.333333
2  2  1.333333  0.666667
(The index happens to equal the grouped-by column values only because we draw numbers between 0 and 2.) Now, when we assign to df.loc, it replaces every cell with the corresponding cell of the assigned frame, if such a cell exists. Otherwise, it leaves NaN:
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
     0         1         2
0  0.0  1.250000  1.000000
1  1.0  1.000000  1.333333
2  2.0  1.333333  0.666667
3  NaN       NaN       NaN
4  NaN       NaN       NaN
5  NaN       NaN       NaN
6  NaN       NaN       NaN
7  NaN       NaN       NaN
8  NaN       NaN       NaN
9  NaN       NaN       NaN
And when you write NaN to CSV, it leaves the cell blank.
The last piece of the puzzle is how interval_mean kept the original indices: slicing preserves the index of the original frame:
df[df[1] > 1]
   0  1  2
3  0  2  0
6  2  2  1
9  1  2  0
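One way around this alignment trap (a sketch of a possible fix, not code from the post; the file names and column layout are taken from the question) is to build the output directly from the groupby result, whose reset_index() yields a clean 0-based index:

import pandas as pd

start, end = 1975, 1994
hist = pd.read_csv('tmean_daily_1974_2005.txt', sep=r'\s+', header=None)

# Keep only the requested interval of years (year is column 0).
interval = hist[(hist[0] >= start) & (hist[0] <= end)]

# Average per day-of-year (column 2); reset_index() produces a fresh
# 0-based index, so no index alignment can bite later.
daily = interval.groupby(2).mean().reset_index()
daily = daily[sorted(daily.columns)]  # restore year/month/day column order
daily[1] = daily[1].astype(int)       # the month mean comes back as a float
daily[0] = '%s-%s' % (start, end)     # label the year column with the interval

daily.to_csv('20_yr_mean_%s_%s.txt' % (start, end),
             sep='\t', header=False, index=False)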

formula for integer less than 15

In one field I want to accept numbers that may be a decimal figure for weight, but the value should not be over 15. Previously I had the following regex:
[1-9]\d*(\.\d+)?$
This is to be entered in Google Forms. In other words, all these numbers are OK:
0.05
1.5
2
3.56
But these are not ok:
2 kg
0
15.1
16
This should work for values greater than 0 up to and including 15:
^(15(\.0+)?|(1[0-4]|[1-9])(\.\d+)?|0?\.\d*[1-9]\d*)$
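Not part of the original answer, but a quick way to sanity-check the pattern against the examples in the question is Python's re module (its syntax agrees with Google Forms' RE2 for this expression):

import re

pattern = re.compile(r'^(15(\.0+)?|(1[0-4]|[1-9])(\.\d+)?|0?\.\d*[1-9]\d*)$')

# Values the question wants accepted, then values it wants rejected.
for ok in ['0.05', '1.5', '2', '3.56', '15', '15.0']:
    assert pattern.match(ok), ok
for bad in ['2 kg', '0', '15.1', '16', '']:
    assert not pattern.match(bad), bad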

python pandas replacing column values conditional on string patterns and using split()

Long-time lurker; I finally got stuck on a project involving pandas and, more than ever, I need your help.
I have a dataframe like the following. Each row describes one retirement formula, which may have more than one criterion (hence e1):
index  e0      e1
1      62/10   NaN
2      age 55  NaN
3      67/10   age 70
I want to make a column age that describes the minimum age. I've defined patterns for how each criterion is described. For example,
pattern1=r'.*/.*'
pattern7=r'age.[0-9].*'
and I have pattern1-pattern7.
I used the following code to extract the age portion of e0 into a new column age:
df['age'] = df['e0'][df['e0'].str.match(pattern1)].apply(lambda x: str(x).split('/')[0])
which gives me
index  e0      e1      age
1      62/10   NaN     62
2      age 55  NaN     NaN
3      67/10   age 70  67
I want to address other formats such as "age 55" (to extract 55, in this case), but I'm not sure how to go about it. If I do
df['age'] = df['e0'][df['e0'].str.match(pattern7)].apply(lambda x: str(x).split(' ')[1])
then it's clearly wrong, because I'd overwrite what's already in age and I get
index  e0      e1      age
1      62/10   NaN     NaN
2      age 55  NaN     55
3      67/10   age 70  NaN
I've tried other variations, as far as the syntax would allow me, but to no avail.
I'm a Stata user, and in Stata I'd use the replace command conditional on regexm. I'm trying to learn Python and it's been a difficult journey! I'd appreciate any help on this.
I have another (hopefully) quick question in addition: I've used the following two lines to get rid of white space in both e0 and e1.
option['e0']=option['e0'].str.strip()
option['e1']=option['e1'].str.strip()
Is there a way to address them both in one line?
Thanks a lot in advance.
This is a response to your second question (you should stick to one question per post).
df[['e0', 'e1']] = df[['e0', 'e1']].apply(lambda x: x.str.strip())
I'm not sure why you are calling the DataFrame 'option' when it was previously referred to as 'df', so I stuck with the latter.
Interesting problem. Here I pass a function that drops the NaN values and then calls sum, which concatenates the row's strings.
We can then call the vectorised str method findall with the regex \d+, which returns all the numbers as a list.
We then apply another function that converts those strings to ints and returns the smallest value:
In [37]:
def func(x):
    # Drop NaN and concatenate the remaining strings in the row.
    return x.dropna().sum()

def lowest(x):
    # Convert the digit strings to ints and take the smallest.
    return min(list(map(int, x)))

df['min'] = df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+').apply(lowest)
df
Out[37]:
   index  e0      e1      min
0  1      62/10   NaN     10
1  2      age 55  NaN     55
2  3      67/10   age 70  10
Breaking the above down so you can see what is happening:
In [38]:
df[['e0','e1']].apply(lambda x: func(x), axis=1)
Out[38]:
0          62/10
1         age 55
2    67/10age 70
dtype: object
In [39]:
df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+')
Out[39]:
0        [62, 10]
1            [55]
2    [67, 10, 70]
dtype: object
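If the goal is the minimum age specifically (62 for "62/10", not the years-of-service part), a hedged alternative sketch, not taken from the answer above, is to extract with one regex per format and let fillna combine the results so neither overwrites the other:

import numpy as np
import pandas as pd

df = pd.DataFrame({'e0': ['62/10', 'age 55', '67/10'],
                   'e1': [np.nan, np.nan, 'age 70']})

# expand=False keeps each extract a Series rather than a DataFrame.
from_slash = df['e0'].str.extract(r'^(\d+)/\d+', expand=False)  # "62/10" -> "62"
from_age = df['e0'].str.extract(r'age\s*(\d+)', expand=False)   # "age 55" -> "55"

# fillna only fills the rows the first pattern left as NaN,
# so ages that were already extracted are never overwritten.
df['age'] = from_slash.fillna(from_age)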

Two Way EntityCollection Binding to a Two Dimension Data Matrix

I have a Day Structure table, which has the following columns I want to display:
DoW HoD Value
1 1 1
1 2 2
1 3 2
1 4 2
1 5 2
1 6 2
1 7 2
1 8 2
1 9 2
1 10 2
1 11 4
1 12 4
1 13 4
1 14 4
1 15 4
1 16 4
1 17 4
1 18 4
1 19 4
1 20 4
1 21 1
1 22 1
1 23 1
1 24 1
DoW is the day of the week (Monday etc.), HoD is the hour of the day, and Value is the actual value.
Now I want to bind this Day Structure entity collection directly to a control, so any changes are bound TwoWay, displayed as a two-dimensional day-by-hour matrix.
I think the best way to achieve this is to use a template and/or a converter, but I just don't know how ;)
I already read this article, but the lack of two-way binding functionality makes it not useful for me :(
I hope you can help me.
Jonny
Again I solved it on my own ;)
For this problem I created a Grid with a fixed amount of rows and columns. Inside this Grid I put an ItemsControl bound to my list of data. Inside the DataTemplate I placed a TextBox bound to the current value, and bound the Grid.Row and Grid.Column properties to the day of the week / hour of the day.
Pro:
The TextBox is TwoWay data-bound to a certain object or element.
Very easy to implement if the Row and Column properties are numeric.
Con:
Limited to a fixed amount of rows/columns.
A lot of code to write in XAML (copy and paste).
Kinda "dirty" code; it doesn't feel like the best way to do it.
I'm still open to other suggestions.