Merging Pandas rows based on index (time series) - python-2.7

I used Pandas .append() to add columns from a number of Pandas time series by their index (date). However, instead of combining all data from common dates into one row, the data look like this:
sve2_all.sort(inplace=True)
print sve2_all['20000101':'20000104']
Hgtot ng/l Q l/s DOC_mg/L Flow_mm/day MeHg ng/l Site \
2000-01-01 NaN NaN NaN 0.18 NaN NaN
2000-01-01 NaN 0.613234 NaN NaN NaN SVE
2000-01-02 NaN NaN NaN 0.18 NaN NaN
2000-01-02 NaN 0.614410 NaN NaN NaN SVE
2000-01-03 NaN NaN NaN NaN NaN 2
2000-01-03 NaN 0.617371 NaN NaN NaN SVE
2000-01-03 NaN NaN NaN NaN NaN NaN
2000-01-03 NaN NaN NaN 0.18 NaN NaN
2000-01-04 NaN 0.627733 NaN NaN NaN SVE
2000-01-04 NaN NaN NaN 0.18 NaN NaN
TOC_filt.TOC TOC_unfilt.TOC Temp oC pH
2000-01-01 NaN NaN NaN NaN
2000-01-01 NaN NaN -12.6117 NaN
2000-01-02 NaN NaN NaN NaN
2000-01-02 NaN NaN -2.3901 NaN
2000-01-03 NaN 8.224648 NaN NaN
2000-01-03 NaN NaN -5.0064 NaN
2000-01-03 NaN NaN NaN NaN
2000-01-03 NaN NaN NaN NaN
2000-01-04 NaN NaN -1.5868 NaN
2000-01-04 NaN NaN NaN NaN
[10 rows x 10 columns]
I've tried to resample this data by day using:
sve2_all.resample('D', how='mean')
And also to group by day using:
sve2_all.groupby(sve2_all.index.map(lambda t: t.day))
However, the DataFrame remains unchanged. How can I collapse the rows sharing the same date into a single row per date? Thanks.
Additional information: I tried using pd.concat() as suggested by Joris (I had to pass 0 as the axis argument, as 1 resulted in ValueError: cannot reindex from a duplicate axis) instead of .append(), but the resulting DataFrame is the same as with .append(): a non-uniform, non-monotonic time series. I think the index is the problem, but I'm not sure how to fix it. I thought some timestamps might contain hour information while others do not, so I also tried using .resample('D', how='mean') on each DataFrame before using .concat(), but it didn't make a difference.
Solution: Joris' solution was correct; I didn't realise that .resample() isn't in-place. Once the result of .resample() was assigned to a new DataFrame, Joris' suggestion produced the desired result.

The append method really does 'append' the rows to the other DataFrame; it does not merge with it based on the index labels. For that, you can use concat.
Using a toy example:
In [14]: df1 = pd.DataFrame(np.random.randn(3,2), columns=list('AB'), index=pd.date_range('2000-01-01', periods=3))
In [15]: df1
Out[15]:
A B
2000-01-01 1.532085 -1.338895
2000-01-02 -0.016784 -0.270698
2000-01-03 -1.680379 0.838287
In [16]: df2 = pd.DataFrame(np.random.randn(3,2), columns=list('CD'), index=pd.date_range('2000-01-01', periods=3))
In [17]: df2
Out[17]:
C D
2000-01-01 0.375214 -0.812558
2000-01-02 -1.099848 -0.889941
2000-01-03 1.556383 0.870608
.append appends the rows (columns of df2 that are not in df1 are added, which is the case here):
In [18]: df1.append(df2)
Out[18]:
A B C D
2000-01-01 1.532085 -1.338895 NaN NaN
2000-01-02 -0.016784 -0.270698 NaN NaN
2000-01-03 -1.680379 0.838287 NaN NaN
2000-01-01 NaN NaN 0.375214 -0.812558
2000-01-02 NaN NaN -1.099848 -0.889941
2000-01-03 NaN NaN 1.556383 0.870608
pd.concat() concatenates both DataFrames along one of the axes:
In [19]: pd.concat([df1, df2], axis=1)
Out[19]:
A B C D
2000-01-01 1.532085 -1.338895 0.375214 -0.812558
2000-01-02 -0.016784 -0.270698 -1.099848 -0.889941
2000-01-03 -1.680379 0.838287 1.556383 0.870608
Apart from that, the resample should normally work; note that it returns a new DataFrame rather than modifying the original in place.
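As the asker's solution above confirms, the key is to assign the result of .resample(). A minimal sketch of the whole flow (my illustration: the two small frames stand in for the real series, and how= is the pre-0.18 resample API used in the question; on modern pandas it would be .resample('D').mean()):
import pandas as pd
idx = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'])
df_flow = pd.DataFrame({'Flow_mm/day': [0.18, 0.18, 0.18]}, index=idx)
df_temp = pd.DataFrame({'Temp oC': [-12.6117, -2.3901, -5.0064]}, index=idx)
# Stack the frames (axis=0 avoids the duplicate-axis error seen with axis=1),
# then collapse the duplicate dates by taking the daily mean.
sve2_all = pd.concat([df_flow, df_temp], axis=0)
sve2_all = sve2_all.resample('D', how='mean')  # reassign: resample is not in-place
print sve2_all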

How to only add common index pandas data frame? [duplicate]

Suppose I have two data frames. I would like to add the values where there is a common index, and otherwise keep the value that is present. Let me illustrate this with an example:
import pandas as pd
import numpy as np
In [118]: df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
In [119]: df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=df1.index[1:3])
In [120]: df1.add(df2)
Out[120]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 NaN
However, I wanted to get
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
How can I achieve this? Moreover, is it even possible if df2.index is not a proper subset of df1.index, i.e. if
df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=pd.DatetimeIndex([df1.index[1], pd.Timestamp('2019-01-01')]))
In [131]: df2
Out[131]:
0
2018-01-02 10
2019-01-01 10
In [132]: df1.add(df2)
Out[132]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 NaN
2018-01-04 NaN
2019-01-01 NaN
But what I wanted is
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 3.0
2018-01-04 4.0
2019-01-01 10.0
Combine add with fillna:
df1.add(df2).fillna(df1)
Out[581]:
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
Alternatively, concatenate the two frames and sum over the index:
pd.concat([df1,df2]).sum(level=0)
Out[591]:
0
2018-01-01 1
2018-01-02 12
2018-01-03 3
2018-01-04 4
2019-01-01 10
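As a side note (my addition, not from the original answers): add also accepts a fill_value argument that treats a label missing from one side as 0, which handles both cases in a single call:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
df2 = pd.DataFrame(10 * np.ones_like(df1.values[1:3]),
                   index=pd.DatetimeIndex([df1.index[1], pd.Timestamp('2019-01-01')]))

# Labels present on only one side are treated as 0 before adding,
# so 2018-01-03 keeps 3.0 and 2019-01-01 keeps 10.0.
print(df1.add(df2, fill_value=0))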

Pandas column sort order

I am using rolling().agg and adding columns to a dataframe.
def add_mean_std_cols(df):
    res = df.rolling(5).agg(['mean','std'])
    res.columns = res.columns.map('_'.join)
    final = res.join(df).sort_index(axis=1)
    return final
np.random.seed(20)
df = pd.DataFrame(np.random.randint(0,9,size=(10, 6)), columns=list('ABCDEF'))
print
print df
print
df.columns = ['A', 'A/B','AB', 'AC', 'C/B', 'D']
print add_mean_std_cols(df)
The issue is the output column name order:
A A/B A/B_mean A/B_std AB AB_mean AB_std AC AC_mean AC_std A_mean A_std C/B C/B_mean C/B_std D D_mean D_std
0 3 4 NaN NaN 6 NaN NaN 7 NaN NaN NaN NaN 2 NaN NaN 0 NaN NaN
1 6 8 NaN NaN 5 NaN NaN 3 NaN NaN NaN NaN 0 NaN NaN 6 NaN NaN
2 6 0 NaN NaN 5 NaN NaN 7 NaN NaN NaN NaN 5 NaN NaN 2 NaN NaN
3 6 3 NaN NaN 3 NaN NaN 0 NaN NaN NaN NaN 6 NaN NaN 2 NaN NaN
4 3 1 3.2 3.114482 8 5.4 1.816590 0 3.4 3.507136 4.8 1.643168 2 3.0 2.449490 7 3.4 2.966479
5 6 6 3.6 3.361547 8 5.8 2.167948 2 2.4 2.880972 5.4 1.341641 1 2.8 2.588436 3 4.0 2.345208
6 2 6 3.2 2.774887 4 5.6 2.302173 6 3.0 3.316625 4.6 1.949359 4 3.6 2.073644 8 4.4 2.880972
7 6 2 3.6 2.302173 3 5.2 2.588436 1 1.8 2.489980 4.6 1.949359 5 3.6 2.073644 2 4.4 2.880972
8 1 8 4.6 2.966479 2 5.0 2.828427 4 2.6 2.408319 3.6 2.302173 4 3.2 1.643168 8 5.6 2.880972
9 6 0 4.4 3.286335 3 4.0 2.345208 4 3.4 1.949359 4.2 2.489980 0 2.8 2.167948 5 5.2 2.774887
For some reason it is sorting A/B and AB before A_mean and A_std.
The order that I would prefer is:
A A_mean A_std ...
From playing around, it seems that '_' is sorted last.
Any suggestions on how to achieve the desired order?
Thanks!
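Indeed: in ASCII, '_' (0x5F) sorts after both '/' (0x2F) and the uppercase letters, so plain lexicographic ordering interleaves the names. A quick demonstration (my illustration):
print sorted(['A_std', 'A/B', 'A', 'AB', 'A_mean'])
# -> ['A', 'A/B', 'AB', 'A_mean', 'A_std']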
In [60]: res = df.rolling(5).agg(['mean','std'])
In [61]: res.columns = res.columns.map('_'.join)
In [62]: cols = np.concatenate(list(zip(df.columns, res.columns[0::2], res.columns[1::2])))
In [63]: res.join(df).loc[:, cols]
Out[63]:
A A_mean A_std A/B A/B_mean A/B_std AB AB_mean AB_std AC AC_mean AC_std C/B C/B_mean C/B_std D D_mean \
0 3 NaN NaN 4 NaN NaN 6 NaN NaN 7 NaN NaN 2 NaN NaN 0 NaN
1 6 NaN NaN 8 NaN NaN 5 NaN NaN 3 NaN NaN 0 NaN NaN 6 NaN
2 6 NaN NaN 0 NaN NaN 5 NaN NaN 7 NaN NaN 5 NaN NaN 2 NaN
3 6 NaN NaN 3 NaN NaN 3 NaN NaN 0 NaN NaN 6 NaN NaN 2 NaN
4 3 4.8 1.643168 1 3.2 3.114482 8 5.4 1.816590 0 3.4 3.507136 2 3.0 2.449490 7 3.4
5 6 5.4 1.341641 6 3.6 3.361547 8 5.8 2.167948 2 2.4 2.880972 1 2.8 2.588436 3 4.0
6 2 4.6 1.949359 6 3.2 2.774887 4 5.6 2.302173 6 3.0 3.316625 4 3.6 2.073644 8 4.4
7 6 4.6 1.949359 2 3.6 2.302173 3 5.2 2.588436 1 1.8 2.489980 5 3.6 2.073644 2 4.4
8 1 3.6 2.302173 8 4.6 2.966479 2 5.0 2.828427 4 2.6 2.408319 4 3.2 1.643168 8 5.6
9 6 4.2 2.489980 0 4.4 3.286335 3 4.0 2.345208 4 3.4 1.949359 0 2.8 2.167948 5 5.2
D_std
0 NaN
1 NaN
2 NaN
3 NaN
4 2.966479
5 2.345208
6 2.880972
7 2.880972
8 2.880972
9 2.774887
You can join by MultiIndex and then sort_index:
def add_mean_std_cols(df):
    res = df.rolling(5).agg(['mean','std'])
    df.columns = [df.columns, [''] * len(df.columns)]
    final = res.join(df).sort_index(axis=1)
    final.columns = final.columns.map('_'.join).str.strip('_')
    return final
print (add_mean_std_cols(df))
A A_mean A_std A/B A/B_mean A/B_std AB AB_mean AB_std AC \
0 3 NaN NaN 4 NaN NaN 6 NaN NaN 7
1 6 NaN NaN 8 NaN NaN 5 NaN NaN 3
2 6 NaN NaN 0 NaN NaN 5 NaN NaN 7
3 6 NaN NaN 3 NaN NaN 3 NaN NaN 0
4 3 4.8 1.643168 1 3.2 3.114482 8 5.4 1.816590 0
5 6 5.4 1.341641 6 3.6 3.361547 8 5.8 2.167948 2
6 2 4.6 1.949359 6 3.2 2.774887 4 5.6 2.302173 6
7 6 4.6 1.949359 2 3.6 2.302173 3 5.2 2.588436 1
8 1 3.6 2.302173 8 4.6 2.966479 2 5.0 2.828427 4
9 6 4.2 2.489980 0 4.4 3.286335 3 4.0 2.345208 4
AC_mean AC_std C/B C/B_mean C/B_std D D_mean D_std
0 NaN NaN 2 NaN NaN 0 NaN NaN
1 NaN NaN 0 NaN NaN 6 NaN NaN
2 NaN NaN 5 NaN NaN 2 NaN NaN
3 NaN NaN 6 NaN NaN 2 NaN NaN
4 3.4 3.507136 2 3.0 2.449490 7 3.4 2.966479
5 2.4 2.880972 1 2.8 2.588436 3 4.0 2.345208
6 3.0 3.316625 4 3.6 2.073644 8 4.4 2.880972
7 1.8 2.489980 5 3.6 2.073644 2 4.4 2.880972
8 2.6 2.408319 4 3.2 1.643168 8 5.6 2.880972
9 3.4 1.949359 0 2.8 2.167948 5 5.2 2.774887
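Why this works (my note): a MultiIndex sorts its labels as tuples, elementwise, and the empty string in the second level sorts before 'mean' and 'std', so each original column lands directly ahead of its statistics:
pairs = [('A/B', ''), ('A', 'std'), ('A', ''), ('A', 'mean')]
print sorted(pairs)
# -> [('A', ''), ('A', 'mean'), ('A', 'std'), ('A/B', '')]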

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Let's talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., 364 (since as_index=False makes the groupby operation create a fresh index; otherwise, the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning the first time (first 20 years) it has indices 0, ..., ~20*365, and the second time it has indices starting from around 20*365 and counting up.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why this behaviour is so useful. I'll explain what happens with an example:
Assume we have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we preform groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the grouped-by values only because we drew numbers between 0 and 2.) Now, when we assign to df.loc, it replaces every cell with the corresponding cell in the assigned frame, if such a cell exists; otherwise, it leaves NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NaN to csv, it leaves the cell blank.
The last piece of the puzzle is why interval_mean keeps the original indices: boolean slicing preserves the index of the frame being sliced:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
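One way to sidestep the alignment problem entirely (a sketch of my own, not part of the original answer; it assumes the seven-column file layout described in the question) is to build the output from the groupby result rather than assigning it back into the slice:
import pandas as pd

start, end = 1975, 1994
hist_mean = pd.read_csv('tmean_daily_1974_2005.txt', sep='\s+', header=None)

# Keep the 20-year window, then average each (month, day-of-year)
# pair across the years; the result has exactly one row per day.
window = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]
daily = window.groupby([1, 2], as_index=False).mean()

daily[0] = '%s-%s' % (start, end)     # label the interval
daily = daily[[0, 1, 2, 3, 4, 5, 6]]  # restore the original column order
daily.to_csv('20_yr_mean_%s_%s.txt' % (start, end), sep='\t',
             header=False, index=False)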

how to draw a multiline chart using python pandas?

Dataframe:
Dept,Date,Que
ece,2015-06-25,96
ece,2015-06-24,89
ece,2015-06-26,88
ece,2015-06-19,87
ece,2015-06-23,82
ece,2015-06-30,82
eee,2015-06-24,73
eee,2015-06-23,71
eee,2015-06-25,70
eee,2015-06-19,66
eee,2015-06-27,60
eee,2015-06-22,56
mech,2015-06-27,10
mech,2015-06-22,8
mech,2015-06-25,8
mech,2015-06-19,7
I need a multiline chart with a grid based on the Dept column; I need each Dept as one line.
For example, for ece the line should be 96, 89, 88, 87, 82, 82, ...; likewise for the other Depts.
I think you need pivot and plot:
import matplotlib.pyplot as plt
df = df.pivot(index='Dept', columns='Date', values='Que')
print df
Date 2015-06-19 2015-06-22 2015-06-23 2015-06-24 2015-06-25 2015-06-26 \
Dept
ece 87.0 NaN 82.0 89.0 96.0 88.0
eee 66.0 56.0 71.0 73.0 70.0 NaN
mech 7.0 8.0 NaN NaN 8.0 NaN
Date 2015-06-27 2015-06-30
Dept
ece NaN 82.0
eee 60.0 NaN
mech 10.0 NaN
df.plot()
plt.show()
You can check the docs.
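One caveat (my note, not from the original answer): with Dept as the pivot index, df.plot() draws one line per Date column. To get one line per Dept over time, as the question asks, pivot with Date as the index instead. A sketch, using a few rows of the question's data so it runs standalone:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Dept': ['ece', 'ece', 'eee', 'eee', 'mech', 'mech'],
                   'Date': ['2015-06-24', '2015-06-25', '2015-06-24',
                            '2015-06-25', '2015-06-25', '2015-06-27'],
                   'Que': [89, 96, 73, 70, 8, 10]})
df['Date'] = pd.to_datetime(df['Date'])

# Date on the index -> each Dept becomes a column -> one line per Dept.
df.pivot(index='Date', columns='Dept', values='Que').plot(grid=True)
plt.show()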

Filter Pandas DataFrame by group with tag values

I want to filter a DataFrame by group: the NaNs following 'a' are supposed to be 'a' (the value works like a tag), and the NaNs following 'b' are likewise 'b'.
I have a short example:
In [1]: import pandas as pd; from numpy import nan
In [2]: df = pd.DataFrame({'group1': ['a', nan, nan, nan, nan, 'b', nan, nan, nan, nan],
                           'value1': [0.4, 1.1, 2, 3, 4, 5, 6, 7, 8, 8.8],
                           'value2': [6.4, 6.9, 7.1, 8, 9, 10, 11, 12, 13, 14]})
My desired output would be:
In [3]: df[df.group1 == 'a']
Out[3]:
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
I'd appreciate any hint!
You can use ffill to forward-fill the column:
>>> df[df['group1'].fillna(method='ffill') == 'a']
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
but, perhaps the better solution would be to forward-fill the column on the original data-frame:
>>> df['group1'].fillna(method='ffill', inplace=True)
>>> df[df['group1'] == 'a']
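Once the column is forward-filled, the tags also work with groupby, so every block can be handled in one pass; a small sketch of that idea (my addition):
import pandas as pd
from numpy import nan

df = pd.DataFrame({'group1': ['a', nan, nan, nan, nan, 'b', nan, nan, nan, nan],
                   'value1': [0.4, 1.1, 2, 3, 4, 5, 6, 7, 8, 8.8],
                   'value2': [6.4, 6.9, 7.1, 8, 9, 10, 11, 12, 13, 14]})

df['group1'] = df['group1'].fillna(method='ffill')
for tag, block in df.groupby('group1'):  # 'a' block, then 'b' block
    print(tag)
    print(block)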