How to only add common index pandas data frame? [duplicate] - python-2.7

This question already has answers here:
Adding two Series with NaNs
(3 answers)
Closed 5 years ago.
Suppose I have two data frames. I would like to add the values wherever an index is common to both, and otherwise keep the value that is present. Let me illustrate this with an example
import numpy as np
import pandas as pd
In [118]: df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
In [119]: df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=df1.index[1:3])
In [120]: df1.add(df2)
Out[120]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 NaN
However, I wanted to get
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
How can I achieve this? Moreover, is it even possible if df2.index is not a proper subset of df1.index, i.e. if
df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=pd.DatetimeIndex([df1.index[1], pd.Timestamp('2019-01-01')]))
In [131]: df2
Out[131]:
0
2018-01-02 10
2019-01-01 10
In [132]: df1.add(df2)
Out[132]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 NaN
2018-01-04 NaN
2019-01-01 NaN
But what I wanted is
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 3.0
2018-01-04 4.0
2019-01-01 10.0

Combine with fillna
df1.add(df2).fillna(df1)
Out[581]:
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
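Note that fillna(df1) only restores labels that are present in df1, so it covers the first example. A one-line alternative that also handles the second example is add with fill_value (a sketch; it assumes both frames are numeric, since a value missing on either side is treated as 0 before adding):
df1.add(df2, fill_value=0)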
Alternatively, for the general case where df2.index need not be a subset of df1.index:
pd.concat([df1,df2]).sum(level=0)
Out[591]:
0
2018-01-01 1
2018-01-02 12
2018-01-03 3
2018-01-04 4
2019-01-01 10
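A side note if you are on a newer pandas: sum(level=0) has since been deprecated, and the equivalent groupby spelling gives the same frame:
pd.concat([df1, df2]).groupby(level=0).sum()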

Related

Pandas column sort order

I am using rolling().agg and adding columns to a dataframe.
import numpy as np
import pandas as pd

def add_mean_std_cols(df):
    res = df.rolling(5).agg(['mean','std'])
    res.columns = res.columns.map('_'.join)
    final = res.join(df).sort_index(axis=1)
    return final
np.random.seed(20)
df = pd.DataFrame(np.random.randint(0,9,size=(10, 6)), columns=list('ABCDEF'))
print
print df
print
df.columns = ['A', 'A/B','AB', 'AC', 'C/B', 'D']
print add_mean_std_cols(df)
The issue is the output column name order:
A A/B A/B_mean A/B_std AB AB_mean AB_std AC AC_mean AC_std A_mean A_std C/B C/B_mean C/B_std D D_mean D_std
0 3 4 NaN NaN 6 NaN NaN 7 NaN NaN NaN NaN 2 NaN NaN 0 NaN NaN
1 6 8 NaN NaN 5 NaN NaN 3 NaN NaN NaN NaN 0 NaN NaN 6 NaN NaN
2 6 0 NaN NaN 5 NaN NaN 7 NaN NaN NaN NaN 5 NaN NaN 2 NaN NaN
3 6 3 NaN NaN 3 NaN NaN 0 NaN NaN NaN NaN 6 NaN NaN 2 NaN NaN
4 3 1 3.2 3.114482 8 5.4 1.816590 0 3.4 3.507136 4.8 1.643168 2 3.0 2.449490 7 3.4 2.966479
5 6 6 3.6 3.361547 8 5.8 2.167948 2 2.4 2.880972 5.4 1.341641 1 2.8 2.588436 3 4.0 2.345208
6 2 6 3.2 2.774887 4 5.6 2.302173 6 3.0 3.316625 4.6 1.949359 4 3.6 2.073644 8 4.4 2.880972
7 6 2 3.6 2.302173 3 5.2 2.588436 1 1.8 2.489980 4.6 1.949359 5 3.6 2.073644 2 4.4 2.880972
8 1 8 4.6 2.966479 2 5.0 2.828427 4 2.6 2.408319 3.6 2.302173 4 3.2 1.643168 8 5.6 2.880972
9 6 0 4.4 3.286335 3 4.0 2.345208 4 3.4 1.949359 4.2 2.489980 0 2.8 2.167948 5 5.2 2.774887
For some reason it is sorting A/B and AB before A_mean and A_std.
The order that I would prefer is:
A A_mean A_std ...
From experimenting, it seems that '_' sorts last.
Any suggestions on how to achieve the desired order?
Thanks!
In [60]: res = df.rolling(5).agg(['mean','std'])
In [61]: res.columns = res.columns.map('_'.join)
In [62]: cols = np.concatenate(list(zip(df.columns, res.columns[0::2], res.columns[1::2])))
In [63]: res.join(df).loc[:, cols]
Out[63]:
A A_mean A_std A/B A/B_mean A/B_std AB AB_mean AB_std AC AC_mean AC_std C/B C/B_mean C/B_std D D_mean \
0 3 NaN NaN 4 NaN NaN 6 NaN NaN 7 NaN NaN 2 NaN NaN 0 NaN
1 6 NaN NaN 8 NaN NaN 5 NaN NaN 3 NaN NaN 0 NaN NaN 6 NaN
2 6 NaN NaN 0 NaN NaN 5 NaN NaN 7 NaN NaN 5 NaN NaN 2 NaN
3 6 NaN NaN 3 NaN NaN 3 NaN NaN 0 NaN NaN 6 NaN NaN 2 NaN
4 3 4.8 1.643168 1 3.2 3.114482 8 5.4 1.816590 0 3.4 3.507136 2 3.0 2.449490 7 3.4
5 6 5.4 1.341641 6 3.6 3.361547 8 5.8 2.167948 2 2.4 2.880972 1 2.8 2.588436 3 4.0
6 2 4.6 1.949359 6 3.2 2.774887 4 5.6 2.302173 6 3.0 3.316625 4 3.6 2.073644 8 4.4
7 6 4.6 1.949359 2 3.6 2.302173 3 5.2 2.588436 1 1.8 2.489980 5 3.6 2.073644 2 4.4
8 1 3.6 2.302173 8 4.6 2.966479 2 5.0 2.828427 4 2.6 2.408319 4 3.2 1.643168 8 5.6
9 6 4.2 2.489980 0 4.4 3.286335 3 4.0 2.345208 4 3.4 1.949359 0 2.8 2.167948 5 5.2
D_std
0 NaN
1 NaN
2 NaN
3 NaN
4 2.966479
5 2.345208
6 2.880972
7 2.880972
8 2.880972
9 2.774887
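As an aside on why sort_index produced the order in the question: it is plain lexicographic string comparison, and in ASCII '/' (47) sorts before the uppercase letters (65-90) while '_' (95) sorts after them. A quick check reproduces the pattern:
In [64]: sorted(['A', 'A_mean', 'A/B', 'AB'])
Out[64]: ['A', 'A/B', 'AB', 'A_mean']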
You can join by MultiIndex and then sort_index:
def add_mean_std_cols(df):
    res = df.rolling(5).agg(['mean','std'])
    df.columns = [df.columns, [''] * len(df.columns)]
    final = res.join(df).sort_index(axis=1)
    final.columns = final.columns.map('_'.join).str.strip('_')
    return final
print (add_mean_std_cols(df))
A A_mean A_std A/B A/B_mean A/B_std AB AB_mean AB_std AC \
0 3 NaN NaN 4 NaN NaN 6 NaN NaN 7
1 6 NaN NaN 8 NaN NaN 5 NaN NaN 3
2 6 NaN NaN 0 NaN NaN 5 NaN NaN 7
3 6 NaN NaN 3 NaN NaN 3 NaN NaN 0
4 3 4.8 1.643168 1 3.2 3.114482 8 5.4 1.816590 0
5 6 5.4 1.341641 6 3.6 3.361547 8 5.8 2.167948 2
6 2 4.6 1.949359 6 3.2 2.774887 4 5.6 2.302173 6
7 6 4.6 1.949359 2 3.6 2.302173 3 5.2 2.588436 1
8 1 3.6 2.302173 8 4.6 2.966479 2 5.0 2.828427 4
9 6 4.2 2.489980 0 4.4 3.286335 3 4.0 2.345208 4
AC_mean AC_std C/B C/B_mean C/B_std D D_mean D_std
0 NaN NaN 2 NaN NaN 0 NaN NaN
1 NaN NaN 0 NaN NaN 6 NaN NaN
2 NaN NaN 5 NaN NaN 2 NaN NaN
3 NaN NaN 6 NaN NaN 2 NaN NaN
4 3.4 3.507136 2 3.0 2.449490 7 3.4 2.966479
5 2.4 2.880972 1 2.8 2.588436 3 4.0 2.345208
6 3.0 3.316625 4 3.6 2.073644 8 4.4 2.880972
7 1.8 2.489980 5 3.6 2.073644 2 4.4 2.880972
8 2.6 2.408319 4 3.2 1.643168 8 5.6 2.880972
9 3.4 1.949359 0 2.8 2.167948 5 5.2 2.774887
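Another option, as a sketch (it assumes the base column names themselves contain no '_', which holds here): sort the flattened names from the first approach with a key that groups every derived column with its base name, then select the columns in that order:
joined = res.join(df)
cols = sorted(joined.columns, key=lambda c: (c.split('_')[0], c))
print joined.loc[:, cols]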

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now to the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., 364 (since as_index=False makes the groupby operation create a fresh 0-based index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning the first time (the first 20 years) it has the indices 0, ..., ~20*365, while the second time its indices start from around 20*365 and count up.
This is a bit confusing at first, but pandas offers good documentation on index alignment, and once you read it you quickly see why the behavior is useful.
I'll try to explain what happens with an example:
Assume we have the following DataFrame:
import numpy as np

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index coincides with the values of the grouped column only because we drew numbers between 0 and 2.) Now, when we assign to df.loc, pandas replaces every cell with the corresponding cell of the assignee, if such a cell exists; otherwise it leaves NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NaN to csv, the cell is left blank.
The last piece of the puzzle is how interval_mean kept the original indices: boolean slicing preserves the index of the selected rows:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
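Given that diagnosis, a minimal fix is to avoid the aligned assignment altogether and write the grouped means directly (a sketch building on the question's variables, assuming the desired output is one row per calendar day carrying the multi-year averages):
interval = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]
daily = interval.groupby([1, 2], as_index=False).mean()  # one row per (month, day-of-year)
daily[0] = '%s-%s' % (start, end)  # replace the averaged year column with the interval label
daily = daily[hist_mean.columns]   # restore the original column order
daily.to_csv(twenty_year_fn, sep='\t', header=False, index=False)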

Merging DataFrames within a loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a folder with numerous csv files which look like this:
csv1
2006 Percent Land_Use
0 13 5.379564 Developed
1 8 25.781580 Grass/Pasture
2 4 54.265050 Crop
3 15 0.363983 Water
4 16 6.244104 Wetlands
5 6 4.691764 Forest
6 1 3.031494 Alfalfa
7 11 0.137424 Shrubland
8 5 0.003671 Vetch
9 3 0.055412 Barren
10 7 0.009531 Grass
11 12 0.036423 Tree
csv2
2007 Percent Land_Use
0 13 2.742430 Developed
1 4 56.007242 Crop
2 8 24.227963 Grass/Pasture
3 16 8.839979 Wetlands
4 6 6.181062 Forest
5 1 1.446668 Alfalfa
6 15 0.366116 Water
7 3 0.127760 Barren
8 11 0.034426 Shrubland
9 7 0.000827 Grass
10 12 0.025528 Tree
csv3
2008 Percent Land_Use
0 13 1.863809 Developed
1 8 31.455578 Grass/Pasture
2 4 57.896856 Crop
3 16 2.693929 Wetlands
4 6 4.417966 Forest
5 1 1.239176 Alfalfa
6 7 0.130849 Grass
7 15 0.266536 Water
8 11 0.004571 Shrubland
9 3 0.030731 Barren
and I want to merge them all together into one DataFrame on Land_Use
I am reading in the files like this:
import os
import pandas as pd

pth = 'G:\\'
for f in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, f))
but I can't figure out how to merge all the individual dataframes after that. I figured out how to concat them but that isn't what I want. The type of merge I want is outer.
If I were to read each csv file from its own hard-coded path I would merge them like this, but I do NOT want to write out a path for each file as there are many of them:
one=pd.read_csv(r'G:\one.csv')
two=pd.read_csv(r'G:\two.csv')
three=pd.read_csv(r'G:\three.csv')
merge=pd.merge(one,two, on=['Land_Use'], how='outer')
mergetwo=pd.merge(merge,three,on=['Land_Use'], how='outer')
In Python 3 you can use functools.reduce:
import functools
dfs = [df1,df2,df3]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2007 Percent_y 2008 Percent
0 13 5.379564 Developed 13.0 2.742430 13.0 1.863809
1 8 25.781580 Grass/Pasture 8.0 24.227963 8.0 31.455578
2 4 54.265050 Crop 4.0 56.007242 4.0 57.896856
3 15 0.363983 Water 15.0 0.366116 15.0 0.266536
4 16 6.244104 Wetlands 16.0 8.839979 16.0 2.693929
5 6 4.691764 Forest 6.0 6.181062 6.0 4.417966
6 1 3.031494 Alfalfa 1.0 1.446668 1.0 1.239176
7 11 0.137424 Shrubland 11.0 0.034426 11.0 0.004571
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.127760 3.0 0.030731
10 7 0.009531 Grass 7.0 0.000827 7.0 0.130849
11 12 0.036423 Tree 12.0 0.025528 NaN NaN
In Python 2, where reduce is a builtin:
df = reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
Working solution with glob:
import pandas as pd
import functools
import glob
pth = 'a/*.csv'
files = glob.glob(pth)
dfs = [pd.read_csv(f, sep=';') for f in files]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use', how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2008 Percent_y 2007 Percent
0 13 5.379564 Developed 13.0 1.863809 13.0 2.742430
1 8 25.781580 Grass/Pasture 8.0 31.455578 8.0 24.227963
2 4 54.265050 Crop 4.0 57.896856 4.0 56.007242
3 15 0.363983 Water 15.0 0.266536 15.0 0.366116
4 16 6.244104 Wetlands 16.0 2.693929 16.0 8.839979
5 6 4.691764 Forest 6.0 4.417966 6.0 6.181062
6 1 3.031494 Alfalfa 1.0 1.239176 1.0 1.446668
7 11 0.137424 Shrubland 11.0 0.004571 11.0 0.034426
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.030731 3.0 0.127760
10 7 0.009531 Grass 7.0 0.130849 7.0 0.000827
11 12 0.036423 Tree NaN NaN 12.0 0.025528
I am not allowed to comment, so I am unsure of exactly what you want.
You can try using one.merge(two, on=['Land_Use'], how='outer').merge(three,on=['Land_Use'], how='outer'). Let me know if you wanted something else.
If you have many dataframes, you can try the reduce function. First create a list containing all the dataframes: dataframes = [one, two, three, four, ... , twenty]. You can build the list with a list comprehension or by appending to it inside your loop.
Then, to combine them on Land_Use, you can use df_final = reduce(lambda left,right: pd.merge(left,right,on=['Land_Use'], how='outer'), dataframes)
Note: in Python 3 the reduce function lives in the functools package; in Python 2 it is a builtin.
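Putting that together with the reading loop from the question, a sketch (assuming the folder contains only the relevant csv files and each parses with the default separator):
import os
from functools import reduce  # functools.reduce also exists on Python 2.6+
import pandas as pd

pth = 'G:\\'
dataframes = [pd.read_csv(os.path.join(pth, f))
              for f in os.listdir(pth) if f.endswith('.csv')]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Land_Use', how='outer'),
                  dataframes)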

Python 2.7 CSV graph time format

I have a CSV file; one of the columns holds timestamps, but when I use numpy.genfromtxt it reads them as strings. My goal is to create a graph with a normal time format; plain seconds are preferred.
This is the array I get from the code below:
array([('0:00:00',), ('0:00:00.001000',), ('0:00:00.002000',),
('0:00:00.081000',), ('0:00:00.095000',), ('0:00:00.195000',),
('0:00:00.294000',), ...
This is my code:
col1 = numpy.genfromtxt("mycsv.csv",usecols=(1),delimiter=',',dtype=None, names=True)
The problem I am having is that the values come back as strings, but I need them in seconds (the microseconds can be kept or ignored). How can I achieve that?
If you can, the best way to work with csv files in Python is pandas; it takes care of this for you. I will assume the name of the time column is time; change it to whatever you use:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv', parse_dates=[1]) # read time as date
>>> print(df)
test1 time test2 test3
0 5 2015-08-20 00:00:00.000 10 11.7
1 5 2015-08-20 00:00:00.001 11 11.6
2 5 2015-08-20 00:00:00.002 12 11.5
3 5 2015-08-20 00:00:00.081 13 11.4
4 5 2015-08-20 00:00:00.095 14 11.3
5 5 2015-08-20 00:00:00.195 15 11.2
6 5 2015-08-20 00:00:00.294 16 11.1
>>> df['time'] -= pd.datetime.now().date() # convert to timedelta
>>> print(df)
test1 time test2 test3
0 5 00:00:00 10 11.7
1 5 00:00:00.001000 11 11.6
2 5 00:00:00.002000 12 11.5
3 5 00:00:00.081000 13 11.4
4 5 00:00:00.095000 14 11.3
5 5 00:00:00.195000 15 11.2
6 5 00:00:00.294000 16 11.1
>>> df['time'] /= np.timedelta64(1,'s') # convert to seconds
>>> print(df)
test1 time test2 test3
0 5 0.000 10 11.7
1 5 0.001 11 11.6
2 5 0.002 12 11.5
3 5 0.081 13 11.4
4 5 0.095 14 11.3
5 5 0.195 15 11.2
6 5 0.294 16 11.1
You can work with pandas dataframes (what you have here) and series (what you would from getting a single column, such as df['time']) in most of the same ways as numpy arrays, including plotting. However, if you really, really need to convert it to a numpy array, it is as easy as arr = df['time'].values.
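Since the end goal is a graph: once the column holds plain seconds, the frame plots directly (a sketch using the example column names from above; it assumes matplotlib is installed):
>>> import matplotlib.pyplot as plt
>>> df.plot(x='time', y='test2')
>>> plt.xlabel('seconds')
>>> plt.show()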
Use the datetime library. The array shown above holds 1-tuples of strings, so unpack each element; note that the bare '0:00:00' entry has no fractional part, so the format has to be chosen accordingly:
from datetime import datetime

for (s,) in array:
    fmt = '%H:%M:%S.%f' if '.' in s else '%H:%M:%S'
    t = datetime.strptime(s, fmt)
    seconds = t.hour*3600 + t.minute*60 + t.second + t.microsecond/1e6
You can use a converter for the timestamp field.
For example, suppose times.dat contains:
time
0:00:00
0:00:00.001000
0:00:00.002000
0:00:00.081000
0:00:00.095000
0:00:00.195000
0:00:00.294000
Define a converter that converts a timestamp string into the number of seconds as a floating point value:
In [5]: def convert_timestamp(s):
...: h, m, s = [float(w) for w in s.split(':')]
...: return h*3600 + m*60 + s
...:
Then use the converter in genfromtxt:
In [21]: genfromtxt('times.dat', skiprows=1, converters={0: convert_timestamp})
Out[21]: array([ 0. , 0.001, 0.002, 0.081, 0.095, 0.195, 0.294])
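The same converter applies to the multi-column file from the question (a sketch, assuming numpy is imported as np; the converters keys refer to the original column positions, so they match usecols):
In [22]: np.genfromtxt('mycsv.csv', delimiter=',', skip_header=1,
    ...:               usecols=(1,), converters={1: convert_timestamp})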

Filter Pandas DataFrame by group with tag values

I want to filter a DataFrame by group: the NaNs that follow 'a' are supposed to count as 'a' (the value acts like a tag), and the NaNs that follow 'b' likewise count as 'b'.
I have a short example:
In [1]: import pandas as pd; from numpy import nan
In [2]: df = pd.DataFrame({'group1': ['a',nan,nan,nan,nan,'b',nan,nan,nan,nan],
'value1': [0.4,1.1,2,3,4,5,6,7,8,8.8],
'value2': [6.4, 6.9,7.1,8,9,10,11,12,13,14]
})
My desired output would be:
In [3]: df[df.group1 == 'a']
Out[3]:
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
I'll appreciate any hint!
You can use ffill to forward-fill the column:
>>> df[df['group1'].fillna(method='ffill') == 'a']
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
but perhaps the better solution would be to forward-fill the column on the original DataFrame:
>>> df['group1'].fillna(method='ffill', inplace=True)
>>> df[df['group1'] == 'a']
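Once the tag column is forward-filled, other per-tag operations follow the same pattern, for example iterating over the groups (a small sketch of the same idea):
>>> df['group1'] = df['group1'].fillna(method='ffill')
>>> for tag, block in df.groupby('group1'):
...     print tag, len(block)
a 5
b 5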