Merging DataFrames within a loop [duplicate] - python-2.7

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a folder with numerous csv files which look like this:
csv1
2006 Percent Land_Use
0 13 5.379564 Developed
1 8 25.781580 Grass/Pasture
2 4 54.265050 Crop
3 15 0.363983 Water
4 16 6.244104 Wetlands
5 6 4.691764 Forest
6 1 3.031494 Alfalfa
7 11 0.137424 Shrubland
8 5 0.003671 Vetch
9 3 0.055412 Barren
10 7 0.009531 Grass
11 12 0.036423 Tree
csv2
2007 Percent Land_Use
0 13 2.742430 Developed
1 4 56.007242 Crop
2 8 24.227963 Grass/Pasture
3 16 8.839979 Wetlands
4 6 6.181062 Forest
5 1 1.446668 Alfalfa
6 15 0.366116 Water
7 3 0.127760 Barren
8 11 0.034426 Shrubland
9 7 0.000827 Grass
10 12 0.025528 Tree
csv3
2008 Percent Land_Use
0 13 1.863809 Developed
1 8 31.455578 Grass/Pasture
2 4 57.896856 Crop
3 16 2.693929 Wetlands
4 6 4.417966 Forest
5 1 1.239176 Alfalfa
6 7 0.130849 Grass
7 15 0.266536 Water
8 11 0.004571 Shrubland
9 3 0.030731 Barren
and I want to merge them all together into one DataFrame on Land_Use
I am reading in the files like this:
pth = 'G:\\'
for f in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, f))
but I can't figure out how to merge all the individual DataFrames after that. I figured out how to concat them, but that isn't what I want; the type of merge I want is an outer merge.
If I were to hard-code a path to each csv file I would merge them like this, but I do NOT want to set a path for each file, as there are many of them:
one=pd.read_csv(r'G:\one.csv')
two=pd.read_csv(r'G:\two.csv')
three=pd.read_csv(r'G:\three.csv')
merge=pd.merge(one,two, on=['Land_Use'], how='outer')
mergetwo=pd.merge(merge,three,on=['Land_Use'], how='outer')

I think in Python 3 you can use:
import functools
dfs = [df1,df2,df3]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2007 Percent_y 2008 Percent
0 13 5.379564 Developed 13.0 2.742430 13.0 1.863809
1 8 25.781580 Grass/Pasture 8.0 24.227963 8.0 31.455578
2 4 54.265050 Crop 4.0 56.007242 4.0 57.896856
3 15 0.363983 Water 15.0 0.366116 15.0 0.266536
4 16 6.244104 Wetlands 16.0 8.839979 16.0 2.693929
5 6 4.691764 Forest 6.0 6.181062 6.0 4.417966
6 1 3.031494 Alfalfa 1.0 1.446668 1.0 1.239176
7 11 0.137424 Shrubland 11.0 0.034426 11.0 0.004571
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.127760 3.0 0.030731
10 7 0.009531 Grass 7.0 0.000827 7.0 0.130849
11 12 0.036423 Tree 12.0 0.025528 NaN NaN
In Python 2:
df = reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
Working solution with glob:
import pandas as pd
import functools
import glob
pth = 'a/*.csv'
files = glob.glob(pth)
dfs = [pd.read_csv(f, sep=';') for f in files]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use', how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2008 Percent_y 2007 Percent
0 13 5.379564 Developed 13.0 1.863809 13.0 2.742430
1 8 25.781580 Grass/Pasture 8.0 31.455578 8.0 24.227963
2 4 54.265050 Crop 4.0 57.896856 4.0 56.007242
3 15 0.363983 Water 15.0 0.266536 15.0 0.366116
4 16 6.244104 Wetlands 16.0 2.693929 16.0 8.839979
5 6 4.691764 Forest 6.0 4.417966 6.0 6.181062
6 1 3.031494 Alfalfa 1.0 1.239176 1.0 1.446668
7 11 0.137424 Shrubland 11.0 0.004571 11.0 0.034426
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.030731 3.0 0.127760
10 7 0.009531 Grass 7.0 0.130849 7.0 0.000827
11 12 0.036423 Tree NaN NaN 12.0 0.025528

I am not allowed to comment, so I am not sure exactly what you want.
You can try one.merge(two, on=['Land_Use'], how='outer').merge(three, on=['Land_Use'], how='outer'). Let me know if you wanted something else.
If you have many DataFrames, you can use the reduce function. First create a list containing all of them, e.g. dataframes = [one, two, three, four, ... , twenty]. You can build the list with a list comprehension or by appending to it inside your loop, as sketched below.
Then, to combine them on Land_Use, use df_final = reduce(lambda left, right: pd.merge(left, right, on=['Land_Use'], how='outer'), dataframes).
Note: in Python 3 the reduce function lives in the functools package.
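For example, the list can be built by appending inside the reading loop from the question and then reduced in one go (a sketch along those lines; the .csv filter is an assumption in case the folder holds other files):
import os
from functools import reduce  # reduce is a builtin in Python 2; this import works on 2.6+ and 3
import pandas as pd

pth = 'G:\\'
dfs = []
for f in os.listdir(pth):
    if f.endswith('.csv'):
        dfs.append(pd.read_csv(os.path.join(pth, f)))

# outer-merge every frame on Land_Use, pairwise, left to right
merged = reduce(lambda left, right: pd.merge(left, right, on='Land_Use', how='outer'), dfs)
print(merged)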

Related

cumulative average powerbi by month

I have the dataset below.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI so that the output shows the cumulative average per subject per month.
For example
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature and Biology in Values.
Under the Format pane --> Values --> Show on rows --> select this.
This should give the view you are looking for. You can edit the value headers to your requirement.
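Not a Power BI answer, but as a cross-check of the expected table, the same cumulative averages can be computed with pandas (a sketch; the data is copied from the question, and month-end resampling is assumed to be an acceptable way to pick the value "as of" each month):
import pandas as pd

# Cumulative (expanding) mean of each subject, evaluated at the end of every month.
df = pd.DataFrame({
    'Math':       [4, 4, 5, 5, 5, 5, 5, 5, 1, 2, 5, 2],
    'Literature': [2, 5, 4, 5, 5, 5, 5, 5, 1, 3, 5, 1],
    'Biology':    [5, 4, 5, 5, 5, 5, 5, 5, 2, 3, 5, 3],
    'date': pd.to_datetime([
        '2019-08-25', '2019-08-08', '2019-08-23', '2019-08-15',
        '2019-07-19', '2019-07-15', '2019-07-03', '2019-06-26',
        '2019-06-18', '2019-06-14', '2019-05-01', '2019-04-26']),
})

out = (df.sort_values('date')
         .set_index('date')[['Math', 'Literature', 'Biology']]
         .expanding().mean()      # running average over all rows so far
         .resample('M').last()    # cumulative average as of the end of each month
         .T)                      # subjects as rows, months as columns
print(out.round(2))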

How to only add common index pandas data frame? [duplicate]

This question already has answers here:
Adding two Series with NaNs
(3 answers)
Closed 5 years ago.
Suppose I have two data frames. I would like to add the values where an index label is common to both, and otherwise keep the value that is present. Let me illustrate this with an example:
import numpy as np
import pandas as pd
In [118]: df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
In [119]: df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=df1.index[1:3])
In [120]: df1.add(df2)
Out[120]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 NaN
However, I wanted to get
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
How can I achieve this? Moreover, is it even possible if df2.index is not a proper subset of df1.index, i.e. if
df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=pd.DatetimeIndex([df1.index[1], pd.Timestamp('2019-01-01')]))
In [131]: df2
Out[131]:
0
2018-01-02 10
2019-01-01 10
In [132]: df1.add(df2)
Out[132]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 NaN
2018-01-04 NaN
2019-01-01 NaN
But what I wanted is
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 3.0
2018-01-04 4.0
2019-01-01 10.0
Combine with fillna
df1.add(df2).fillna(df1)
Out[581]:
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
Ok,
pd.concat([df1,df2]).sum(level=0)
Out[591]:
0
2018-01-01 1
2018-01-02 12
2018-01-03 3
2018-01-04 4
2019-01-01 10
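For this particular case there is also DataFrame.add with fill_value, which treats a label missing on one side as 0 before adding, so labels present in only one frame keep that frame's value (a minimal sketch built from the question's data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
df2 = pd.DataFrame(10 * np.ones_like(df1.values[1:3]), index=df1.index[1:3])

# Missing entries on either side are treated as 0 before the addition,
# so 2018-01-01 and 2018-01-04 keep their original values.
print(df1.add(df2, fill_value=0))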

Calculate the average of identical columns of several dataframes

I am trying to write a function that calculates the average of identical columns of different dataframes stored in a list:
def mean(dfs):
    # declare an empty dataframe
    df_mean = pd.DataFrame()
    # assign the first column from each raw data framework to df
    for i in range(len(dfs)):
        dfs[i].set_index(['Time'], inplace=True)
    for j in dfs[0].columns:
        for i in range(len(dfs)):
            df_mean[j] = pd.concat([df_mean, dfs[i][j]], axis=1).mean(axis=1)
    return df_mean
dfs = []
l1 = [[1,6,2,6,7],[2,3,2,6,8],[3,3,2,8,8],[4,5,2,6,8],[5,3,9,6,8]]
l2 = [[1,7,2,5,7],[2,3,0,6,8],[3,3,3,6,8],[4,3,7,6,8],[5,3,0,6,8]]
dfs.append(pd.DataFrame(l1, columns=['Time','25','50','75','100']))
dfs.append(pd.DataFrame(l2, columns=['Time','25','50','75','100']))
mean(dfs)
However, only the mean of the first column comes out right!
Option 1
Use Python's built-in sum, which will reduce the list using each object's __add__ method. Then just divide by the length of the list.
sum(dfs) / len(dfs)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Option 2
Reconstruct a DataFrame using numpy's mean function:
pd.DataFrame(
    np.mean([d.values for d in dfs], 0),
    dfs[0].index, dfs[0].columns)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Use concat on the Time-indexed list of dataframes, then group the larger dataframe by Time and take the mean:
In [275]: pd.concat([d.set_index('Time') for d in dfs]).groupby(level='Time').mean()
Out[275]:
25 50 75 100
Time
1 6.5 2.0 5.5 7.0
2 3.0 1.0 6.0 8.0
3 3.0 2.5 7.0 8.0
4 4.0 4.5 6.0 8.0
5 3.0 4.5 6.0 8.0
Or, since the Time column is common to both anyway, at least in this use case:
In [289]: pd.concat(dfs).groupby(level=0).mean()
Out[289]:
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Details
In [276]: dfs
Out[276]:
[ Time 25 50 75 100
0 1 6 2 6 7
1 2 3 2 6 8
2 3 3 2 8 8
3 4 5 2 6 8
4 5 3 9 6 8, Time 25 50 75 100
0 1 7 2 5 7
1 2 3 0 6 8
2 3 3 3 6 8
3 4 3 7 6 8
4 5 3 0 6 8]
In [277]: pd.concat([d.set_index('Time') for d in dfs])
Out[277]:
25 50 75 100
Time
1 6 2 6 7
2 3 2 6 8
3 3 2 8 8
4 5 2 6 8
5 3 9 6 8
1 7 2 5 7
2 3 0 6 8
3 3 3 6 8
4 3 7 6 8
5 3 0 6 8

Broadcasting pandas dataframe to two dimensional matrix

I have a dataframe of size 12x24 which appears as follows (a truncated version is shown):
HE 1 2 3 4 5
0 1 1.8 2.5 3.5 8.5
1 2 2.6 2.9 4.3 8.7
2 3 4.4 2.3 5.3 4.3
3 4 2.6 2.1 4.2 5.3
The column names are numbered 1 through 12 (representing months) and the rows are numbered 1 through 24 (representing hours).
I have another DateTable dataframe which has data as follows:
Date Month Hour
2001-01-01 1 1
2001-02-01 2 4
2001-01-05 1 3
2011-01-31 3 2
2012-01-01 1 5
I want to broadcast the values from 12x24 array into the DateTable to get the following:
Date Month Hour Values
2001-01-01 1 1 3.5
2001-02-01 2 4 5.3
2001-01-05 1 3 2.5
2011-01-31 3 2 2.6
2012-01-01 1 5 1.8
I envision creating some kind of MultiIndex from the 12x24 table and using it against the DateTable, but I am struggling with the syntax.
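No answer was posted here, but one common approach (a sketch, assuming the wide table is keyed by the hour in its HE column with the months as column labels; the data below is a small made-up subset) is to melt the matrix into long form and merge it onto the DateTable on Month and Hour:
import pandas as pd

# Hypothetical truncated wide table: HE = hour ending, remaining columns = months.
wide = pd.DataFrame({'HE': [1, 2, 3],
                     1: [1.8, 2.6, 4.4],
                     2: [2.5, 2.9, 2.3],
                     3: [3.5, 4.3, 5.3]})

date_table = pd.DataFrame({'Date': pd.to_datetime(['2001-01-01', '2001-02-01', '2001-01-05']),
                           'Month': [1, 2, 1],
                           'Hour': [1, 1, 3]})

# Reshape the matrix into long (Hour, Month, Values) form, then merge on the two keys.
long_form = wide.melt(id_vars='HE', var_name='Month', value_name='Values')
long_form = long_form.rename(columns={'HE': 'Hour'})
long_form['Month'] = long_form['Month'].astype(int)  # melted column labels come back as objects
result = date_table.merge(long_form, on=['Month', 'Hour'], how='left')
print(result)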

How to resample dates with Pandas item by item?

My objective is to add rows in pandas in order to replace missing data with the previous data, resampling the dates at the same time. Example:
This is what I have :
date wins losses
2015-12-19 11 5
2015-12-20 17 8
2015-12-20 10 6
2015-12-21 15 1
2015-12-25 11 5
2015-12-26 6 10
2015-12-27 10 6
2015-12-28 4 12
2015-12-29 8 11
And this is what I want :
wins losses
date
2015-12-19 11.0 5.0
2015-12-20 10.0 6.0
2015-12-21 15.0 1.0
2015-12-22 15.0 1.0
2015-12-23 15.0 1.0
2015-12-24 15.0 1.0
2015-12-25 11.0 5.0
2015-12-26 6.0 10.0
2015-12-27 10.0 6.0
2015-12-28 4.0 12.0
2015-12-29 8.0 11.0
And this is my code :
resamp = df.set_index('date').resample('D', how='last', fill_method='ffill')
It works!
But I want to do the same thing with 22 million lines (in pandas), with different dates and different IDs.
The following dataframe contains two productIds (1 and 2). I want to do the same exercise as before while keeping the time series data of every productId.
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-04 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-07 2 12
2015-12-09 2 11
This is my code :
df['date'] = pd.to_datetime(df['createdAt'])
df.set_index('date').resample('D', how='last', fill_method='ffill')
This is what I get if I use the same code! The dates are collapsed across products (only one row per date survives), which is not what I want.
createdAt productId popularity
date
2015-12-01 2015-12-01 2 5
2015-12-02 2015-12-02 2 5
2015-12-03 2015-12-03 2 10
2015-12-04 2015-12-04 2 6
2015-12-05 2015-12-05 2 6
2015-12-06 2015-12-06 2 6
2015-12-07 2015-12-07 2 12
2015-12-08 2015-12-08 2 12
2015-12-09 2015-12-09 2 11
This is what I want !
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-03 1 8
2015-12-04 1 6
2015-12-05 1 6
2015-12-06 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-02 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-05 2 6
2015-12-06 2 6
2015-12-07 2 12
2015-12-08 2 12
2015-12-09 2 11
What should I do?
Thank you
Try this, it should work :)
print df.set_index('date').groupby('productId', group_keys=False).apply(lambda df: df.resample('D').ffill()).reset_index()
This produces what you said you wanted (here assuming the datetime column has already been set as the index):
print df.groupby('productId', group_keys=False).apply(lambda df: df.resample('D').ffill()).reset_index()
createdAt productId popularity
0 2015-12-01 1 5
1 2015-12-02 1 8
2 2015-12-03 1 8
3 2015-12-04 1 6
4 2015-12-05 1 6
5 2015-12-06 1 6
6 2015-12-07 1 9
7 2015-12-01 2 5
8 2015-12-02 2 5
9 2015-12-03 2 10
10 2015-12-04 2 6
11 2015-12-05 2 6
12 2015-12-06 2 6
13 2015-12-07 2 12
14 2015-12-08 2 12
15 2015-12-09 2 11