Broadcasting pandas dataframe to two dimensional matrix - python-2.7

I have a dataframe of size 12x24 which appears as (I have displayed truncated version):
HE 1 2 3 4 5
0 1 1.8 2.5 3.5 8.5
1 2 2.6 2.9 4.3 8.7
2 3 4.4 2.3 5.3 4.3
3 4 2.6 2.1 4.2 5.3
The column names are numbered 1 through 12 (representing months) and the rows are numbered 1 through 24 (representing hours).
I have another DateTable dataframe which has data as follows:
Date Month Hour
2001-01-01 1 1
2001-02-01 2 4
2001-01-05 1 3
2011-01-31 3 2
2012-01-01 1 5
I want to broadcast the values from 12x24 array into the DateTable to get the following:
Date Month Hour Values
2001-01-01 1 1 3.5
2001-02-01 2 4 5.3
2001-01-05 1 3 2.5
2011-01-31 3 2 2.6
2012-01-01 1 5 1.8
I envision creating some kind of MultiIndex from the 12x24 table and using it against the DateTable, but I am struggling with the syntax.
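One possible approach, sketched under some assumptions: reshape the hour-by-month table into a long lookup table with melt, then left-merge it onto DateTable on Month and Hour. The names table, date_table, lookup and result are hypothetical. The data is only the truncated sample shown above (hours 1 to 4, months 1 to 4), so the output will not reproduce the illustrative Values column in the question, and the row with Hour 5 finds no match.

```python
import pandas as pd

# Truncated sample from the question: HE is the hour, columns 1..4 are months.
table = pd.DataFrame({
    "HE": [1, 2, 3, 4],
    1: [1.8, 2.6, 4.4, 2.6],
    2: [2.5, 2.9, 2.3, 2.1],
    3: [3.5, 4.3, 5.3, 4.2],
    4: [8.5, 8.7, 4.3, 5.3],
})

date_table = pd.DataFrame({
    "Date": ["2001-01-01", "2001-02-01", "2001-01-05", "2011-01-31", "2012-01-01"],
    "Month": [1, 2, 1, 3, 1],
    "Hour": [1, 4, 3, 2, 5],
})

# Long format: one row per (Hour, Month) pair.
lookup = (
    table.melt(id_vars="HE", var_name="Month", value_name="Values")
         .rename(columns={"HE": "Hour"})
)
# The month labels come from the column names; cast them so the key dtypes match.
lookup["Month"] = lookup["Month"].astype(int)

# A left merge keeps every DateTable row, in order; unmatched rows get NaN.
result = date_table.merge(lookup, on=["Month", "Hour"], how="left")
print(result)
```

With the full 24-hour, 12-month table the same merge should fill every row, provided Month and Hour in DateTable use the same numbering as the table.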

Related

How to Pivot data in Power BI and then show a line chart for the pivot-ed data

I have interest rate curve data for different dates and I want to compare them. In Excel I create a pivot table and then a chart from the pivot. How do I do the same in Power BI?
[Image: data example]
[Image: example of the data pivoted in Excel, with a filter applied and a PivotChart comparing the series]
I want to create this chart in Power BI
data in text format
SeriesName SeqId Data Value
EUROIS 1 31-Dec-21 1.1
EUROIS 2 31-Dec-21 1.2
EUROIS 3 31-Dec-21 1.3
EUROIS 4 31-Dec-21 1.4
EUROIS 5 31-Dec-21 1.5
EUREURIBOR3M 1 31-Dec-21 3.2
EUREURIBOR3M 2 31-Dec-21 3.3
EUREURIBOR3M 3 31-Dec-21 3.4
EUREURIBOR3M 4 31-Dec-21 3.5
EUREURIBOR3M 5 31-Dec-21 3.6
EUROIS 1 31-Jan-22 0.1
EUROIS 2 31-Jan-22 0.2
EUROIS 3 31-Jan-22 0.3
EUROIS 4 31-Jan-22 0.4
EUROIS 5 31-Jan-22 0.5
EUREURIBOR3M 1 31-Jan-22 2.2
EUREURIBOR3M 2 31-Jan-22 2.3
EUREURIBOR3M 3 31-Jan-22 2.4
EUREURIBOR3M 4 31-Jan-22 2.5
EUREURIBOR3M 5 31-Jan-22 2.6

cumulative average powerbi by month

I have the dataset below.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI whose output is the cumulative average per subject per month.
For example
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature and Biology in Values.
Under the Format pane --> Values --> Show on rows, turn this option on.
This should give the view you are looking for. You can edit the value headers to your requirement.
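As a cross-check of the expected numbers outside Power BI, here is a small pandas sketch. The expected table in the question is a running average over all rows up to and including each month, so the sketch divides cumulative sums by cumulative counts. The names scores, subjects and cumulative_avg are hypothetical, and this only checks the arithmetic; it is not a Power BI measure.

```python
import pandas as pd

# The twelve rows from the question (student A only).
scores = pd.DataFrame({
    "Math":       [4, 4, 5, 5, 5, 5, 5, 5, 1, 2, 5, 2],
    "Literature": [2, 5, 4, 5, 5, 5, 5, 5, 1, 3, 5, 1],
    "Biology":    [5, 4, 5, 5, 5, 5, 5, 5, 2, 3, 5, 3],
    "date": pd.to_datetime([
        "2019-08-25", "2019-08-08", "2019-08-23", "2019-08-15",
        "2019-07-19", "2019-07-15", "2019-07-03", "2019-06-26",
        "2019-06-18", "2019-06-14", "2019-05-01", "2019-04-26",
    ]),
})
subjects = ["Math", "Literature", "Biology"]

# Group by calendar month; groups come out in chronological order.
by_month = scores.groupby(scores["date"].dt.to_period("M"))[subjects]

# Running total of marks divided by running count of marks.
running_sum = by_month.sum().cumsum()
running_count = by_month.count().cumsum()

# Transpose so subjects are rows and months are columns, as in the question.
cumulative_avg = (running_sum / running_count).T
print(cumulative_avg.round(2))
```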

calculates the average of identical columns of several dataframes

I am trying to write a function that calculates the average of identical columns of different dataframes stored in a list:
def mean(dfs):
    # declare an empty dataframe
    df_mean = pd.DataFrame()
    # assign the first column from each raw data framework to df
    for i in range(len(dfs)):
        dfs[i].set_index(['Time'], inplace=True)
    for j in dfs[0].columns:
        for i in range(len(dfs)):
            df_mean[j] = pd.concat([df_mean,dfs[i][j]], axis=1).mean(axis=1)
    return df_mean
dfs = []
l1 = [[1,6,2,6,7],[2,3,2,6,8],[3,3,2,8,8],[4,5,2,6,8],[5,3,9,6,8]]
l2 = [[1,7,2,5,7],[2,3,0,6,8],[3,3,3,6,8],[4,3,7,6,8],[5,3,0,6,8]]
dfs.append(pd.DataFrame(l1, columns=['Time','25','50','75','100']))
dfs.append(pd.DataFrame(l2, columns=['Time','25','50','75','100']))
mean(dfs)
However, only the mean of the first column came out right!
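The loop goes wrong after the first column because df_mean already holds the previously averaged columns when it is concatenated with the next one, so those columns get pulled into the row-wise mean. A minimal corrected sketch of the same loop, assuming every frame shares the same Time values and column names (mean_of_frames is a hypothetical name):

```python
import pandas as pd

def mean_of_frames(dfs, key="Time"):
    """Average identically named columns across a list of DataFrames."""
    # Index copies by the key column instead of mutating the inputs in place.
    indexed = [d.set_index(key) for d in dfs]
    out = pd.DataFrame(index=indexed[0].index)
    for col in indexed[0].columns:
        # Put this one column from every frame side by side, then average row-wise.
        out[col] = pd.concat([d[col] for d in indexed], axis=1).mean(axis=1)
    return out

cols = ["Time", "25", "50", "75", "100"]
l1 = [[1, 6, 2, 6, 7], [2, 3, 2, 6, 8], [3, 3, 2, 8, 8], [4, 5, 2, 6, 8], [5, 3, 9, 6, 8]]
l2 = [[1, 7, 2, 5, 7], [2, 3, 0, 6, 8], [3, 3, 3, 6, 8], [4, 3, 7, 6, 8], [5, 3, 0, 6, 8]]
dfs = [pd.DataFrame(l1, columns=cols), pd.DataFrame(l2, columns=cols)]

averaged = mean_of_frames(dfs)
print(averaged)
```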
Option 1
Use Python's sum, which by default reduces the list using each object's __add__ method. Then just divide by the length of the list.
sum(dfs) / len(dfs)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Option 2
Reconstruct while using numpy's mean function
pd.DataFrame(
    np.mean([d.values for d in dfs], 0),
    dfs[0].index, dfs[0].columns)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Use concat on the list of dataframes indexed by Time, then group the combined dataframe by Time and take the mean.
In [275]: pd.concat([d.set_index('Time') for d in dfs]).groupby(level='Time').mean()
Out[275]:
25 50 75 100
Time
1 6.5 2.0 5.5 7.0
2 3.0 1.0 6.0 8.0
3 3.0 2.5 7.0 8.0
4 4.0 4.5 6.0 8.0
5 3.0 4.5 6.0 8.0
Or, since the Time column is common to both dataframes anyway, at least in this use case:
In [289]: pd.concat(dfs).groupby(level=0).mean()
Out[289]:
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Details
In [276]: dfs
Out[276]:
[ Time 25 50 75 100
0 1 6 2 6 7
1 2 3 2 6 8
2 3 3 2 8 8
3 4 5 2 6 8
4 5 3 9 6 8, Time 25 50 75 100
0 1 7 2 5 7
1 2 3 0 6 8
2 3 3 3 6 8
3 4 3 7 6 8
4 5 3 0 6 8]
In [277]: pd.concat([d.set_index('Time') for d in dfs])
Out[277]:
25 50 75 100
Time
1 6 2 6 7
2 3 2 6 8
3 3 2 8 8
4 5 2 6 8
5 3 9 6 8
1 7 2 5 7
2 3 0 6 8
3 3 3 6 8
4 3 7 6 8
5 3 0 6 8
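For comparison, the approaches from the answers can be run side by side. This is only a sketch on the two sample frames from the question; the variable names by_sum, by_numpy, by_position and by_time are hypothetical. The first three rely on the frames having identical shape and row labels, while the last one matches rows by their Time value.

```python
import numpy as np
import pandas as pd

cols = ["Time", "25", "50", "75", "100"]
l1 = [[1, 6, 2, 6, 7], [2, 3, 2, 6, 8], [3, 3, 2, 8, 8], [4, 5, 2, 6, 8], [5, 3, 9, 6, 8]]
l2 = [[1, 7, 2, 5, 7], [2, 3, 0, 6, 8], [3, 3, 3, 6, 8], [4, 3, 7, 6, 8], [5, 3, 0, 6, 8]]
dfs = [pd.DataFrame(l1, columns=cols), pd.DataFrame(l2, columns=cols)]

# Element-wise sum of the frames (sum starts from 0), divided by their count.
by_sum = sum(dfs) / len(dfs)

# Stack the raw arrays and average along the new leading axis.
by_numpy = pd.DataFrame(
    np.mean([d.values for d in dfs], axis=0),
    index=dfs[0].index, columns=dfs[0].columns)

# Concatenate vertically and average rows that share a row label.
by_position = pd.concat(dfs).groupby(level=0).mean()

# Concatenate Time-indexed frames and average rows that share a Time value.
by_time = pd.concat([d.set_index("Time") for d in dfs]).groupby(level="Time").mean()

print(by_time)
```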

pandas using melt to create lookup table

I have a dataframe df of size 24x13 which appears as (I have displayed truncated version of 24x13 array which represents 12 months and 24 hours):
HE 1 2 3 4
0 1 1.8 2.5 3.5 8.5
1 2 2.6 2.9 4.3 8.7
2 3 4.4 2.3 5.3 4.3
3 4 2.6 2.1 4.2 5.3
How do I change this into a lookup table with one row for each combination of hour and month, and the value in a third column, as follows:
Hour Month Value
1 1 1.8
1 2 2.5
1 3 3.5
I am trying the following and variation of it but this is not working:
pd.melt(df, id_vars=range(1,24), value_vars=range(1,12))
Edit 1:
df.columns
Index([u'HE', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype='object')
df.shape
(24, 13)
df.set_index('HE').stack().reset_index()
Output:
HE level_1 0
0 1 1 1.8
1 1 2 2.5
2 1 3 3.5
3 1 4 8.5
4 2 1 2.6
OR using melt
df.melt(id_vars='HE').sort_values(by=['HE','variable'])
Output:
HE variable value
0 1 1 1.8
4 1 2 2.5
8 1 3 3.5
12 1 4 8.5
1 2 1 2.6
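The attempt in the question passes row numbers to id_vars, but id_vars and value_vars both name columns, so 'HE' is the only identifier column here. To also get the Hour, Month and Value headers asked for, the melt output can be renamed and sorted. A sketch on the truncated sample, assuming the month columns are integer labels as shown in df.columns (lookup is a hypothetical name):

```python
import pandas as pd

# Truncated sample: HE is the hour, columns 1..4 are months.
df = pd.DataFrame({
    "HE": [1, 2, 3, 4],
    1: [1.8, 2.6, 4.4, 2.6],
    2: [2.5, 2.9, 2.3, 2.1],
    3: [3.5, 4.3, 5.3, 4.2],
    4: [8.5, 8.7, 4.3, 5.3],
})

lookup = (
    df.melt(id_vars="HE", var_name="Month", value_name="Value")
      .rename(columns={"HE": "Hour"})
      .astype({"Month": int})          # month labels come from the column names
      .sort_values(["Hour", "Month"])  # hour-major order, as in the question
      .reset_index(drop=True)
)
print(lookup.head())
```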

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep=r'\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now that that's out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. Consider what happens in this assignment:
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
The result of interval_mean.groupby(2, as_index=False).mean() has one row per distinct day of the year, so its indices run from 0 to roughly 364 (as_index=False makes the groupby operation create a new index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean. For the first interval (starting in 1974) those indices start at 0 and overlap with the indices of the groupby result, but for any later start year they begin further down hist_mean (roughly 365 for start = 1975) and count up from there, so almost no rows line up and the temperature cells become NaN.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why it is so useful.
I'll try to explain what happens with an example:
Assume we have the following DataFrame:
df = pd.DataFrame(np.reshape(np.random.randint(3, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the grouped column only because the random numbers drawn are between 0 and 2.) Now, when we assign to df.loc, pandas replaces every cell with the cell that has the same index and column label in the assigned frame, if such a cell exists. Otherwise, it writes NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NA to csv, it leaves the cell blank.
The last piece of the puzzle is how interval_mean preserved the original indices, but this is because slicing preserves the original indices:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
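One way to sidestep the alignment problem is to build the output from the groupby result itself instead of assigning it back into the sliced frame. The sketch below is an assumption about how that could look, not the original poster's final code: interval_daily_mean is a hypothetical helper, the sample data is made up for illustration, and it groups on both month (column 1) and day of year (column 2) so the month column survives.

```python
import pandas as pd

def interval_daily_mean(hist, start, end):
    """Per-day mean of all temperature columns over the years start..end."""
    subset = hist[(hist[0] >= start) & (hist[0] <= end)]
    # Drop the year, then average every remaining column per (month, day) pair.
    means = subset.drop(columns=0).groupby([1, 2], as_index=False).mean()
    # Label the rows with the interval, as the first column.
    means.insert(0, "period", "%s-%s" % (start, end))
    return means

# Tiny made-up sample: year, month, day of year, two temperature columns.
hist = pd.DataFrame([
    [1974, 1, 1, 5.0, 4.0],
    [1974, 1, 2, 3.0, 7.0],
    [1975, 1, 1, 7.0, 6.0],
    [1975, 1, 2, 5.0, 9.0],
    [1976, 1, 1, 9.0, 8.0],
    [1976, 1, 2, 1.0, 3.0],
])

first = interval_daily_mean(hist, 1974, 1975)
second = interval_daily_mean(hist, 1975, 1976)
print(second)
```

Because no assignment between differently indexed frames takes place, a later interval is filled in just like the first one. The result can then be written with to_csv(twenty_year_fn, sep='\t', header=False, index=False) as in the question.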