Sorting on multiindex level in pandas dataframe - python-2.7

I have the following pivot table
                        MQW              MSND       Grand Total
                 Amount($m)        Amount($m)        Amount($m)
                Total Count       Total Count       Total Count
Margin Call Date
2016-12-06      16.99     4        8.50     6       25.50    10
2016-12-07      11.24     4        8.55     6       19.79    10
2016-12-08       4.21     5        8.28     6       12.49    11
2016-12-09      23.29     7        8.08     6       31.37    13
2016-12-12       0.29     1        8.73     6        9.02     7
Total           56.03    21       42.14    30       98.18    51
with the structure
MultiIndex(levels=[[u' Grand Total', u'MSND', u'MQW'], [u'Amount($m)'], [u'Count', u'Total']],labels=[[2, 2, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0]])
and for the life of me I can't get the 'Count' and 'Total' columns to switch places using the .sortlevel method without also reversing the order of 'MQW', 'MSND', and 'Grand Total'. I've also tried setting sort_remaining=False, but it isn't working. This is what I'm trying to get:
                        MQW              MSND       Grand Total
                 Amount($m)        Amount($m)        Amount($m)
                Count Total       Count Total       Count Total
Margin Call Date
2016-12-06          4 13.99           6  7.50          10 35.50
2016-12-07          4  1.24           6 16.55          10  9.79
2016-12-08          5  7.21           6  0.28          11 22.49
2016-12-09          7 33.29           6  9.08          13 21.37
2016-12-12          1  0.29           6  8.73           7  9.02
Total              21 56.03          30 42.14          51 98.18
Any help would be much appreciated!

The following solution works; however, I believe an easier alternative should be possible.
First, create a new index inverting the level 2 labels like this:
idx = df.columns
new_idx1 = idx.set_levels(idx.levels[2][::-1], level=2)
# or, equivalently,
# new_idx1 = idx.set_levels(['Total', 'Count'], level=2)
or, perhaps better, change the codes of the labels:
new_idx2 = idx.set_labels(labels=[0, 1] * 3, level=2)
Note that the inner structure of new_idx2 differs from that of new_idx1, even though they appear to be the same. (Applying sortlevel to each will give different results.)
You can also create a new_idx from scratch with pd.MultiIndex, pd.MultiIndex.from_arrays or pd.MultiIndex.from_tuples.
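For example, here is a minimal sketch with pd.MultiIndex.from_tuples, reusing the level values shown in the question (note the leading space in ' Grand Total'):
import pandas as pd
# Build the target column index explicitly, with 'Count' before 'Total' in each group.
new_idx3 = pd.MultiIndex.from_tuples(
    [(grp, 'Amount($m)', stat)
     for grp in ['MQW', 'MSND', ' Grand Total']
     for stat in ['Count', 'Total']])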
And now reindex, for example:
df_sorted = df.reindex(columns=new_idx2)
df_sorted
Out[337]:
                    MQW           MSND      Grand Total
             Amount($m)     Amount($m)       Amount($m)
            Count Total    Count Total      Count Total
2016-12-06      4 13.99        6  7.50         10 35.50
2016-12-07      4  1.24        6 16.55         10  9.79
2016-12-08      5  7.21        6  0.28         11 22.49
2016-12-09      7 33.29        6  9.08         13 21.37
2016-12-12      1  0.29        6  8.73          7  9.02
Total          21 56.03       30 42.14         51 98.18

Related

Subtracting values based on a index column and using a condition in the same column in DAX

I've seen a lot of material on Stack about this, but I'm still not able to reproduce it.
Sample data set.
Asset  Value  Index
A      10     1
B      15     1
C      20     1
A      11     2
B      17     2
C      24     2
A      18     3
B      25     3
C      30     3
What I want to do is subtract each Asset's values individually, based on the Index column.
Ex:
Asset A [1] -> 10
Asset A [2] -> 11
11 - 10 = 1
So the table would be like this.
Asset  Value  Index  Diff
A      10     1      0
B      15     1      0
C      20     1      0
A      11     2      1
B      17     2      2
C      24     2      4
A      18     3      7
B      25     3      8
C      30     3      6
This needs to be done using DAX.
Can you guys help me?
Best regards!
I just did this and it worked.
Diff =
VAR Assets = 'Table'[Asset]
VAR Ind = 'Table'[Index] - 1
RETURN
    IF(
        Ind = 0,
        0,
        'Table'[Value]
            - CALCULATE(
                SUM('Table'[Value]),
                FILTER('Table', 'Table'[Asset] = Assets && 'Table'[Index] = Ind)
            )
    )
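For comparison only (a pandas sketch of the same logic, not part of the DAX answer), the per-asset difference is a grouped diff; column names follow the sample table:
import pandas as pd

df = pd.DataFrame({'Asset': ['A', 'B', 'C'] * 3,
                   'Value': [10, 15, 20, 11, 17, 24, 18, 25, 30],
                   'Index': [1, 1, 1, 2, 2, 2, 3, 3, 3]})
# Difference from the same asset's value at the previous index; 0 at the first index.
df['Diff'] = (df.sort_values('Index')
                .groupby('Asset')['Value']
                .diff()
                .fillna(0)
                .astype(int))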

Get the max of the average for each group

I have the following table:
EmpId  DeptId  WeekNumber  Month  NumberofCalls
1      3       4           1      34
2      3       2           3      59
I created a measure to calculate the average number of calls:
AvgCalls = AVERAGE('MyTable'[NumberofCalls])
Now I want to get the maximum average calls by month and week.
I will have filters on:
Month
Week
Once I select the Month and the Week, I want the histogram to display the code of the employee (W1, W2, W3...) having the maximum average calls. In my case I get all the employees instead of only the employee with the maximum average.
Here is my solution. I tested it with a random dataset; here is my data:
EmpId  DeptId  WeekNumber  Month  NumberofCalls
Emp01  3       W4          1      34
Emp01  3       W2          3      59
Emp02  3       W5          4      68
Emp02  3       W6          4      76
Emp03  3       W10         5      90
Emp04  4       W10         6      98
Emp04  4       W11         6      45
Emp05  4       W12         7      56
Emp06  4       W13         7      23
Emp07  4       W15         9      45
Emp08  4       W34         8      56
Emp09  4       W52         8      44
Emp05  4       W36         9      23
Emp01  4       W17         10     51
Emp02  4       W23         9      67
Emp06  4       W29         11     28
Emp05  4       W34         12     34
Emp07  4       W41         11     21
Emp04  4       W37         12     33
I wrote this measure using the iterator function ADDCOLUMNS:
MaxAverageEmployer =
VAR TAvgCalls =
    ADDCOLUMNS(
        SUMMARIZE(MyTable, MyTable[EmpId], MyTable[Month ], MyTable[WeekNumber ]),
        "AvgCall", CALCULATE(AVERAGE('MyTable'[NumberofCalls]))
    )
VAR TMaxAvgCalls =
    ADDCOLUMNS(
        TAvgCalls,
        "MaxAvg", CALCULATE(MAXX(TAvgCalls, [AvgCall]))
    )
VAR MaxEmpID =
    ADDCOLUMNS(
        TMaxAvgCalls,
        "MaxEmp", CALCULATE(VALUES(MyTable[EmpId]), FILTER(TMaxAvgCalls, [AvgCall] = [MaxAvg]))
    )
RETURN
    MAXX(MaxEmpID, [MaxEmp])
It showed nothing when I tried it on a histogram (bar chart) visual, but it gave me correct values on a table visual:
WeekNumber: I put it on Rows
MonthNumber: I put it on a Slicer to filter it!
Here is the final solution, and I hope it is what you are looking for!
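As an aside, the logic of the measure can be sketched in pandas for cross-checking (an illustration under the column names of the sample data, not part of the DAX solution): average the calls per employee within the current Month/Week selection, then return the employee with the highest average.
import pandas as pd

def max_avg_employee(df, month=None, week=None):
    # Apply the slicer-style filters, then average per employee.
    sel = df
    if month is not None:
        sel = sel[sel['Month'] == month]
    if week is not None:
        sel = sel[sel['WeekNumber'] == week]
    return sel.groupby('EmpId')['NumberofCalls'].mean().idxmax()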

Get previous row index in Google Sheets where a certain column value is zero

Consider a sheet like:
rowNr | Another Col | Filled  | Cumul. Size
0       2             -1000     -1000
1       3              1000         0
2       1             -5000     -5000
3       4              5000         0
4       5            -10000    -10000
5       2            -10000    -20000
6       1            -20000    -40000
7       4             40000         0
The 'Cumul. Size' column displays the cumulative sum of the 'Filled' column.
Each time 'Cumul. Size' = 0, I need to calculate the sum of 'Another Col' over all previous rows back to the last row where 'Cumul. Size' was 0. For rows where 'Cumul. Size' != 0, display '' (blank).
So something like this:
rowNr | Another Col | Filled  | Cumul. Size | calculated
0       2             -1000     -1000
1       3              1000         0         5
2       1             -5000     -5000
3       4              5000         0         5
4       5            -10000    -10000
5       2            -10000    -20000
6       1            -20000    -40000
7       4             40000         0         12
I'm sure I can create something that works, as long as I can find a function with a signature similar to findPreviousRowIndex(curRowIndex, whereCondition).
Any pointers much appreciated
EDIT
Link To example Google Sheet
Paste this in cell D2 and drag down:
=ARRAYFORMULA(IF(LEN(A2), IF(C2=0, SUM(INDIRECT(ADDRESS(IFERROR(MAX(IF(
INDIRECT("C1:C"&ROW()-1)=0, ROW(A:A), ))+1, 2), 1, 4)&":A"&ROW())), ), ))

How to retain calculated values between rows when calculating running totals?

I have a tricky question about conditional sums in SAS. It is complicated enough that I can't easily explain it in words, so I'll show an example:
A B
5 3
7 2
8 6
6 4
9 5
8 2
3 1
4 3
As you can see, I have a dataset with two columns. First, I calculated the conditional cumulative sum of column A (I can do this step myself, so no help is needed there):
A  B  CA
5  3   5
7  2  12
8  6  18
6  4   8    ((12+8)-18)+6
9  5  17
8  2  18
3  1  10    ((17+8)-18)+3
4  3  14
My condition value is 18: whenever the cumulative sum exceeds 18, it is capped at 18, and the excess over 18 is carried into the next row's sum. (As I said, I can do this part myself.)
So the tricky part is I have to calculate the cumulative sum of column B according to column A:
A  B  CA  CB
5  3   5   3
7  2  12   5
8  6  18   9.5     (5+(6*((18-12)/8)))
6  4   8   5.5     ((5+6)-9.5)+4
9  5  17  10.5     (5.5+5)
8  2  18  10.75    (10.5+(2*((18-17)/8)))
3  1  10   2.75    ((10.5+2)-10.75)+1
4  3  14   5.75    (2.75+3)
As you can see from the example, the cumulative sum of column B is very specific: when column CA hits the condition value (18), we calculate what proportion of the last A value was needed to reach 18, and apply that same proportion to B when computing its cumulative sum.
It looks like when the sum of A reaches 18 or more, you want to split the values of A and B between the current and the next record. One way is to remember the leftover values for A and B and carry them forward in your new cumulative variables. Just make sure to output the observation before resetting those variables.
data want ;
  set have ;
  ca + a ;   /* retained running sums */
  cb + b ;
  if ca >= 18 then do ;
    extra_a = ca - 18 ;                   /* excess of A over the cap */
    extra_b = b - b*((a - extra_a)/a) ;   /* matching share of B */
    ca = 18 ;
    cb = cb - extra_b ;
  end ;
  output ;   /* write the row before resetting */
  if ca = 18 then do ;   /* start the next row from the leftovers */
    ca = extra_a ;
    cb = extra_b ;
  end ;
  drop extra_a extra_b ;
run ;
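For readers without SAS, here is a rough Python equivalent of the same data step logic (a sketch, assuming the cap value 18 and columns a and b as above):
def capped_running_totals(rows, cap=18):
    ca = cb = 0
    out = []
    for a, b in rows:
        ca += a
        cb += b
        if ca >= cap:
            extra_a = ca - cap                     # excess of A over the cap
            extra_b = b - b * ((a - extra_a) / a)  # matching share of B
            out.append((cap, cb - extra_b))
            ca, cb = extra_a, extra_b              # carry leftovers forward
        else:
            out.append((ca, cb))
    return out

rows = [(5, 3), (7, 2), (8, 6), (6, 4), (9, 5), (8, 2), (3, 1), (4, 3)]
for ca, cb in capped_running_totals(rows):
    print(ca, cb)
# 5 3, 12 5, 18 9.5, 8 5.5, 17 10.5, 18 10.75, 10 2.75, 14 5.75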

Merge pandas DataFrames based on irregular time intervals

I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time stamped data points (value col).
import pandas as pd
import numpy as np
data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})
The other has time interval information (start_time, end_time, and associated interval_id).
intervals = pd.DataFrame(
    {'interval_id': np.arange(9),
     'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
     'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})
I'd like to merge these two dataframes more efficiently than the for loop below:
data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    in_interval = (data['time'] >= ser['start_time']) & \
                  (data['time'] <= ser['end_time'])
    data['interval_id'][in_interval] = ser['interval_id']
result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.
Example result:
time value interval_id start_time end_time
0 0.575976 0.022727 NaN NaN NaN
1 4.607545 0.222568 0 3.618715 8.294847
2 5.179350 0.438052 0 3.618715 8.294847
3 11.069956 0.641269 1 10.301728 19.870283
4 12.387854 0.344192 1 10.301728 19.870283
5 18.889691 0.582946 1 10.301728 19.870283
6 20.850469 -0.027436 NaN NaN NaN
7 23.199618 0.731316 2 21.488868 28.968338
8 26.631284 0.570647 2 21.488868 28.968338
9 26.996397 0.597035 2 21.488868 28.968338
10 28.601867 -0.131712 2 21.488868 28.968338
11 28.660986 0.710856 2 21.488868 28.968338
12 28.875395 -0.355208 2 21.488868 28.968338
13 28.959320 -0.430759 2 21.488868 28.968338
14 29.702800 -0.554742 NaN NaN NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
You could use np.searchsorted to find the indices representing where each value in data['time'] would fit between intervals['start_time']. Then you could call np.searchsorted again to find the indices representing where each value in data['time'] would fit between intervals['end_time']. Note that using np.searchsorted relies on interval['start_time'] and interval['end_time'] being in sorted order.
For each corresponding location in the arrays, where these two indices are equal, data['time'] fits in between interval['start_time'] and interval['end_time']. Note that this relies on the intervals being disjoint.
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np
np.random.seed(1)
data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})
intervals = pd.DataFrame(
    {'interval_id': np.arange(9),
     'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
     'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = (data['time'] >= ser['start_time']) & \
                      (data['time'] <= ser['end_time'])
        data['interval_id'][in_interval] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
    return result

def using_searchsorted():
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values) - 1
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    result.ix[mask, 'interval_id'] = start_idx[mask]
    result.ix[mask, 'start_time'] = intervals['start_time'][start_idx[mask]].values
    result.ix[mask, 'end_time'] = intervals['end_time'][end_idx[mask]].values
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
You may want to specify the 'time' intervals slightly differently, but this should give you a start.
In [34]: data['on'] = np.round(data['time']/10)
In [35]: data.merge(intervals,left_on=['on'],right_on=['interval_id'],how='outer')
Out[35]:
time value on end_time interval_id start_time
0 1.301658 -0.462594 0 7.630243 0 0.220746
1 2.202654 0.054903 0 7.630243 0 0.220746
2 10.253593 0.329947 1 17.715596 1 10.299464
3 13.803064 -0.601021 1 17.715596 1 10.299464
4 17.086290 0.484119 2 27.175455 2 24.710704
5 21.797655 0.988212 2 27.175455 2 24.710704
6 26.265165 0.491410 3 37.702968 3 30.670753
7 27.777182 -0.121691 3 37.702968 3 30.670753
8 34.066473 0.659260 3 37.702968 3 30.670753
9 34.786337 -0.230026 3 37.702968 3 30.670753
10 35.343021 0.364505 4 49.489028 4 42.948486
11 35.506895 0.953562 4 49.489028 4 42.948486
12 36.129951 -0.703457 4 49.489028 4 42.948486
13 38.794690 -0.510535 4 49.489028 4 42.948486
14 40.508702 -0.763417 4 49.489028 4 42.948486
15 43.974516 -0.149487 4 49.489028 4 42.948486
16 46.219554 0.893025 5 57.086065 5 53.124795
17 50.206860 0.729106 5 57.086065 5 53.124795
18 50.395082 -0.807557 5 57.086065 5 53.124795
19 50.410783 0.996247 5 57.086065 5 53.124795
20 51.602892 0.144483 5 57.086065 5 53.124795
21 52.006921 -0.979778 5 57.086065 5 53.124795
22 52.682896 -0.593500 5 57.086065 5 53.124795
23 52.836037 0.448370 5 57.086065 5 53.124795
24 53.052130 -0.227245 5 57.086065 5 53.124795
25 57.169775 0.659673 6 65.927106 6 61.590948
26 59.336176 -0.893004 6 65.927106 6 61.590948
27 60.297771 0.897418 6 65.927106 6 61.590948
28 61.151664 0.176229 6 65.927106 6 61.590948
29 61.769023 0.894644 6 65.927106 6 61.590948
30 64.221220 0.893012 6 65.927106 6 61.590948
31 67.907417 -0.859734 7 78.192671 7 72.463468
32 71.460483 -0.271364 7 78.192671 7 72.463468
33 74.514028 0.621174 7 78.192671 7 72.463468
34 75.822643 -0.351684 8 88.820139 8 83.183825
35 84.252778 -0.685043 8 88.820139 8 83.183825
36 84.838361 0.354365 8 88.820139 8 83.183825
37 85.770611 -0.089678 9 NaN NaN NaN
38 85.957559 0.649995 9 NaN NaN NaN
39 86.498339 0.569793 9 NaN NaN NaN
40 91.006735 0.731006 9 NaN NaN NaN
41 91.941862 0.964376 9 NaN NaN NaN
42 94.617522 0.626889 9 NaN NaN NaN
43 95.318288 -0.088918 10 NaN NaN NaN
44 95.595243 0.539685 10 NaN NaN NaN
45 95.818267 -0.989647 10 NaN NaN NaN
46 98.240444 0.931445 10 NaN NaN NaN
47 98.722869 0.442502 10 NaN NaN NaN
48 99.349198 0.585264 10 NaN NaN NaN
49 99.829372 -0.743697 10 NaN NaN NaN
[50 rows x 6 columns]