Get Percentiles per group in PowerBI


I am trying to calculate percentiles for Bid/Ask prices for a large group of Bond ISINs. Specifically I have my data formatted like below:
Securities    PxBid       PxAsk       PxMid       GroupID  Q1        Q3
AT0000A04967  113.598     114.198     113.898     1        113.7073  114.0221
AT0000A04967  113.684     114.152     113.918     1        113.7073  114.0221
AT0000A04967  113.878453  114.090701  113.984577  1        113.7073  114.0221
AT0000A04967  113.777     114.239     114.008     1        113.7073  114.0221
AT0000A04967  113.809     114.209     114.009     1        113.7073  114.0221
AT0000A04967  113.53      114.53      114.03      1        113.7073  114.0221
AT0000A04967  113.795     114.295     114.045     1        113.7073  114.0221
AT0000A04967  114.07      114.07      114.07      1        113.7073  114.0221
AT0000A04967  114.1       114.1       114.1       1        113.7073  114.0221
AT0000A04967  114.105     114.185     114.145     1        113.7073  114.0221
AT0000A0U3T4  100.355     100.355     100.355     2        100.2763  100.3445
AT0000A0U3T4  100.257     100.457     100.357     2        100.2763  100.3445
AT0000A0U3T4  100.28      100.435     100.358     2        100.2763  100.3445
AT0000A0U3T4  100.284     100.434     100.359     2        100.2763  100.3445
AT0000A0U3T4  100.275     100.443     100.359     2        100.2763  100.3445
AT0000A0U3T4  98.86       101.86      100.36      2        100.2763  100.3445
AT0000A0U3T4  100.311     100.411     100.361     2        100.2763  100.3445
AT0000A0U3T4  100.313055  100.411003  100.362029  2        100.2763  100.3445
AT0000A0U3T4  100.37      100.37      100.37      2        100.2763  100.3445
AT0000A0U3T4  100.3748    100.3948    100.3848    2        100.2763  100.3445
I want to calculate the 25th and 75th percentile per ISIN, e.g. for the group as shown in the table above. I have tried using the below formula:
Q1 = PERCENTILEX.INC(
    ALLSELECTED(CleansedBenchmark[Securities]),
    CleansedBenchmark[PxBid],
    0.25)
But that just gives me the same bid for each row:
Securities    PxBid       PxAsk       PxMid       GroupID  Q1
AT0000A04967  113.598     114.198     113.898     1        113.598
AT0000A04967  113.684     114.152     113.918     1        113.684
AT0000A04967  113.878453  114.090701  113.984577  1        113.8785
AT0000A04967  113.777     114.239     114.008     1        113.777
AT0000A04967  113.809     114.209     114.009     1        113.809
AT0000A04967  113.53      114.53      114.03      1        113.53
AT0000A04967  113.795     114.295     114.045     1        113.795
AT0000A04967  114.07      114.07      114.07      1        114.07
AT0000A04967  114.1       114.1       114.1       1        114.1
AT0000A04967  114.105     114.185     114.145     1        114.105
AT0000A0U3T4  100.355     100.355     100.355     2        100.355
AT0000A0U3T4  100.257     100.457     100.357     2        100.257
AT0000A0U3T4  100.28      100.435     100.358     2        100.28
AT0000A0U3T4  100.284     100.434     100.359     2        100.284
AT0000A0U3T4  100.275     100.443     100.359     2        100.275
AT0000A0U3T4  98.86       101.86      100.36      2        98.86
AT0000A0U3T4  100.311     100.411     100.361     2        100.311
AT0000A0U3T4  100.313055  100.411003  100.362029  2        100.3131
AT0000A0U3T4  100.37      100.37      100.37      2        100.37
AT0000A0U3T4  100.3748    100.3948    100.3848    2        100.3748
I'm sure I am missing something silly here, so any help is appreciated!
Ideally I want this in the same table, but maybe it's more efficient to create a new table to store the results per ISIN, I'm not sure.

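If this is a calculated column you're adding to the table, use the following code. CALCULATE turns the current row into a filter context, and ALLEXCEPT then removes every filter except the one on Securities, so PERCENTILEX.INC iterates over all rows that share the current row's ISIN.
Q1 =
CALCULATE(
    PERCENTILEX.INC(
        CleansedBenchmark,
        CleansedBenchmark[PxBid],
        0.25
    ),
    ALLEXCEPT(CleansedBenchmark, CleansedBenchmark[Securities])
)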
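To sanity-check the per-ISIN quartiles outside Power BI, here is a minimal Python sketch using only the standard library. It assumes that the "inclusive" interpolation of statistics.quantiles corresponds to PERCENTILEX.INC; on the sample data that assumption appears to hold. The helper name quartiles_by_group is hypothetical, and the rows are the Securities/PxBid pairs from the question's table.

```python
from collections import defaultdict
from statistics import quantiles

# (Securities, PxBid) pairs copied from the sample table in the question.
rows = [
    ("AT0000A04967", 113.598), ("AT0000A04967", 113.684),
    ("AT0000A04967", 113.878453), ("AT0000A04967", 113.777),
    ("AT0000A04967", 113.809), ("AT0000A04967", 113.53),
    ("AT0000A04967", 113.795), ("AT0000A04967", 114.07),
    ("AT0000A04967", 114.1), ("AT0000A04967", 114.105),
    ("AT0000A0U3T4", 100.355), ("AT0000A0U3T4", 100.257),
    ("AT0000A0U3T4", 100.28), ("AT0000A0U3T4", 100.284),
    ("AT0000A0U3T4", 100.275), ("AT0000A0U3T4", 98.86),
    ("AT0000A0U3T4", 100.311), ("AT0000A0U3T4", 100.313055),
    ("AT0000A0U3T4", 100.37), ("AT0000A0U3T4", 100.3748),
]


def quartiles_by_group(pairs):
    """Return {key: (Q1, Q3)} using inclusive linear interpolation."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        # n=4 yields the three cut points Q1, median, Q3.
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        result[key] = (q1, q3)
    return result


for isin, (q1, q3) in quartiles_by_group(rows).items():
    print(isin, q1, q3)
```

The printed values should agree with the Q1 and Q3 columns of the expected table up to rounding.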

Related

Creating a boolean panda dataframe from a list within a list

[(['Piano'], 'Beethoven - opus22 4.mid'), (['Piano'], 'Borodin - ps7.mid'), (['Piano'], 'Chopin - op18.mid'), ([None, 'Guitar', 'StringInstrument', 'Acoustic Bass'], 'Cyndi Lauper - True Colors.mid'), (['Piano', 'Fretless Bass', 'StringInstrument', None], 'Frank Mills - Musicbox Dancer.mid'), (['Piano', 'Acoustic Bass', None, 'Baritone Saxophone'], 'George Benson - On Broadway.mid'), (['Piano'], 'Grieg - Voeglein.mid'), (['Piano'], 'Mozart - 333 3.mid'), ([None, 'Pan Flute', 'Piano', 'Piccolo', 'Violin'], 'The Corrs - Dreams.mid'), (['Piano', None, 'Fretless Bass'], 'ABBA - Money Money Money.mid')]
The list above pairs each song with the instruments used in it. I want to build a boolean pandas DataFrame from these songs, with the None instrument removed. The original post showed the desired DataFrame as an image.
I tried to build a DataFrame for every single instrument and merge them, but this did not produce the desired DataFrame.

Try:
import pandas as pd

lst = [
    (["Piano"], "Beethoven - opus22 4.mid"),
    (["Piano"], "Borodin - ps7.mid"),
    (["Piano"], "Chopin - op18.mid"),
    (
        [None, "Guitar", "StringInstrument", "Acoustic Bass"],
        "Cyndi Lauper - True Colors.mid",
    ),
    (
        ["Piano", "Fretless Bass", "StringInstrument", None],
        "Frank Mills - Musicbox Dancer.mid",
    ),
    (
        ["Piano", "Acoustic Bass", None, "Baritone Saxophone"],
        "George Benson - On Broadway.mid",
    ),
    (["Piano"], "Grieg - Voeglein.mid"),
    (["Piano"], "Mozart - 333 3.mid"),
    (
        [None, "Pan Flute", "Piano", "Piccolo", "Violin"],
        "The Corrs - Dreams.mid",
    ),
    (["Piano", None, "Fretless Bass"], "ABBA - Money Money Money.mid"),
]

# Build one dict per song: the title plus a 1 for every instrument present.
all_data = []
for instruments, title in lst:
    d = {"title": title}
    for i in instruments:
        if i is not None:  # skip the None placeholders
            d[i] = 1
    all_data.append(d)

# Instruments missing from a song come out as NaN; fill them with 0.
df = pd.DataFrame(all_data).fillna(0).set_index("title").astype(int)
df.index.name = None
print(df)
Prints:
                                   Piano  Guitar  StringInstrument  Acoustic Bass  Fretless Bass  Baritone Saxophone  Pan Flute  Piccolo  Violin
Beethoven - opus22 4.mid               1       0                 0              0              0                   0          0        0       0
Borodin - ps7.mid                      1       0                 0              0              0                   0          0        0       0
Chopin - op18.mid                      1       0                 0              0              0                   0          0        0       0
Cyndi Lauper - True Colors.mid         0       1                 1              1              0                   0          0        0       0
Frank Mills - Musicbox Dancer.mid      1       0                 1              0              1                   0          0        0       0
George Benson - On Broadway.mid        1       0                 0              1              0                   1          0        0       0
Grieg - Voeglein.mid                   1       0                 0              0              0                   0          0        0       0
Mozart - 333 3.mid                     1       0                 0              0              0                   0          0        0       0
The Corrs - Dreams.mid                 1       0                 0              0              0                   0          1        1       1
ABBA - Money Money Money.mid           1       0                 0              0              1                   0          0        0       0

Transform a list to a list of average values (by step)

I have a two dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step, so that for step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.

import pandas as pd
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)
df['group'] = (df[0] / 10).astype(int)
Now df is:
      0       1  group
0  12.2    5325      1
1  13.4  235326      1
2  15.9  235326      1
3  17.7   53521      1
4  21.3   42342      2
5  22.6    6546      2
6  25.9   34634      2
7  27.2  523523      2
8  33.4  235325      3
9  36.2  235352      3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
           0          1
group
1      14.80  132374.50
2      24.25  151761.25
3      34.80  235338.50
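The grouped frame above labels the buckets 1, 2, 3, whereas the question asked for a nested list labelled 10, 20, 30. A minimal sketch without pandas that returns that shape could look like the following; the function name average_by_step is hypothetical, and it assumes every item has the [[x], [y]] layout shown in the question.

```python
from collections import defaultdict

data = [
    [[12.2], [5325]], [[13.4], [235326]], [[15.9], [235326]],
    [[17.7], [53521]], [[21.3], [42342]], [[22.6], [6546]],
    [[25.9], [34634]], [[27.2], [523523]], [[33.4], [235325]],
    [[36.2], [235352]],
]


def average_by_step(pairs, step=10):
    """Average the second values, bucketed by the first value floored to `step`."""
    buckets = defaultdict(list)
    for (x,), (y,) in pairs:
        # Floor x to the start of its bucket, e.g. 17.7 -> 10 for step=10.
        buckets[int(x // step) * step].append(y)
    return [[[start], [sum(ys) / len(ys)]] for start, ys in sorted(buckets.items())]


print(average_by_step(data, step=10))
```

Because the buckets live in a dictionary, it does not matter that the number of items per bucket varies.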

Cumsum entire table and reset at zero

I have the following data frame.
d = pd.DataFrame({'one' : [0,1,1,1,0,1],'two' : [0,0,1,0,1,1]})
d
   one  two
0    0    0
1    1    0
2    1    1
3    1    0
4    0    1
5    1    1
I want a cumulative sum that resets at zero.
The desired output should be
pd.DataFrame({'one' : [0,1,2,3,0,1],'two' : [0,0,1,0,1,2]})
   one  two
0    0    0
1    1    0
2    2    1
3    3    0
4    0    1
5    1    2
I have tried using groupby, but it does not work for the entire table.

# d is the data frame from the question; each zero starts a new group within a column
df2 = d.apply(lambda x: x.groupby((~x.astype(bool)).cumsum()).cumsum())
print(df2)
Output:
   one  two
0    0    0
1    1    0
2    2    1
3    3    0
4    0    1
5    1    2

pandas
def cum_reset_pd(df):
    # Running total minus the running total at the most recent zero.
    csum = df.cumsum()
    return (csum - csum.where(df == 0).ffill()).astype(df.dtypes)
cum_reset_pd(d)
   one  two
0    0    0
1    1    0
2    2    1
3    3    0
4    0    1
5    1    2
numpy
import numpy as np

def cum_reset_np(df):
    v = df.values
    z = np.zeros_like(v)
    # Positions of the non-zero cells, ordered column by column.
    j, i = np.where(v.T)
    r = np.arange(1, i.size + 1)
    # A new run starts wherever the row index jumps or the column changes.
    p = np.where(
        np.append(False, (np.diff(i) != 1) | (np.diff(j) != 0))
    )[0]
    b = np.append(0, np.append(p, r.size))
    z[i, j] = r - b[:-1].repeat(np.diff(b))
    return pd.DataFrame(z, df.index, df.columns)
cum_reset_np(d)
   one  two
0    0    0
1    1    0
2    2    1
3    3    0
4    0    1
5    1    2
Why go through this trouble?
Because it's quicker!

This one is without using Pandas, but using NumPy and list comprehensions:
import numpy as np
d = {'one': [0,1,1,1,0,1], 'two': [0,0,1,0,1,1]}
out = {}
for key in d.keys():
    l = d[key]
    # Positions of the zeros, plus the end of the list as a final boundary.
    indices = np.argwhere(np.array(l)==0).flatten()
    indices = np.append(indices, len(l))
    out[key] = np.concatenate([np.cumsum(l[indices[n-1]:indices[n]]) \
                               for n in range(1, indices.shape[0])]).ravel()
print(out)
First, I find all occurrences of 0 (the positions at which to split the lists), then I calculate the cumsum of each resulting sublist and insert the results into a new dict.

This should do it:
import pandas as pd

d = {'one' : [0,1,1,1,0,1],'two' : [0,0,1,0,1,1]}
one = d['one']
two = d['two']
# Running total for column 'one'; a zero resets it.
i = 0
new_one = []
for item in one:
    if item == 0:
        i = 0
    else:
        i += item
    new_one.append(i)
# Same again for column 'two'.
j = 0
new_two = []
for item in two:
    if item == 0:
        j = 0
    else:
        j += item
    new_two.append(j)
d['one'], d['two'] = new_one, new_two
df = pd.DataFrame(d)
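The two loops above differ only in the column they read, so they can be folded into one helper. The following is a sketch using itertools.accumulate from the standard library; the name cumsum_reset is hypothetical.

```python
from itertools import accumulate


def cumsum_reset(values):
    """Running sum of `values` that drops back to 0 at every zero."""
    # accumulate passes the running total and the next item to the lambda.
    return list(accumulate(values, lambda total, x: 0 if x == 0 else total + x))


d = {'one': [0, 1, 1, 1, 0, 1], 'two': [0, 0, 1, 0, 1, 1]}
result = {name: cumsum_reset(column) for name, column in d.items()}
print(result)
```

The resulting dict can be passed to pd.DataFrame in the same way as in the answer above.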

Replacing multiple values per cell in Pandas

I have the following column in a dataframe:
Q2
1 4
1 3
3 4 11
1 4 6 15 16
I want to replace multiple values in a cell, if present: 1 gets replaced by Facebook, 2 by Instagram, and so on.
I split the values as follows:
columns_to_split = ['Q2']
for c in columns_to_split:
    df[c] = df[c].str.split(' ')
which outputs
code
DSOKF31 [1, 4]
DSOVH39 [1, 3]
DSOVH05 [3, 4, 16]
DSOVH23 [1, 4, 6, 15, 16]
Name: Q2, dtype: object
but when trying to replace the multiple values with a dictionary as follows:
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2)
I get the same output:
code
DSOKF31 [1, 4]
DSOVH39 [1, 3]
DSOVH05 [3, 4, 16]
DSOVH23 [1, 4, 6, 15, 16]
Name: Q2, dtype: object
How do I replace multiple values in one cell in this case?

Since the number of items is varying, there isn't a lot of structure. Still, after you split the string, you can apply a function that maps a list into dictionary values:
In [36]: df = pd.DataFrame({'Q2': ['1 4', '1 3', '1 2 3']})
In [37]: df.Q2.str.split(' ').apply(lambda l: [social_media_2[e] for e in l])
Out[37]:
0 [Facebook, Messenger (Google hangout, Tagg, Wh...
1 [Facebook, Twitter]
2 [Facebook, Instagram, Twitter]
Name: Q2, dtype: object
Edit: following Jezrael's excellent comment, here's a version that accounts for missing values as well:
In [58]: df = pd.DataFrame({'Q2': ['1 4', '1 3', '1 2 3', None]})
In [59]: df.Q2.str.split(' ').apply(lambda l: [] if type(l) != list else [social_media_2[e] for e in l])
Out[59]:
0 [Facebook, Messenger (Google hangout, Tagg, Wh...
1 [Facebook, Twitter]
2 [Facebook, Instagram, Twitter]
3 []
Name: Q2, dtype: object
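The lambda above raises a KeyError as soon as a cell contains a code that is missing from the dictionary. A small named helper can cover both that case and missing cells. This is only a sketch: map_codes is a hypothetical name, the dictionary is a shortened stand-in for social_media_2, and keeping unknown codes unchanged is an assumption about the desired behaviour.

```python
def map_codes(cell, mapping):
    """Translate a space-separated string of codes into a list of labels."""
    if not isinstance(cell, str):
        # None / NaN cells become an empty list.
        return []
    # dict.get falls back to the raw code when it is not in the mapping.
    return [mapping.get(code, code) for code in cell.split()]


# Shortened stand-in for the social_media_2 dictionary from the question.
social_media = {'1': 'Facebook', '2': 'Instagram', '3': 'Twitter'}

cells = ['1 3', '1 2 3', None, '2 99']
print([map_codes(cell, social_media) for cell in cells])
```

With pandas, the helper could be applied along the lines of df['Q2'].apply(map_codes, mapping=social_media_2), since Series.apply forwards extra keyword arguments to the function.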

Here is an alternative solution:
In [45]: df
Out[45]:
Q2
0 1 4
1 1 3
2 3 4 16
3 1 4 6 15 16
In [47]: (df.Q2.str.split(expand=True)
....: .stack()
....: .map(social_media_2)
....: .unstack()
....: .apply(lambda x: x.dropna().values.tolist(), axis=1)
....: )
Out[47]:
0 [Facebook, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)]
1 [Facebook, Twitter]
2 [Twitter, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO), Gmail]
3 [Facebook, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO), Imo, Yahoo, Gmail]
dtype: object
Explanation:
In [50]: df.Q2.str.split(expand=True).stack().map(social_media_2)
Out[50]:
0 0 Facebook
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
1 0 Facebook
1 Twitter
2 0 Twitter
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
2 Gmail
3 0 Facebook
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
2 Imo
3 Yahoo
4 Gmail
dtype: object
In [51]: df.Q2.str.split(expand=True).stack().map(social_media_2).unstack()
Out[51]:
0 1 2 3 4
0 Facebook Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) None None None
1 Facebook Twitter None None None
2 Twitter Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) Gmail None None
3 Facebook Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) Imo Yahoo Gmail
Timing against 40K rows DF:
In [86]: big = pd.concat([df] * 10**4, ignore_index=True)
In [87]: big.shape
Out[87]: (40000, 1)
In [88]: %%timeit
....: (big.Q2.str.split(expand=True)
....: .stack()
....: .map(social_media_2)
....: .unstack()
....: .apply(lambda x: x.dropna().values.tolist(), axis=1)
....: )
....:
1 loop, best of 3: 19.6 s per loop
In [89]: %timeit big.Q2.str.split(' ').apply(lambda l: [social_media_2[e] for e in l])
10 loops, best of 3: 72.6 ms per loop
Conclusion: Ami's solution is approx. 270 times faster!

If you don't need lists as output, just add regex=True to replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Q2': ['1 4', '1 3', '3 4 11']})
print (df)
Q2
0 1 4
1 1 3
2 3 4 11
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Facebook Messenger (Google hangout, Tagg, What...
1 Facebook Twitter
2 Twitter Messenger (Google hangout, Tagg, Whats...
If you need lists, use one of the other solutions.
EDIT by comment:
You can replace the whitespace with ; and then it works nicely:
df = pd.DataFrame({'Q2': ['1 4', '1 3', '3 4 11']})
print (df)
Q2
0 1 4
1 1 3
2 3 4 11
df['Q2'] = df['Q2'].str.replace(' ',';')
print (df)
Q2
0 1;4
1 1;3
2 3;4;11
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Facebook;Messenger (Google hangout, Tagg, What...
1 Facebook;Twitter
2 Twitter;Messenger (Google hangout, Tagg, Whats...
EDIT1:
You can also change the dict slightly by appending ; to each key, after replacing every space with a double ;:
df = pd.DataFrame({'Q2': ['1 2', '1 3', '3 2 11']})
print (df)
Q2
0 1 2
1 1 3
2 3 2 11
df['Q2'] = df['Q2'].str.replace(' ',';;') + ';'
print (df)
Q2
0 1;;2;
1 1;;3;
2 3;;2;;11;
social_media_2 = {'1':'Fa', '2':'I', '3':'T', '11':'KL'}
#add ; to keys in dict
social_media_2 = dict((key + ';', value) for (key, value) in social_media_2.items())
print (social_media_2)
{'1;': 'Fa', '2;': 'I', '3;': 'T', '11;': 'KL'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Fa;I
1 Fa;T
2 T;I;1Fa
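In the last output, 11 comes out as 1Fa rather than KL, because the key 1; also matches the tail of 11;. One way around this is to match each whole run of digits and look it up once. The sketch below uses plain re.sub with a function as the replacement; replace_codes is a hypothetical name, and leaving unknown codes untouched is an assumption.

```python
import re


def replace_codes(text, mapping):
    """Replace every whole number in `text` with its label from `mapping`."""
    # \d+ grabs each complete run of digits, so "11" is looked up as "11"
    # and never as two separate "1"s.
    return re.sub(r"\d+", lambda m: mapping.get(m.group(0), m.group(0)), text)


codes = {'1': 'Fa', '2': 'I', '3': 'T', '11': 'KL'}
for cell in ['1 2', '1 3', '3 2 11']:
    print(replace_codes(cell, codes))
```

Series.str.replace also accepts a callable replacement when regex=True, so the same pattern and lambda should carry over to a pandas column.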

Regex Negations in Vim

Question:
How do I convert var x+=1+2+3+(5+6+7) to var x += 1 + 2 + 3 + ( 5 + 6 + 7 )
Details:
Using regular expressions, something like :%s/+/\ x\ /g won't work because it will convert += to + = (amongst other problems). So instead one would use negations (negatives, nots, whatever they're called) like so :%s/\s\@!+/\ +/g, which is about as complicated a way as one can say "plus sign without an empty space before it". But now this converts something like x++ into x + +. What I need is something more complex. I need more than one constraint in the negation, and an additional constraint afterwards. Something like so, but this doesn't work :%s/[\s+]\@!+\x\@!/\ +/g
Could someone please provide the one, or possibly two regex statements which will pad out an example operator, such that I can model the rest of my rules on it/them.
Motivation:
I find beautifiers for languages like javascript or PHP don't give me full control (see here). Therefore, I am attempting to use regex to carry out the following conversions:
foo(1,2,3,4) → foo( 1, 2, 3, 4 )
var x=1*2*3 → var x = 1 * 2 * 3
var x=1%2%3 → var x = 1 % 2 % 3
var x=a&&b&&c → var x = a && b && c
var x=a&b&c → var x = a & b & c
Any feedback would also be appreciated

Thanks to the great feedback, I now have a regular expression like so to work from. I am running these two regular expressions:
:%s/\(\w\)\([+\-*\/%|&~)=]\)/\1\ \2/g
:%s/\([+\-*\/%|&~,(=]\)\(\w\)/\1\ \2/g
And it is working fairly well. Here are some results.
(1+2+3+4,1+2+3+4,1+2+3+4) --> ( 1 + 2 + 3 + 4, 1 + 2 + 3 + 4, 1 + 2 + 3 + 4 )
(1-2-3-4,1-2-3-4,1-2-3-4) --> ( 1 - 2 - 3 - 4, 1 - 2 - 3 - 4, 1 - 2 - 3 - 4 )
(1*2*3*4,1*2*3*4,1*2*3*4) --> ( 1 * 2 * 3 * 4, 1 * 2 * 3 * 4, 1 * 2 * 3 * 4 )
(1/2/3/4,1/2/3/4,1/2/3/4) --> ( 1 / 2 / 3 / 4, 1 / 2 / 3 / 4, 1 / 2 / 3 / 4 )
(1%2%3%4,1%2%3%4,1%2%3%4) --> ( 1 % 2 % 3 % 4, 1 % 2 % 3 % 4, 1 % 2 % 3 % 4 )
(1|2|3|4,1|2|3|4,1|2|3|4) --> ( 1 | 2 | 3 | 4, 1 | 2 | 3 | 4, 1 | 2 | 3 | 4 )
(1&2&3&4,1&2&3&4,1&2&3&4) --> ( 1 & 2 & 3 & 4, 1 & 2 & 3 & 4, 1 & 2 & 3 & 4 )
(1~2~3~4,1~2~3~4,1~2~3~4) --> ( 1 ~ 2 ~ 3 ~ 4, 1 ~ 2 ~ 3 ~ 4, 1 ~ 2 ~ 3 ~ 4 )
(1&&2&&3&&4,1&&2&&3&&4,1&&2&&3&&4) --> ( 1 && 2 && 3 && 4, 1 && 2 && 3 && 4, 1 && 2 && 3 && 4 )
(1||2||3||4,1||2||3||4,1||2||3||4) --> ( 1 || 2 || 3 || 4, 1 || 2 || 3 || 4, 1 || 2 || 3 || 4 )
var x=1+(2+(3+4*(965%(123/(456-789))))); --> var x = 1 +( 2 +( 3 + 4 *( 965 %( 123 /( 456 - 789 )))));
It seems to work fine for everything except nested brackets. If I fix the nested brackets problem, I will update it here.
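For comparison, the same padding can be sketched outside Vim as a single tokenizing pass, which sidesteps the nested-bracket issue because every bracket is padded on its own. This is a Python illustration rather than a Vim command; the operator list and the name pad_operators are assumptions, and the sketch ignores string literals, comments, and unary minus.

```python
import re

# Longer operators come first so that "+=" or "&&" is matched as one token
# before the single-character alternatives are tried.
TOKEN = re.compile(r"\+\+|--|&&|\|\||[=!]==?|[+\-*/%&|]=|[+\-*/%&|~=]|[(),]")


def _pad(match):
    tok = match.group(0)
    if tok in ("++", "--"):
        return tok            # leave increment/decrement untouched
    if tok == "(":
        return "( "           # space after an opening bracket
    if tok == ")":
        return " )"           # space before a closing bracket
    if tok == ",":
        return ", "           # space after a comma
    return f" {tok} "         # binary operators get a space on both sides


def pad_operators(line):
    """Pad operators, commas and brackets in one line of code."""
    indent = line[: len(line) - len(line.lstrip())]
    padded = TOKEN.sub(_pad, line.strip())
    # Collapse the double spaces created next to already-padded tokens.
    return indent + re.sub(r" {2,}", " ", padded).strip()


print(pad_operators("var x+=1+2+3+(5+6+7)"))
print(pad_operators("foo(1,2,3,4)"))
print(pad_operators("var x=1+(2+(3+4*(965%(123/(456-789)))));"))
```

Matching multi-character operators before single characters plays the role that the negations were meant to play in the substitute commands above.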
