KDB moving percentile using Swin function - list

I am trying to create a list of 99th and 1st percentiles. Rather than a single percentile for today, I want a percentile for each of 500 days, each computed from the prior 500 days. The functions I am using for this are the following:
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List]
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list equal to 0. I'm a bit lost as to why it would do that. Suggestions appreciated!

The zeros have two causes:
First, the earliest 500 days don't have 500 days of history to work with: on day 1 there is only 1 data point, on day 2 only 2, and so on. Only on the 500th day are there 500 days of actual data. You need to decide what behaviour you want for those early days; by default, that swin function fills the gaps with a seed value.
Second, you're using zero as that seed value (the w#0).
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for the first roughly 500 dates.
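The same effect can be reproduced outside q. Below is a minimal Python sketch, not a translation of the q code: the names seeded_windows and rank_percentile are hypothetical, and rank_percentile only mimics the xrank/bin logic under the assumption that the bucket of the i-th sorted value is floor(100*i/n). It shows the low percentile landing on the seed while the high percentile is unaffected.

```python
from bisect import bisect_right

def seeded_windows(values, width, seed=0):
    """Return one fixed-width window per value, left-padded with `seed`."""
    window = [seed] * width
    out = []
    for v in values:
        window = window[1:] + [v]  # drop the oldest entry, append the newest
        out.append(window)
    return out

def rank_percentile(p, window):
    """Pick the last sorted value whose bucket (assumed floor(100*i/n)) is <= p."""
    ordered = sorted(window)
    n = len(ordered)
    buckets = [100 * i // n for i in range(n)]
    return ordered[bisect_right(buckets, p) - 1]

windows = seeded_windows([1, 2, 3, 4, 5], 5)
low = [rank_percentile(1, w) for w in windows]
high = [rank_percentile(99, w) for w in windows]
print(low)   # [0, 0, 0, 0, 1]
print(high)  # [1, 2, 3, 4, 5]
```

The low percentile stays at the seed value until the window holds almost no seeds, which is why a 500-day window produces zeros for roughly the first 500 dates.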
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..
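For comparison, here is a rough Python sketch of the unseeded approach. The names growing_windows and rank_percentile are hypothetical, and the bucket rule floor(100*i/n) is an assumption about how xrank assigns buckets. The window simply grows until it reaches the full width, so no artificial values ever enter the percentile.

```python
from bisect import bisect_right

def growing_windows(values, width):
    """Return one window per value, holding at most the last `width` values."""
    window = []
    out = []
    for v in values:
        window = (window + [v])[-width:]  # keep only the most recent `width` items
        out.append(window)
    return out

def rank_percentile(p, window):
    """Pick the last sorted value whose bucket (assumed floor(100*i/n)) is <= p."""
    ordered = sorted(window)
    n = len(ordered)
    buckets = [100 * i // n for i in range(n)]
    return ordered[bisect_right(buckets, p) - 1]

windows = growing_windows([1, 2, 3, 4, 5], 5)
low = [rank_percentile(1, w) for w in windows]
high = [rank_percentile(99, w) for w in windows]
print(low)   # [1, 1, 1, 1, 1]
print(high)  # [1, 2, 3, 4, 5]
```

The trade-off is that the earliest results are based on very few observations, so they are noisier than the later ones.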

Related

How can I collapse my dataset to medians and 95% confidence intervals of the median in Stata?

I wish to collapse my dataset and (A) obtain medians by group, and (B) obtain the 95% confidence intervals for those medians.
I can achieve (A) by using collapse (p50) median = cost, by(group).
I can obtain the confidence intervals for the groups using bysort group: centile cost, c(50), but ideally I want to do this in a manner similar to collapse, so that I get a collapsed dataset of medians, lower limits (ll) and upper limits (ul) for each group (which I can then export for graphing in Excel).
Data example:
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
Desired dataset (or something similar):
. list
+-----------------------+
| group p50 ll ul |
|-----------------------|
1. | 0 40 20 50 |
2. | 1 20 10 60 |
+-----------------------+
clear
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
statsby median=r(c_1) ub=r(ub_1) lb=r(lb_1), by(group) clear: centile cost
list
+--------------------------+
| group median ub lb |
|--------------------------|
1. | 0 40 50 20 |
2. | 1 20 60 10 |
+--------------------------+
In addition to the usual help and manual entry, this paper includes a riff on essentially this problem of accumulating estimates and confidence intervals.
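To make the "one row per group" idea concrete outside Stata, here is a hedged Python sketch. The function median_ci is hypothetical and uses a simple order-statistic (sign-test) interval; this is an assumption for illustration and not necessarily the method centile uses, although on this small sample it gives the same limits as the output above.

```python
from math import comb
from statistics import median

def median_ci(values, level=0.95):
    """Median with an order-statistic interval (x_(k), x_(n-k+1)).

    Coverage of rank k is 1 - 2*P(X <= k-1), X ~ Binomial(n, 0.5).
    If even k=1 falls short of `level`, the min and max are returned,
    so the actual coverage can be below `level` for tiny samples.
    """
    ordered = sorted(values)
    n = len(ordered)

    def cdf(j):
        return sum(comb(n, i) for i in range(j + 1)) / 2 ** n

    k = 1
    # Narrow the interval while the next rank still reaches the target coverage.
    while k + 1 <= (n + 1) // 2 and 1 - 2 * cdf(k) >= level:
        k += 1
    return median(ordered), ordered[k - 1], ordered[n - k]

# (id, group, cost) rows from the data example above
rows = [(1, 0, 20), (2, 0, 40), (3, 0, 50), (4, 0, 40), (5, 0, 30),
        (6, 1, 20), (7, 1, 10), (8, 1, 10), (9, 1, 60), (10, 1, 30)]

by_group = {}
for _id, group, cost in rows:
    by_group.setdefault(group, []).append(cost)

collapsed = [(g,) + median_ci(costs) for g, costs in sorted(by_group.items())]
print(collapsed)  # [(0, 40, 20, 50), (1, 20, 10, 60)]
```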

replace multiple column values at the same time

I would like to replace multiple column values at the same time in a dataframe. I would like to change 2 to 1, 1 to 2.
data=data.frame(store=c(122,323,254,435,654,342,234,344)
,cluster=c(2,2,2,1,1,3,3,3))
The problem with my code is that after it changes 2 to 1, it changes those 1s back to 2.
Can I do it in dplyr or something similar? Thank you.
Desired data set below
store cluster
122 1
323 1
254 1
435 2
654 2
342 3
234 3
344 3
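The underlying problem is that two sequential replacements see each other's output. A single lookup applied to the original values avoids that. The sketch below shows the idea in Python rather than R (the names swap and recoded are hypothetical); in dplyr the same one-pass logic could be expressed with a single case_when call.

```python
# Map each old value to its new value; values not listed stay unchanged.
swap = {1: 2, 2: 1}

cluster = [2, 2, 2, 1, 1, 3, 3, 3]

# Every element is looked up against the ORIGINAL value exactly once,
# so a 2 that becomes 1 is never touched again.
recoded = [swap.get(c, c) for c in cluster]
print(recoded)  # [1, 1, 1, 2, 2, 3, 3, 3]
```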

Counting how many times a condition succeeds in SAS

I have a table A with real time values in it.
Amount Count Pct1 Pct2
300 2 0.000 100.000
1,891 2 0.001 100.000
500 2 0.000 100.000
100 2 0.000 100.000
1,350 2 0.001 100.000
2,648 2 0.001 100.000
2,255 2 0.001 100.000
500 2 0.000 100.000
200 2 0.000 30.441
10 2 0.000 100.000
1,928 2 0.001 100.000
40 2 0.000 100.000
200 2 0.000 100.000
256 2 0.000 100.000
254 2 0.000 100.000
100 2 0.001 100.000
50 1 0.000 33.333
1,512 2 0.001 100.000
I have a table B with a set of conditions. I want to generate the condition success count in SAS, i.e. if I pass row 1 of the table below as a condition to table A, it succeeds 2 times. I am using a join to generate a Cartesian product and it's not efficient. I want an efficient way to solve this problem (similar to what the COUNTIFS function does in Excel). Thanks a lot for your help.
Amount Count Pct1 Pct2 Condition Success Count
1,576 2 0 100 4
1,537 2 0 100 4
1,484 2 0 100 5
1,405 2 0 100 5
1,290 2 0 100 6
1,095 2 0 100 6
948 2 0 100 6
932 2 0 100 6
914 2 0 100 6
887 2 0 100 6
850 2 0 100 6
774 2 0 100 6
707 2 0 100 6
704 2 0 100 6
695 2 0 100 6
646 2 0 100 6
50 1 0 5.42 16
50 1 0 5.42 16
You said that you tried a join to make a Cartesian product. However, since you didn't post any code, I am not sure whether you built the full product and then counted the rows. Doing the counting in one SQL statement is much faster, since the full Cartesian product is never actually written anywhere. Like this:
proc sql;
create table tableC as
select c.*, coalesce(s,0) as SuccessCount from TableB c
left join (
select id, count(*) as s from TableA a,TableB b
where
a.amount >= b.amount and
a.count >= b.count and
a.pct1 >= b.pct1 and
a.pct2 >= b.pct2
group by id
) as d
on c.id = d.id
;
quit;
Note that tableB needs to have a unique id column. You should always have some column to use as an id, but if you don't have one already, simply create it like this, for example:
data tableB;
set tableB;
id = _N_;
run;
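The counting logic itself can be sketched in a few lines of Python, using the rows of table A from the question and the same >= comparisons as the SQL above. The function name success_count is hypothetical. As in the SQL, only a running count per condition is kept, never the full product.

```python
# Table A from the question: (amount, count, pct1, pct2)
table_a = [
    (300, 2, 0.000, 100.0), (1891, 2, 0.001, 100.0), (500, 2, 0.000, 100.0),
    (100, 2, 0.000, 100.0), (1350, 2, 0.001, 100.0), (2648, 2, 0.001, 100.0),
    (2255, 2, 0.001, 100.0), (500, 2, 0.000, 100.0), (200, 2, 0.000, 30.441),
    (10, 2, 0.000, 100.0), (1928, 2, 0.001, 100.0), (40, 2, 0.000, 100.0),
    (200, 2, 0.000, 100.0), (256, 2, 0.000, 100.0), (254, 2, 0.000, 100.0),
    (100, 2, 0.001, 100.0), (50, 1, 0.000, 33.333), (1512, 2, 0.001, 100.0),
]

def success_count(condition, rows):
    """Count rows where every field is >= the matching field of `condition`."""
    return sum(all(r >= c for r, c in zip(row, condition)) for row in rows)

# A few condition rows from table B: (amount, count, pct1, pct2)
conditions = [(1576, 2, 0, 100), (1484, 2, 0, 100), (1290, 2, 0, 100), (50, 1, 0, 5.42)]
counts = [success_count(c, table_a) for c in conditions]
print(counts)  # [4, 5, 6, 16]
```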

Pandas quantile failing with NaN's present

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import numpy as np
import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = float('nan')
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?
Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.
Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked
value in a sorted copy of V. A weighted average of the two nearest
neighbors is used if the normalized ranking does not match q exactly.
The same as the median if q=50, the same as the minimum if q=0
and the same as the maximum if q=100.
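The rule in that docstring can be checked by hand. The following pure-Python sketch (percentile_linear is a hypothetical helper, not the pandas or numpy implementation) drops NaNs, sorts, and interpolates linearly between the two nearest ranks. It reproduces both the 40.5 from the full first column and the 42 after the last row is replaced by NaN.

```python
import math

def percentile_linear(values, q):
    """q-th percentile (0-100) with linear interpolation, ignoring NaNs."""
    clean = sorted(v for v in values if not math.isnan(v))
    pos = (len(clean) - 1) * q / 100  # fractional rank in the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return clean[lo] + frac * (clean[hi] - clean[lo])

first_column = [33.0, 46.0, 71.0, 90.0, 71.0, 42.0, 80.0, 60.0, 33.0, 40.0]
q1_full = percentile_linear(first_column, 25)

with_nan = first_column[:-1] + [float("nan")]  # last row replaced by NaN
q1_nan = percentile_linear(with_nan, 25)

print(q1_full)  # 40.5
print(q1_nan)   # 42.0
```

With nine values the 25th percentile falls exactly on the third sorted value (rank position 2.0), so no interpolation happens and the result is 42. It differs from the "median of the lower half" figure only because that is a different definition of the first quartile.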

PROC RANK by score: minimum count of target variable in each group

I have used SAS PROC RANK to rank a population based on score and create groups of equal size. I would like to create groups such that there is a minimum number of target variable (Goods and Bads) in each bin. Is there a way to do that using PROC RANK? I understand that the size of each bin would be different.
For example, in the table below I have created 10 groups based on a certain score. As you can see, the non cures in the lower deciles are sparse. I would like to create groups such that there are at least 10 non cures in each group.
Cures and Non cures are based on same variable: Cure = 1 and Cure = 0.
Decile cures non cures
0 262 94
1 314 44
2 340 19
3 340 13
4 353 10
5 373 5
6 308 3
7 342 3
8 440 4
9 305 3
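One possible approach, independent of PROC RANK, is to start from fine-grained bins ordered by score and merge adjacent bins until each merged group reaches the minimum count. The Python sketch below illustrates that greedy merge on the non cure counts from the table; merge_bins is a hypothetical helper, and appending a short leftover tail to the last group is just one of several reasonable choices.

```python
def merge_bins(counts, minimum):
    """Greedily merge adjacent bins so each group has at least `minimum` events.

    Returns a list of groups, each a list of original bin indices.
    """
    groups, current, total = [], [], 0
    for idx, c in enumerate(counts):
        current.append(idx)
        total += c
        if total >= minimum:
            groups.append(current)
            current, total = [], 0
    if current:  # leftover bins that never reached the minimum
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

non_cures = [94, 44, 19, 13, 10, 5, 3, 3, 4, 3]  # deciles 0..9 from the table
groups = merge_bins(non_cures, 10)
print(groups)  # [[0], [1], [2], [3], [4], [5, 6, 7, 8, 9]]
```

The resulting groups are of unequal size, as expected: deciles 5 to 9 collapse into one group with 18 non cures.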