Pandas quantile failing with NaN's present - python-2.7

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quntile to calculate the rolling q1/q3 values in my dataframe, however, this will not work with NaN's present. Can anyone provide a solution to this issue?

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs you're essentially left with an array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column
Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) gives 42.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked
value in a sorted copy of V. A weighted average of the two nearest
neighbors is used if the normalized ranking does not match q exactly.
The same as the median if q=50, the same as the minimum if q=0
and the same as the maximum if q=100.

Related

KDB moving percentile using Swin function

I am trying to create a list of the 99th and 1st percentiles. Rather than a single percentile for today. I wanted percentiles for 500 days each using the prior 500 days. The functions I was using for this are the following
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List].
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list = 0. a bit lost as to why it would do that. suggestions appreciated!
What's causing the zeros is two-fold:
What behaviour do you want for the earliest 500 days when there isn't 500 days of history to work with? On day 1 there's only 1 datapoint, on day 2 only 2 etc. Only on the 500th day is there 500 days of actual data to work with. By default that swin function fills the gaps with some seed value
You're using zero as that seed value, aka w#0
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for the first roughly 500 dates.
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..

Quantization tabels of different sizes

I am doing some research on image compression via discrete cosine transformations and i want to change to quantization tables sizes so that i can study what happens when i change the matrix sizes that i divide my pictures into. The standard sub-matrix size is 8X8 and there is a lot of tables for those dimensions. For example the standard JPEG quantization table (that i use) is:
standardmatrix8 = np.matrix('16 11 10 16 24 40 51 61;\
12 12 14 19 26 58 60 55;\
14 13 16 24 40 57 69 56;\
14 17 22 29 51 87 80 62;\
18 22 37 56 68 109 103 77;\
24 35 55 64 81 104 103 92;\
49 64 78 77 103 121 120 101;\
72 92 95 98 112 100 103 99').astype('float')
I have assumed that the quantization tabel for 2X2 and 4X4 would be:
standardmatrix2 = np.matrix('16 11; 12 12.astype('float')
standardmatrix4 = np.matrix('16 11 10 16;
12 12 14 19;
14 13 16 24;
18 22 37 56').astype('float')
Since the entries in the standard table correspond to the same frequencies in the smaller matrixes.
But what about a quantization with dimentions 16X16, 24X24 and so on. I know that the standard quantization tabels are worked out by experiments and can't be calculated from some formula, but i assume that someone have tried changing the matrix sizes before me! Where can i find these tabels? Or can i just make something up and scale the last entries to higher frequenzies?

Eliminate duplicate pairs of numbers and total their quantities in openoffice Calc

I have a Calc sheet listing a cut-list for plywood in two columns with a quantity in a third column. I would like to remove duplicate matching pairs of dimensions and total the quantity. Starting with:
A B C
25 35 2
25 40 1
25 45 3
25 45 2
35 45 1
35 50 3
40 25 1
40 25 1
Ending with:
A B C
25 35 2
25 40 1
25 45 5
35 45 1
35 50 3
40 25 2
I'm trying to automate this. Currently I have multiple lists which occupy the same page which need to be totaled independently of each other.
Put a unique different ListId, ListCode or ListNumber for each of the lists. Let all rows falling into the same list, have the same value for this field.
Concatenate A & B and form a new column, say, PairAB.
If the list is small and handlable, filter for PairAB and collect totals.
Otherwise, use Grouping and subtotals to get totals for each list and each pair, grouping on ListId and PairAB.
If the list is very large, you are better off taking it to CSV, and onward to a database, such things are simple child's play in SQL.

Cut off point in k-means clustering in sas

So I want to classify my data into clusters with cut-off point in SAS. The method I use is k-means clustering. (I don't mind about the method, as long as, it gives me 3 groups.)
My code for clustering:
proc fastclus data=maindat outseed=seeds1 maxcluster =3 maxiter=0;
var value resid;
run;
I have the problem with the output result. I want the cut-off point for the Value to be include in the output file. (I don't want the cut-off point for Resid). So is there anyway to do this in SAS?
Edit: As Joe point out, I can't achieve what i'm looking for by using k-mean clustering. So is there another way? Basically, I want a cut-off point so that I can apply it to the another data set.
What I have:
Cluster Value Resid
1 34 11.7668
2 38.9 0.5328
3 42.625 -13.2364
what I want:
Cluster Value Resid Cut-off Value (Interger)
1 34 11.7668 1-36
2 38.9 0.5328 36-40
3 42.625 -13.2364 40-44
My data:
data maindat;
input value Resid ;
datalines;
44 -4.300511714
44 -9.646920963
44 -15.86956805
43 -16.14857235
43 -13.05797186
43 -13.80941206
42 -3.521394503
42 -1.102526302
42 -0.137573583
42 2.669238665
42 -9.540489193
42 -19.27474303
42 -3.527077011
41 1.676464068
41 -2.238822314
41 4.663079037
41 -5.346920963
40 -8.543723186
40 0.507460641
40 0.995302284
40 0.464194011
39 4.728791571
39 5.578685423
38 2.771297564
38 7.109159247
37 15.96059456
37 2.985292226
36 -4.301136971
35 5.854674875
35 5.797294021
34 4.393329025
33 -6.622580905
32 0.268500302
27 12.23062252
;
run;
I don't think you could necessarily do this completely.
k-means clustering uses euclidean distance between all of the variables you provide it. This means that it's not solely using value to cluster observations: it's using Resid as well.
As such, it's possible a row with a value that seems like it should go with cluster 2 should actually go with cluster 3, if the Resid value is much closer there.
In your example, if you request an out dataset, you will see this is true. A proc freq of that out dataset reveals that cluster 1 has three rows, with values 27, 37, and 38. Cluster 2 has almost all of the rows - all but 7 in total - ranging from 32 to 44. Cluster 3 ranges from 40 to 44.
As such, there's no reasonable way to define your clusters the way you ask with this method of clustering. Clusters are typically defined by their centroid, and that's what you get with the outstat dataset; you can determine which cluster a particular value should be assigned based on this.

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame where the row values of this data frame match the value of the column 'steps' within this same row, minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and Panda?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in you orig data but the following does what you want, you will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby by object and pass a string which maps to the diff method which subtracts the previous row value. Transform applies a function and returns a series with an index aligned to the df.