Cut-off point in k-means clustering in SAS

So I want to classify my data into clusters with a cut-off point in SAS. The method I use is k-means clustering. (I don't mind which method, as long as it gives me 3 groups.)
My code for clustering:
proc fastclus data=maindat outseed=seeds1 maxclusters=3 maxiter=0;
var value resid;
run;
I have a problem with the output result: I want the cut-off points for Value to be included in the output file. (I don't want cut-off points for Resid.) So is there any way to do this in SAS?
Edit: As Joe points out, I can't achieve what I'm looking for by using k-means clustering. So is there another way? Basically, I want a cut-off point so that I can apply it to another data set.
What I have:
Cluster Value Resid
1 34 11.7668
2 38.9 0.5328
3 42.625 -13.2364
what I want:
Cluster Value Resid Cut-off Value (Integer)
1 34 11.7668 1-36
2 38.9 0.5328 36-40
3 42.625 -13.2364 40-44
My data:
data maindat;
input value Resid ;
datalines;
44 -4.300511714
44 -9.646920963
44 -15.86956805
43 -16.14857235
43 -13.05797186
43 -13.80941206
42 -3.521394503
42 -1.102526302
42 -0.137573583
42 2.669238665
42 -9.540489193
42 -19.27474303
42 -3.527077011
41 1.676464068
41 -2.238822314
41 4.663079037
41 -5.346920963
40 -8.543723186
40 0.507460641
40 0.995302284
40 0.464194011
39 4.728791571
39 5.578685423
38 2.771297564
38 7.109159247
37 15.96059456
37 2.985292226
36 -4.301136971
35 5.854674875
35 5.797294021
34 4.393329025
33 -6.622580905
32 0.268500302
27 12.23062252
;
run;

I don't think you can necessarily do this, at least not completely.
k-means clustering uses the Euclidean distance between all of the variables you provide it. This means that it's not solely using value to cluster observations: it's using Resid as well.
As such, it's possible that a row with a value that seems like it should go with cluster 2 actually goes with cluster 3, if its Resid value is much closer there.
In your example, if you request an out dataset, you will see this is true. A proc freq of that out dataset reveals that cluster 1 has three rows, with values 27, 37, and 38. Cluster 2 has almost all of the rows - all but seven in total - ranging in value from 32 to 44. Cluster 3 ranges from 40 to 44.
As such, there's no reasonable way to define your clusters the way you ask with this method of clustering. Clusters are typically defined by their centroid, and that's what you get with the outstat dataset; you can determine which cluster a particular value would be assigned to based on this.
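To make the centroid idea concrete, here is a rough sketch (in Python rather than SAS, and only an illustration - this is not something fastclus outputs): if you clustered on value alone, the natural boundary between two adjacent clusters would be the midpoint between their centroids, which lands close to the 36 and 40 cut-offs in the desired output above.
centroids = sorted([34.0, 38.9, 42.625])  # the Value means from the table above

# Midpoint between each pair of adjacent centroids = 1-D decision boundary
cutoffs = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
print(cutoffs)  # [36.45, 40.7625]
Note this only holds in one dimension; with Resid included in the distance calculation, the cluster boundaries are not clean intervals on value.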

Related

Eliminate duplicate pairs of numbers and total their quantities in openoffice Calc

I have a Calc sheet listing a cut-list for plywood in two columns with a quantity in a third column. I would like to remove duplicate matching pairs of dimensions and total the quantity. Starting with:
A B C
25 35 2
25 40 1
25 45 3
25 45 2
35 45 1
35 50 3
40 25 1
40 25 1
Ending with:
A B C
25 35 2
25 40 1
25 45 5
35 45 1
35 50 3
40 25 2
I'm trying to automate this. Currently I have multiple lists which occupy the same page which need to be totaled independently of each other.
Put a unique ListId, ListCode or ListNumber on each of the lists, and let all rows that fall into the same list share the same value in this field.
Concatenate A & B to form a new column, say, PairAB.
If the list is small and manageable, filter on PairAB and collect the totals.
Otherwise, use grouping and subtotals to get totals for each list and each pair, grouping on ListId and PairAB.
If the list is very large, you are better off taking it to CSV and onward to a database; such things are child's play in SQL.
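Once it's out of Calc, the pair-and-total step is a one-liner in most tools. For instance, a minimal pandas sketch (Python, purely illustrative, using the sample rows above):
import pandas as pd

df = pd.DataFrame({"A": [25, 25, 25, 25, 35, 35, 40, 40],
                   "B": [35, 40, 45, 45, 45, 50, 25, 25],
                   "C": [2, 1, 3, 2, 1, 3, 1, 1]})

# One row per (A, B) pair with quantities summed;
# (40, 25) stays distinct from (25, 40), as in the desired output
totals = df.groupby(["A", "B"], as_index=False)["C"].sum()
print(totals)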

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, and 2) steps (the number of steps taken on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame, where each row's value equals the 'steps' value of that same row minus the 'steps' value of the row above (or 0 if it is the first row). To complicate things further, I want to calculate these differences per user_id, so I need to make sure I never subtract the steps values of two rows with different user_ids.
Does anyone have an idea how to get this done with Python 2.7 and Pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in your original data, but the following does what you want; you will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string which maps to the diff method, which subtracts the previous row's value. transform applies a function and returns a series with an index aligned to the df.
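Equivalently (a variation on the same idea, not part of the answer above), you can call diff on the grouped steps column directly and fill the leading NaN of each group in one chain:
df['d_steps'] = df.groupby('user_id')['steps'].diff().fillna(0)
This avoids relying on transform's string dispatch and makes it explicit which column is being differenced.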

Check if a column exists and then sum in SAS

This is my input dataset:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03 Col_04 Col_bb
NYC 10 0 44 55 66 34 44
CHG 90 55 4 33 22 34 23
TAR 10 8 0 25 65 88 22
I need to calculate the % of Col_A0 for a specific reference.
For example % col_A0 would be calculated as
10 / (10+0+44+55+66+34+44) = 0.0395, i.e. 3.95%
So my output should be
Ref %Col_A0 %Rest
NYC 3.95% 96.05%
CHG 34.48% 65.52%
TAR 4.58% 95.42%
I can do this part, but the issue is the column variables.
Col_A0 and Ref are fixed columns, so they will be in the input every time, but the other columns won't always be there. And there can be some additional columns too, like Col_10 through Col_30 and Col_cc through Col_zz.
For example the input data set in some scenarios can be just:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03
NYC 10 0 44 55 66
CHG 90 55 4 33 22
TAR 10 8 0 25 65
So is there a way I can write SAS code which checks whether a column exists or not? Or is there any other, better way to do it?
This is my current SAS code written in Enterprise Guide.
PROC SQL;
CREATE TABLE output123 AS
select
ref,
(col_A0/Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb)) FORMAT=PERCENT8.2 AS PERCNT_ColA0,
(1-(col_A0/Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb))) FORMAT=PERCENT8.2 AS PERCNT_Rest
From Input123;
quit;
In scenarios where not all the columns are there, I get an error; and if there are additional columns, I miss those. Please advise.
Thanks
I would not use SQL here; I would use a regular data step.
data want;
set have;
/* sum(of _numeric_) adds every numeric variable in the row; a0_prop  */
/* and rest_prop are still missing at this point, so sum() skips them */
a0_prop = col_a0 / sum(of _numeric_);
rest_prop = 1 - a0_prop;
format a0_prop rest_prop percent8.2;
run;
If you wanted to do this in SQL, the easiest way is to keep (or transform) the dataset in vertical format, i.e. each variable as a separate row per ID. Then you don't need to know how many variables there are.
If you always want to sum all the numeric columns then just do :
col_A0 / sum(of _numeric_)
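To illustrate the vertical-format idea (in Python/pandas rather than SAS SQL, purely as a sketch built from the sample rows above): once each value is its own row, the per-Ref total no longer depends on how many columns the file happens to have.
import pandas as pd

df = pd.DataFrame({"Ref": ["NYC", "CHG", "TAR"],
                   "Col_A0": [10, 90, 10],
                   "Col_01": [0, 55, 8],
                   "Col_02": [44, 4, 0]})

# Vertical format: one row per (Ref, column) pair
tall = df.melt(id_vars="Ref", var_name="col", value_name="val")
totals = tall.groupby("Ref")["val"].sum()

pct_a0 = df.set_index("Ref")["Col_A0"] / totals
print(pct_a0)  # NYC: 10/54, CHG: 90/149, TAR: 10/18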

Pandas quantile failing with NaN's present

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc., since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this might not be a NaN issue at all, but rather quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing dataframes with both even and odd numbers of rows, I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?
Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.
Calculating np.percentile(np.array([33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.0.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked
value in a sorted copy of V. A weighted average of the two nearest
neighbors is used if the normalized ranking does not match q exactly.
The same as the median if q=50, the same as the minimum if q=0
and the same as the maximum if q=100.
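You can check the mechanism directly (a quick numpy demonstration using the first column of the data above): dropping the tenth value shifts the interpolated position in the sorted array, which is the entire difference.
import numpy as np

col0 = np.array([33., 46., 71., 90., 71., 42., 80., 60., 33., 40.])

print(np.percentile(col0, 25))       # 40.5 - position 0.25 * (10 - 1) = 2.25
print(np.percentile(col0[:-1], 25))  # 42.0 - position 0.25 * (9 - 1) = 2.0
So quantile is not mishandling the NaNs; it is simply computing the percentile of the nine remaining values.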

Confidence intervals on a matrix of data in SAS

I have the following matrix of data, which I am reading into SAS:
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
It represents 5 different species of animal (the rows) in 5 different areas of an environment (the columns). I want to get a Shannon diversity index for the whole environment, so I sum the rows to get:
48 54 68 79 84
Then calculate the Shannon index from this, to get:
1.5873488
What I need to do, however, is calculate a confidence interval for this Shannon index. So I want to perform a nonparametric bootstrap on the initial matrix.
Can anyone advise how this is possible in SAS?
There are several ways to do this in SAS. I would use proc surveyselect to generate the bootstrap samples, and then calculate the Shannon Index for each replicate. (I didn't know what the Shannon Index was, so my code is just based on what I read on Wikipedia.)
data animals;
input v1-v5;
cards;
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
run;
/* Generate 5000 bootstrap samples, with replacement; outhits writes one
   row per draw, so units selected more than once are counted fully */
proc surveyselect data=animals method=urs n=5 reps=5000 seed=10024 out=boots outhits;
run;
/* For each replicate, calculate the sum of each variable */
proc means data=boots noprint nway;
class replicate;
var v:;
output out=sums sum=;
run;
/* Calculate the proportions, and p*log(p), which will be used next */
data sums;
set sums;
ttl=sum(of v1-v5);
array ps{*} p1-p5;
array vs{*} v1-v5;
array hs{*} h1-h5;
do i=1 to dim(vs);
ps{i}=vs{i}/ttl;
hs{i}=ps{i}*log(ps{i});
end;
keep replicate h:;
run;
/* Calculate the Shannon Index, again for each replicate */
data shannon;
set sums;
shannon = -sum(of h:);
keep replicate shannon;
run;
We now have a data set, shannon, which contains the Shannon Index calculated for each of the 5000 bootstrap samples. You could use this to calculate p-values, but if you just want critical values, you can run proc means (or proc univariate if you want the 2.5th and 97.5th percentiles for a 95% interval, since proc means only offers a fixed set of percentile statistics).
proc means data=shannon mean p1 p5 p95 p99;
var shannon;
run;
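For reference, here is a minimal sketch of the same resample-rows, sum-columns, recompute-index loop in Python/numpy (an illustration of the procedure, not part of the SAS answer above; it assumes all counts are positive, which is true here, so log(0) never occurs):
import numpy as np

rng = np.random.default_rng(10024)
animals = np.array([[ 1,  5, 12, 19, 13],
                    [ 6,  3,  1,  3, 14],
                    [ 2,  7, 12, 19, 21],
                    [22, 24, 21, 29, 18],
                    [17, 15, 22,  9, 18]], dtype=float)

def shannon(counts):
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

print(shannon(animals.sum(axis=0)))  # 1.5873..., matching the value above

# Resample the 5 species rows with replacement, 5000 times,
# summing the columns and recomputing the index each time
stats = np.array([shannon(animals[rng.integers(0, 5, size=5)].sum(axis=0))
                  for _ in range(5000)])
print(np.percentile(stats, [2.5, 97.5]))  # 95% bootstrap interval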