Data input for K means clustering with Scipy, Python? - python-2.7

I have a point dataset with two attributes and I would like to cluster these points based on the attribute values. I want to use K means clustering but I am unsure what my input data should look like when using Scipy's implementation.
For example should I make a numpy array with each row containing: FID, attribute 1, attribute 2, x-coord, y-coord, or an array of just the attribute values? The attributes are integers and floats.

Each row in your data should be a discrete observation, and the columns should correspond to the features (dimensions) of your data. In your case, FID, attribute 1, attribute 2, x-coord and y-coord go in the columns, and each row represents one observation (one point).
from scipy.cluster.vq import kmeans,vq
nbStates = 4
Centers, _ = kmeans(Data, nbStates)
Data_id, _ = vq(Data, Centers)
where Data should be an Nx5 matrix whose 5 columns correspond to your 5 features (FID, attribute 1, attribute 2, x-coord, y-coord) and whose N rows correspond to the N observations. In other words, reshape your FID array into a column vector, do the same for the other features, concatenate them horizontally, and pass the result to kmeans. nbStates is the number of clusters you expect to see; it must be chosen beforehand. The result Centers is a KxM matrix, where K is the number of clusters and M is the number of features in your data. Data_id holds the label (cluster index) of each data point; it is a 1-D array of length N, where N is the number of data points.
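As a minimal sketch of the layout described above, the following builds an array with one row per observation from two made-up attribute arrays and prints the shapes that kmeans and vq return. The names attr1 and attr2, the three synthetic groups and the choice of 3 clusters are assumptions for illustration only, not the asker's data.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Hypothetical data: 3 well-separated groups of 20 points, two attributes each.
rng = np.random.RandomState(0)
attr1 = np.concatenate([rng.normal(loc, 0.1, 20) for loc in (0.0, 5.0, 10.0)])
attr2 = np.concatenate([rng.normal(loc, 0.1, 20) for loc in (0.0, 5.0, 10.0)])

# Stack the 1-D attribute arrays as columns: one row per observation.
Data = np.column_stack([attr1, attr2])   # shape (60, 2)

nbStates = 3
Centers, _ = kmeans(Data, nbStates)      # one row per cluster, one column per feature
Data_id, _ = vq(Data, Centers)           # one cluster index per row of Data

print(Data.shape, Centers.shape, Data_id.shape)
```

The same column_stack call extends to more feature columns if you decide to cluster on more than the two attributes.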

If you want to cluster solely on the attributes, you should create an Nx2 matrix (according to the scipy docs), with your attributes as columns and each data point as a row.
You will probably improve your results by whitening (normalizing) the data points. Assuming your data have two fields, attr1 and attr2, and you have a list dataset containing them, the corresponding code would look like:
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# One row per data point, one column per attribute (N x 2)
data = np.empty((len(dataset), 2))
for row, d in enumerate(dataset):
    data[row, 0] = d.attr1
    data[row, 1] = d.attr2
whitened_data = whiten(data)  # scale each column to unit variance
clusters, _ = kmeans(whitened_data, 5)  # 5 is the number of clusters you assume
assignments, _ = vq(whitened_data, clusters)  # cluster index for each data point

Related

How to count a value in a range using array formula

I want to count the number of No values in the ranges F:R, BC:BN and CX:DI with an array formula, so that whenever someone submits a new response containing No in these ranges, it is counted.
I tried using this formula
=ARRAYFORMULA(IF(ROW(E:E)=1,"NC",IF(LEN(E:E), IF(IFERROR(REGEXEXTRACT(TRANSPOSE(QUERY(TRANSPOSE(COUNTIFS(OR(DV:EG="No",BW:CH="No",U:AG="No"))),, 999^99)), "♦"))="♦", 1, 0), )))
but it didn't work, I also tried this formula:
=ARRAYFORMULA(IF(ROW(A:A)=1,"NC",IF(LEN(A:A)=0,IFERROR(1/0),COUNTIFS(F:R,"No")+COUNTIFS(BC:BN,"No")+COUNTIFS(CX:DI,"No"))))
But it counted all the values in the whole range.
I need it to count the No values row by row, so that at the end of every row, under NC, it shows the number of No values in the ranges F:R, BC:BN and CX:DI.
Here is a spreadsheet containing the data:
https://docs.google.com/spreadsheets/d/1SksZv0h82j5oEZBj2AN5anDFr80AYNR5ettSwkpUKys/edit#gid=0
=ARRAYFORMULA({"NC"; IF(LEN(A2:A),
MMULT(IFERROR(LEN(REGEXEXTRACT({F2:R,BC2:BN,CX2:DI}, "No"))/
LEN(REGEXEXTRACT({F2:R,BC2:BN,CX2:DI}, "No")), 0),
TRANSPOSE(COLUMN(A1:AK1)^0)), )})
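To make the logic of the formula easier to follow, here is a rough Python sketch of the same idea: turn each selected cell into 1 or 0, then sum across each row (which is what multiplying by a column of ones with MMULT does). The function name, the sample rows and the column slices are hypothetical. Note one difference: this sketch tests for cells equal to "No", whereas REGEXEXTRACT matches any cell that contains "No".

```python
def count_no_per_row(rows, column_slices, target="No"):
    """Count cells equal to `target` in the selected columns of each row."""
    counts = []
    for row in rows:
        # Gather the cells from every selected column range,
        # similar to the {F2:R,BC2:BN,CX2:DI} array literal.
        selected = [cell for sl in column_slices for cell in row[sl]]
        # 1 for each match, then sum across the row (the MMULT-by-ones step).
        counts.append(sum(1 for cell in selected if cell == target))
    return counts

# Hypothetical responses: column 0 is an ID, the rest are answers.
rows = [
    ["a", "No", "Yes", "No", "x", "No"],
    ["b", "Yes", "Yes", "Yes", "y", "Yes"],
    ["c", "No", "Yes", "Yes", "z", "No"],
]
slices = [slice(1, 3), slice(5, 6)]  # columns 1-2 and column 5 only
result = count_no_per_row(rows, slices)
print(result)  # [2, 0, 2]
```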

How to use the dimension of a python matrix in a loop

I am working with a matrix, let's call it X, in Python.
I know how to get the dimensions of the matrix using X.shape, but I am specifically interested in using the number of rows of the matrix in a for loop, and I don't know how to get this value in a datatype suitable for a loop.
For example, imagine this simple situation:
a = np.matrix([[1,2,3],[4,5,6]])
for i in 1:(number of rows of a)
print i
How can I get automatically that "number of rows of a"?
X.shape[0] == number of rows in X
A superficial search on numpy will lead you to shape. It returns a tuple of array dimensions.
In your case, the first dimension (axis 0) is the rows and the second (axis 1) is the columns. You can access each one as you would access a tuple's element:
import numpy as np
a = np.matrix([[1,2,3],[4,5,6]])
# a.shape[1]: columns
for i in range(0, a.shape[1]):
    print('column ' + format(i))
# a.shape[0]: rows
for i in range(0, a.shape[0]):
    print('row ' + format(i))
This will print:
column 0
column 1
column 2
row 0
row 1
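Since shape is an ordinary tuple of integers, it can also be unpacked directly. The short sketch below assumes a plain np.array rather than the np.matrix from the question (an assumption for illustration; shape behaves the same way for both) and uses the row count to drive a loop.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
n_rows, n_cols = a.shape          # a.shape is the tuple (2, 3)

row_sums = []
for i in range(n_rows):           # i takes the values 0 and 1
    row_sums.append(int(a[i].sum()))

print(n_rows, n_cols, row_sums)   # 2 3 [6, 15]
```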

Based on a count value I have to create a number of rows; is that possible without a Java transformation?

Does anyone know how to create a number of rows based on a count value without using a Java transformation in Informatica 9.6 (for a flat file)? Please help me with that.
You can create an auxiliary table with n rows for each possible count value between 1 and N:
1
2
2
3
3
3
...
...
N rows with the last value
...
N rows with the last value
Join this table to the source data using the n count value as the key and you will get n copies of each source row.
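The join can be illustrated outside Informatica with a small Python sketch. The function names, the (record, count) row layout and the sample data are hypothetical; the point is only that an equality join on the count value produces n copies of each source row.

```python
def build_aux_table(max_count):
    """Return rows (n, copy_index): n rows for each n in 1..max_count."""
    return [(n, i) for n in range(1, max_count + 1) for i in range(1, n + 1)]

def expand_rows(source_rows, max_count):
    """Join each (record, count) source row to the auxiliary rows with the same count."""
    aux = build_aux_table(max_count)
    return [
        (record, copy_index)
        for record, count in source_rows
        for n, copy_index in aux
        if n == count  # the join condition: count value is the key
    ]

source = [("A", 2), ("B", 3), ("C", 1)]
expanded = expand_rows(source, max_count=3)
print(expanded)
# [('A', 1), ('A', 2), ('B', 1), ('B', 2), ('B', 3), ('C', 1)]
```

The auxiliary table for N possible count values has N*(N+1)/2 rows, so this approach is practical only when the maximum count is reasonably small.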

Stata: extract p-values and save them in a list

This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
1. Do a bunch of tests (e.g. lrtest results in a foreach loop)
2. Extract the p-value from each test and save them in a list of some kind
3. Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
  myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
  pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that would save the p-values in a matrix and then you can manipulate the matrix, maybe using Mata or standard matrix manipulation in Stata.
matrix storeMyP = J(2, 1, .) //create empty matrix with 2 (as many variables as we are looping over) rows, 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
    loc n = `n' + 1 //each iteration, adjust the count
    reg a `variableName'
    test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (check -help test-; there is lots this can do)
    matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on is that Stata automatically stores certain quantities after estimation and test commands. When the help files say that a command stores the following values in r(), you refer to them by wrapping the name in a left single quote (backtick) and a right single quote, as in `r(p)' above.
It could also be interesting for you to convert the matrix column(s) into variables using svmat storeMyP, or see help svmat for more info.

Column of lists inside a dataframe in R

Let's have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random number of a probability distribution and the second column stores a list with its parameters and name.
The dataframe df looks like:
      sample       params
1 0.85102972 0, 1, Normal
2 0.67313218  5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there any alternative way of storing data like this? something like a multi-dimensional dataframe? I do not want to add columns for each parameter because it will scatter the dataframe with missing values.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal = list(sample=rnorm(1,0,1),params=list(mean=0,sd=1,dist="Normal")),
                 gamma = list(sample=rgamma(1,5,5),params=list(shape=5,rate=5,dist="Gamma")),
                 binom = list(sample=rbinom(1,7,0.7),params=list(size=7,prob=0.7,dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),params=list(mean=2,sd=3,dist="Normal")),
                 tdist = list(sample=rt(1,3),params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
     normal       gamma       binom     normal2       tdist
   "Normal"     "Gamma"  "Binomial"    "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = mean and df$param2 = sd. A row with df$distribution = 'Student' should have df$param1 = degrees of freedom and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.