Column of lists inside a dataframe in R

Let's say we have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random draw from a probability distribution, and the second column stores a list with the distribution's parameters and name.
The dataframe df looks like:
      sample       params
1 0.85102972 0, 1, Normal
2 0.67313218  5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there an alternative way of storing data like this? Something like a multi-dimensional data frame? I do not want to add a column for each parameter, because that would scatter missing values throughout the data frame.

It's probably more natural to store information like this in a pure list structure than in a data frame:
distList <- list(normal  = list(sample=rnorm(1,0,1),    params=list(mean=0, sd=1, dist="Normal")),
                 gamma   = list(sample=rgamma(1,5,5),   params=list(shape=5, rate=5, dist="Gamma")),
                 binom   = list(sample=rbinom(1,7,0.7), params=list(size=7, prob=0.7, dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),    params=list(mean=2, sd=3, dist="Normal")),
                 tdist   = list(sample=rt(1,3),         params=list(df=3, dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
normal gamma binom normal2 tdist
"Normal" "Gamma" "Binomial" "Normal" "Student-T"

If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = mean and df$param2 = sd. A row with df$distribution = 'Student' should have df$param1 = df (the degrees of freedom) and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
                 param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
                           distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
                           param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().

Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.

Related

Changing the name of a list on the fly using a counter

I have a set of lists,
list_0=[a,b,a,b,b,c,f,h................]
list_1=[f,g,c,g,f,a,b,b,b,.............]
list_2=[...............................]
............
list_j=[...............................]
where j is (k-1), with some thousands of values stored in them. I want to count how many times a specific value appears in a specific list. Each element of those lists can only take one of 8 specific values (say a,b,c,d,e,f,g,h), so for every list I want to count how many times the value a appears, how many times the value b appears, and so on.
This is not so complicated.
What is complicated, at least for me, is to change on the fly the name of the list.
I tried:
for i in range(k):
    my_list = 'list_' + str(int(k))
    a_sum = exec(my_list.count(a))
    b_sum = exec(my_list.count(b))
    ...
and it doesn't work.
I've read some other answers to similar problems, but I'm not able to translate them to fit my needs :-(
Tkx.
What you want is to dynamically access a local variable by its name. That's doable, all you need is locals().
If you have variables with names "var0", "var1" and "var2" and you want to access their content without hardcoding their names, you can do it as follows:
var0 = [1,2,3]
var1 = [4,5,6]
var2 = [7,8,9]

for i in range(3):
    variable = locals()['var'+str(i)]
    print(variable)
Output:
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
Although doable, this is not advised: you could instead store those lists in a dict with their names as string keys, so that later you can access them with a simple string lookup without having to worry about variable scope (see the sketch below).
If your names differ just by a number then perhaps you could also use a list, and the number would be the index inside it.
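For the counting task in the question, a minimal sketch of that dict approach (the example values are made up) using collections.Counter from the standard library:
from collections import Counter

# Store the lists in a dict keyed by name instead of separate variables
lists = {
    'list_0': ['a', 'b', 'a', 'b', 'b', 'c', 'f', 'h'],
    'list_1': ['f', 'g', 'c', 'g', 'f', 'a', 'b', 'b', 'b'],
}
for name, values in lists.items():
    counts = Counter(values)  # tallies every distinct value in one pass
    print(name, counts['a'], counts['b'])  # a value that never occurs counts as 0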

How do I compare two columns at once against two different data frames in python (pandas)?

df1 contains two columns of Lat and Long, and several thousand rows. df2 also contains two columns of lat and long with many rows. Essentially, df2 is a list of reference locations that I want to compare df1 with. I want to compare both the Latitude and Longitude of df1 with df2 to say their locations match, or say they don't. i.e.,
my_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
reference_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
In simpler words, I want to say that if the location in each row in my_data is present in reference_data, label it 1, else label it 0. Since this location has two components Lat and Long, they BOTH need to be present next to each other anywhere in the reference dataframe. Is there an easy one-liner?
You could generate this by using the merge function to join the reference_data to my_data with an indicator.
new_df = pd.merge(my_data, reference_data, on=['Lat','Lon'], how='left', indicator='flag')
You'll get a dataframe that should look exactly like my_data but include a new column "flag" which either says "left_only" or "both".
To get it as a [0,1] label:
new_df['bin_flag'] = (new_df['flag']=='both').astype(int)
To my knowledge, there is not an actual one-liner for this one.
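A self-contained sketch of the merge approach, with made-up coordinates:
import pandas as pd

my_data = pd.DataFrame({'Lat': [10.0, 20.0, 30.0],
                        'Lon': [1.0, 2.0, 3.0]})
reference_data = pd.DataFrame({'Lat': [20.0, 30.0],
                               'Lon': [2.0, 9.9]})

new_df = pd.merge(my_data, reference_data, on=['Lat', 'Lon'],
                  how='left', indicator='flag')
new_df['bin_flag'] = (new_df['flag'] == 'both').astype(int)
# Only the (20.0, 2.0) row matches on both Lat and Lon, so only it gets bin_flag 1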
You can also do something like the following (note that `in` on a bare Series checks the index, not the values, hence .values):
my_data.apply(lambda x: (x['Lat'] in reference_data['Lat'].values and x['Lon'] in reference_data['Lon'].values) * 1.0, axis=1)
and then you can just assign it wherever you like.
or, checking the (Lat, Lon) pairs together, which is stricter and maybe makes it easier to see what's going on:
my_data.apply(lambda x: ((x['Lat'], x['Lon']) in zip(reference_data['Lat'], reference_data['Lon'])) * 1.0, axis=1)

Data input for K means clustering with Scipy, Python?

I have a point dataset with two attributes and I would like to cluster these points based on the attribute values. I want to use K-means clustering but I am unsure what my input data should look like when using SciPy's implementation.
For example, should I make a numpy array with each row containing FID, attribute 1, attribute 2, x-coord, y-coord, or an array of just the attribute values? The attributes are integers and floats.
Each row in your data should be a discrete observation, and each column should correspond to a feature or dimension of your data. For your case, FID, attribute 1, attribute 2, x-coord, y-coord should be the columns, and each row should represent one observation.
from scipy.cluster.vq import kmeans,vq
nbStates = 4
Centers, _ = kmeans(Data, nbStates)
Data_id, _ = vq(Data, Centers)
where Data should be an Nx5 matrix whose 5 columns correspond to your 5 features (FID, attribute 1, attribute 2, x-coord, y-coord) and whose N rows correspond to N observations. In other words, reshape each feature's data array into a column vector, horizontally concatenate them, and pass the result to the kmeans function. nbStates is the number of clusters you expect to see; it has to be set beforehand. What you get back is Centers, a matrix with nbStates rows and one column per feature, holding the cluster centroids. Data_id is a column vector of length N giving the cluster label of each data point.
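A minimal runnable sketch of that setup, with made-up feature arrays (the names follow the snippet above):
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
n = 100  # hypothetical number of observations

# One 1-D array per feature; column_stack joins them into an N x 5 matrix
fid   = np.arange(n, dtype=float)
attr1 = rng.random(n)
attr2 = rng.random(n)
x     = rng.random(n)
y     = rng.random(n)
Data  = np.column_stack((fid, attr1, attr2, x, y))  # shape (n, 5)

nbStates = 4
Centers, _ = kmeans(Data, nbStates)  # one centroid per cluster, shape (nbStates, 5)
Data_id, _ = vq(Data, Centers)       # one cluster label per observation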
If you want to cluster solely on the attributes you should create an Nx2 matrix (according to the scipy docs, observations go in rows), with your attributes as columns and each datapoint as a row.
You will probably enhance your results by whitening (normalizing) the data points. Assuming your data have two fields attr1 and attr2 and you have a list dataset containing them, the corresponding code would look like:
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# One row per data point, one column per attribute
data = np.empty((len(dataset), 2))
for row, d in enumerate(dataset):
    data[row, 0] = d.attr1
    data[row, 1] = d.attr2

whitened_data = whiten(data)  # rescale each column to unit variance
clusters, _ = kmeans(whitened_data, 5)  # 5 is the number of clusters you assume
assignments, _ = vq(whitened_data, clusters)

Stata: extract p-values and save them in a list

This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
Do a bunch of tests (e.g. lrtest results in a foreach loop)
Extract the p-value from each test and save them in a list of some kind
Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
    myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
    pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that would save the p-values in a matrix and then you can manipulate the matrix, maybe using Mata or standard matrix manipulation in Stata.
matrix storeMyP = J(2, 1, .) //create an empty matrix with 2 rows (one per variable we loop over) and 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
    loc n = `n' + 1 //each iteration, adjust the count
    reg a `variableName'
    test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (see -help test-; there is a lot this command can do)
    matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on is that Stata automatically stores certain quantities after estimation and test commands. When a help file says this command stores the following values in r(), you refer to them with single-quote macro syntax, such as `r(p)'.
It could also be interesting for you to convert the matrix column(s) into variables using svmat storeMyP, or see help svmat for more info.

how to apply cell style when using `append` in openpyxl?

I am using openpyxl to create an Excel worksheet. I want to apply styles when I insert the data. The trouble is that the append method takes a list of data and automatically inserts them to cells. I cannot seem to specify a font to apply to this operation.
I can go back and apply a style to individual cells after-the-fact, but this requires overhead to find out how many data points were in the list, and which row I am currently appending to. Is there an easier way?
This illustrative code shows what I would like to do:
def create_xlsx(self, header):
    self.ft_base = Font(name='Calibri', size=10)
    self.ft_bold = self.ft_base.copy(bold=True)
    if header:
        self.ws.append(header, font=self.ft_bold)  # cannot apply style during append
ws.append() is designed for appending rows of data easily. It does, however, also allow you to include placeless cells within a row so that you can apply formatting while adding data. This is primarily of interest when using write_only=True but will work for normal workbooks.
Your code would look something like:
from openpyxl import Workbook
from openpyxl.cell import Cell
from openpyxl.styles import Font

wb = Workbook()
ws = wb.active
data = [1, 3, 4, 9, 10]

def styled_cells(data):
    for c in data:
        if c == 1:
            c = Cell(ws, column="A", row=1, value=c)  # placeless cell carrying a font
            c.font = Font(bold=True)
        yield c

ws.append(styled_cells(data))
openpyxl will correct the coordinates of such cells.
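Alternatively, a minimal sketch of the after-the-fact styling mentioned in the question (header and the font settings are assumptions taken from there): right after a plain append, the row just written is ws.max_row, so very little bookkeeping is needed:
ws.append(header)  # plain append, no styling
bold = Font(name='Calibri', size=10, bold=True)
for cell in ws[ws.max_row]:  # ws[row_number] returns the cells of that row
    cell.font = bold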