Creating a single variable using the data in three separate columns - PCA

I have results from three different tests for each participant:
Vocabulary > scores out of 104
Cloze_Test > scores out of 22
Read_hours > in terms of hours (ranges from 4 to 50 hours)
So the scores have different scales. I want to create a single variable using the data from these three columns. What is the best way? Standardize them and combine? If yes, how do I combine them?
I standardized using
data$Vocabulary <- as.data.frame(scale(data$sVocabulary))
but can't combine them. I have tried this:
data.Pax %>% mutate(Reading = cbind(sVocab, scloze, sreadhours))
I also tried PCA but am not sure if this is the right way to use PCA...
scoredata<-data[,c("Vocabulary","Cloze_Test","Read_hours")]
pca_profscore <- prcomp(scoredata, center = T, scale.= T)
data.Pax["Reading"] <- vectorizedscore <- pca_profscore$x[,1]
summary(pca_profscore)
print(pca_profscore)
pca_profscore$rotation[,1]

Related

Use map inside map?

I have two lists: one with some tidy data and another with models made with the tidymodels package:
data_list <- list(train,test)
model_fits <- list(tree,forest,xgb)
I want to make a new list with a confusion matrix for train and test for every model.
The function that calculates confusion matrix:
ConfMat <-
  function(df,data){
    df <-
      predict(df,new_data = data, type = "class") %>%
      mutate(truth = data$NetInc) %>%
      conf_mat(truth,.pred_class)}
I have tried this (x and y are arbitrary):
map(data_list,map(model_fits,ConfMat(x,y)))
My problem is that I have no idea how to actually set "x" and "y" right.
PS: double for loop works. I'm asking specifically for map solution or equivalent.
I appreciate all the help I can get! Cheers
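Use an anonymous function. Note that ConfMat takes the model as its first argument and the data as its second, so the inner call is ConfMat(y, x):
library(purrr)
result <- map(data_list, function(x) map(model_fits, function(y) ConfMat(y, x)))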
result

mutate_at - count the number of times specified columns exceed a specified number

This is a real-world problem but I am describing the reproducible iris example:
For each row, I want to count the number of columns whose names contain "Sepal" and whose values exceed 5. I want to store the result in a new column, and I want to use dplyr for this task. My attempt is the following:
iris %>% mutate_at(vars(contains("Sepal")),list(greater_than_5=~apply(.,1,function(x) sum(x>5))))
However, I get an error:
dim(X) must have a positive length
Any ideas?
Without seeing an example of what you want your output to look like and/or an explanation, it's hard to determine exactly what you're looking for, so here are two possible solutions (including the one I mentioned in the comments above). The first, I think, will be unsatisfactory since your count will be the same for all rows because the tibble is static. The second uses R's bread and butter--a tidy (long form) data structure--to count only those "columns" that have non-NA values. There we add a row id that we can group by later, pivot to a tidy form, filter out NAs, count the number of column/parameter names that contain your word of interest, and then pivot back to your original wide form tibble. We need to wrap the value column in list form since the data types differ across the different parameters. You could convert everything to character, of course, but you'll see this lets us recover the original types after unnesting.
library(tidyverse)
min_to_count = 5
title_word = "Sepal"
iris %>% mutate(name_count = ifelse(sum(str_detect(names(.),title_word)) > min_to_count, sum(str_detect(names(.),title_word)), NA_real_))
iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = "parameter", values_to = "value",
    values_ptypes = list(value = list())
  ) %>%
  filter(!is.na(unlist(value))) %>%
  group_by(id) %>%
  mutate(
    name_count = ifelse(
      sum(str_detect(parameter, title_word)) > min_to_count,
      sum(str_detect(parameter, title_word)), NA_real_
    )
  ) %>%
  pivot_wider(
    names_from = "parameter", values_from = "value"
  ) %>%
  unnest_legacy()

How to combine dataset with mxnet?

I have two separate folders containing 3D arrays (data); each folder contains files of the same classification. I used mxnet.gluon.data.ArrayDataset() to create a dataset for each label. Is there a way to combine these two datasets into a final training dataset that covers both classifications? The two datasets are different sizes.
e.g.
A_data = mx.gluon.data.ArrayDataset(list2,label_A )
noA_data = mx.gluon.data.ArrayDataset(list,label_noA)
^ I want to combine A_data and noA_data for a complete dataset.
Additionally, is there an easier way to combine the two folders with its classification into a mxnet dataset from the get-go? That would also solve my problem.
You could create an ArrayDataset that contains both. If list and list2 are both Python lists (and the labels are too, since + adds NumPy arrays elementwise rather than concatenating them), you could do something like
full_data = mx.gluon.data.dataset.ArrayDataset(list + list2, label_noA + label_A)
where len(label_noA) == len(list) and len(label_A) == len(list2)
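The concatenation only gives a usable dataset if sample i stays paired with label i. Here is a minimal sketch of that bookkeeping, assuming plain Python lists; the helper combine_labeled and the placeholder values are hypothetical and stand in for the arrays loaded from the two folders. The mxnet call is left as a comment so the snippet runs without mxnet installed.

```python
def combine_labeled(samples_x, labels_x, samples_y, labels_y):
    """Concatenate two labeled sets, keeping sample i paired with label i."""
    if len(samples_x) != len(labels_x) or len(samples_y) != len(labels_y):
        raise ValueError("each sample list must match its label list in length")
    # list(...) guards against NumPy arrays, where `+` would add elementwise
    samples = list(samples_x) + list(samples_y)
    labels = list(labels_x) + list(labels_y)
    return samples, labels


# Placeholder data (hypothetical) standing in for the 3D arrays in each folder
samples_no_a = ["no_a_0", "no_a_1", "no_a_2"]
labels_no_a = [0, 0, 0]
samples_a = ["a_0", "a_1"]
labels_a = [1, 1]

all_samples, all_labels = combine_labeled(
    samples_no_a, labels_no_a, samples_a, labels_a
)

# With mxnet installed, the combined lists could then be wrapped as:
#   full_data = mx.gluon.data.ArrayDataset(all_samples, all_labels)
```

The two inputs may have different lengths; only the pairing within each set has to match.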

Stata: extract p-values and save them in a list

This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
Do a bunch of tests (e.g. lrtest results in a foreach loop)
Extract the p-value from each test and save them in a list of some kind
Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
  myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
  pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that would save the p-values in a matrix and then you can manipulate the matrix, maybe using Mata or standard matrix manipulation in Stata.
matrix storeMyP = J(2, 1, .) //create empty matrix with 2 (as many variables as we are looping over) rows, 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
    loc n = `n' + 1 //each iteration, adjust the count
    reg a `variableName'
    test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (check: -help test- there is lots this can do)
    matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on is that Stata automatically stores certain quantities after estimation and test commands. When the help files say that a command stores values in r(), you can refer to them in macro quotes, e.g. `r(p)' (an opening backtick and a closing single quote).
It could also be interesting for you to convert the matrix column(s) into variables using svmat storeMyP, or see help svmat for more info.

Column of lists inside a dataframe in R

Let's create the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random draw from a probability distribution, and the second column stores a list with that distribution's parameters and name.
The dataframe df looks like:
      sample       params
1 0.85102972 0, 1, Normal
2 0.67313218  5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there any alternative way of storing data like this? something like a multi-dimensional dataframe? I do not want to add columns for each parameter because it will scatter the dataframe with missing values.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal = list(sample=rnorm(1,0,1),params=list(mean=0,sd=1,dist="Normal")),
                 gamma = list(sample=rgamma(1,5,5),params=list(shape=5,rate=5,dist="Gamma")),
                 binom = list(sample=rbinom(1,7,0.7),params=list(size=7,prob=0.7,dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),params=list(mean=2,sd=3,dist="Normal")),
                 tdist = list(sample=rt(1,3),params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
     normal       gamma       binom     normal2       tdist
   "Normal"     "Gamma"  "Binomial"    "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = the mean and df$param2 = the standard deviation. A row with df$distribution = 'Student' should have df$param1 = the degrees of freedom and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
                 param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
                           distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
                           param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.