mutate at - count the number of times specified columns exceed some specified number, - apply

This is a real-world problem but I am describing the reproducible iris example:
For each row, I want to count how many of the columns whose names contain "Sepal" have a value greater than 5, and store the result in a new column. I want to use dplyr for this task. My attempt is the following:
iris %>% mutate_at(vars(contains("Sepal")),list(greater_than_5=~apply(.,1,function(x) sum(x>5))))
However, I get an error:
dim(X) must have a positive length
Any ideas?

Without an example of your expected output, it's hard to tell exactly what you're looking for, so here are two possible solutions (including the one I mentioned in the comments above). I think the first will be unsatisfactory, since your count will be the same for all rows because the tibble is static. The second uses R's bread and butter, a tidy (long-form) data structure, to count only those "columns" that have non-NA values. There we add a row id that we can group by later, pivot to a tidy form, filter out NAs, count the number of column/parameter names that contain your word of interest, and then pivot back to your original wide-form tibble. We need to wrap the value column in a list since the data types differ across the parameters. You could convert everything to character, of course, but the list form lets us recover the original types after unnesting.
library(tidyverse)
min_to_count = 5
title_word = "Sepal"
iris %>%
  mutate(
    name_count = ifelse(
      sum(str_detect(names(.), title_word)) > min_to_count,
      sum(str_detect(names(.), title_word)), NA_real_
    )
  )
iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = "parameter", values_to = "value",
    values_ptypes = list(value = list())
  ) %>%
  filter(!is.na(unlist(value))) %>%
  group_by(id) %>%
  mutate(
    name_count = ifelse(
      sum(str_detect(parameter, title_word)) > min_to_count,
      sum(str_detect(parameter, title_word)), NA_real_
    )
  ) %>%
  pivot_wider(
    names_from = "parameter", values_from = "value"
  ) %>%
  unnest_legacy()
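If the goal is literally what the question states (a per-row count of how many "Sepal" columns exceed 5), here is a minimal sketch, assuming dplyr 1.0 or later so that across() is available. The error in the question arises because inside mutate_at the `.` is a single column (a plain vector with no dim), so apply() has nothing to iterate over. The name threshold is my own, introduced only for illustration.

```r
library(dplyr)

threshold <- 5  # illustrative name, not from the question

result <- iris %>%
  # across() returns the selected columns as a data frame;
  # comparing it to a number gives a logical matrix, and
  # rowSums() counts the TRUE values in each row
  mutate(greater_than_5 = rowSums(across(contains("Sepal")) > threshold))

head(result)
```

In base R the same count can be obtained with rowSums(iris[grepl("Sepal", names(iris))] > 5).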

Related

lapply with a list of lists

I believe there must be related questions in the community, but I could not find one that addresses my case.
Basically, I am trying to produce three plots with the lapply function. Below is my code.
p_grid <- seq(0, 1, length.out = 20)
prior_uni <- rep(1, 20)
prior_bi <- ifelse(p_grid < 0.5, 0, 1)
prior_exp <- exp(-5 * abs(p_grid - 0.5))
prior_list <- list(prior_uni, prior_bi, prior_exp)
ggs <- lapply(prior_list, function(x) {
  likelihood <- dbinom(6, 9, prob = p_grid)
  unstd.post <- likelihood * x
  std.post <- unstd.post / sum(unstd.post)
  plot_post <- plot(p_grid, std.post, type = "b", ylim = c(0, max(x)))
  mtext(paste0(x))
})
This gives me the plots, but the mtext function does not work as I expected. Instead of showing the titles prior_uni, prior_bi and prior_exp respectively, it prints every single value of the current list element (e.g., all values of prior_uni), overlapping each other.
This is a bit confusing to me. Judging by the plots, the function inside lapply receives each of the three elements of prior_list in turn, not every single value. In other words, x is one of the three elements of prior_list, not one of the sixty (3*20) values, yet mtext behaves as if it were the opposite.
I hope I have expressed myself clearly. I look forward to your responses.
Best regards,
Jilong
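A possible explanation, offered as a sketch rather than a definitive answer: inside the function, x is the whole numeric vector of 20 prior values, so paste0(x) yields 20 strings and mtext draws all of them on top of each other. The object's name is not carried along with x. One way around this, assuming you are free to name the list elements, is to iterate over the names instead:

```r
p_grid <- seq(0, 1, length.out = 20)

# Naming the list elements keeps each label next to its values
prior_list <- list(
  prior_uni = rep(1, 20),
  prior_bi  = ifelse(p_grid < 0.5, 0, 1),
  prior_exp = exp(-5 * abs(p_grid - 0.5))
)

likelihood <- dbinom(6, 9, prob = p_grid)

# Loop over the names so that both the label (nm) and the
# values (prior_list[[nm]]) are available inside the function
posteriors <- lapply(names(prior_list), function(nm) {
  prior <- prior_list[[nm]]
  unstd.post <- likelihood * prior
  std.post <- unstd.post / sum(unstd.post)
  plot(p_grid, std.post, type = "b", ylim = c(0, max(prior)))
  mtext(nm)   # a single string, so a single title
  std.post    # return the posterior rather than mtext's NULL
})
```

Returning std.post also means the lapply result holds the three standardized posteriors instead of a list of NULLs.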

PowerQuery: How to replace text with each column name for multiple columns

I'm trying to replace "x" in each column (except for the first 2 columns) with the column name, in a table with an unknown number of columns but at least 2.
I found the code to change one column, but I want it to be dynamic:
#"Ersatt värde" = Table.ReplaceValue(Källa,"x", Table.ColumnNames(Källa){2},Replacer.ReplaceText,{Table.ColumnNames(Källa){2}})
Any ideas on how to solve it?
If I understand correctly, I think you can try either approach below:
#"Ersatt värde" =
    let
        columnsToTransform = List.Skip(Table.ColumnNames(Källa), 2),
        accumulated = List.Accumulate(columnsToTransform, Källa, (tableState as table, columnName as text) =>
            Table.ReplaceValue(tableState, "x", columnName, Replacer.ReplaceText, {columnName})
        )
    in
        accumulated
or:
#"Ersatt värde" =
    let
        columnsToTransform = List.Skip(Table.ColumnNames(Källa), 2),
        transformations = List.Transform(columnsToTransform, (columnName) => {columnName, each
            Replacer.ReplaceText(Text.From(_), "x", columnName)}),
        transformed = Table.TransformColumns(Källa, transformations)
    in
        transformed
Both ways follow a similar approach:
Figure out which columns to do replacements in (i.e. all except the first 2 columns)
Loop over columns determined in previous step and actually do the replacement.
I've used Replacer.ReplaceText since that's what you'd used in your question, but I believe this will replace both partial matches and full matches.
If you only want full matches to be replaced, I think you can use Replacer.ReplaceValue instead.

grepl() and lapply to fill missing values

I have the following data as an example:
fruit.region <- data.frame(full =c("US red apple","bombay Asia mango","gold kiwi New Zealand"), name = c("apple", "mango", "kiwi"), country = c("US","Asia","New Zealand"), type = c("red","bombay","gold"))
I would like R to be able to look at other items in the "full" (name) column that don't have values for "name", "country" and "type" and see if they match other items. For instance, if full had a 4th row with "bombay US mango" it would be able to identify that the country should read US, bombay should be under type and mango should be under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x) {
  check = grepl(x, fruit.region, ignore.case = TRUE)
  print(check)
})
I'm at a bit of a standstill. I've read through a number of regex posts and the R help pages on grepl but haven't found a good solution. What I have doesn't fully identify a logical "match" vector, so I'm unable to subset and use an if statement to concatenate different elements. Ideally, I'd like to be able to replace these elements in data.table form, as my fruit.region will actually be a data table. Does anyone have any suggestions on the best approach?
You can use the str_detect function from the stringr library. This gives a list, ready to rbind:
library(stringr)
addnewrow <- function(newfruit) {
  # For each of name/country/type, keep the known values found in the new entry
  z <- lapply(fruit.region[, 2:4], function(x) x[str_detect(newfruit, x)])
  z$full <- newfruit
  z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)
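For comparison, here is a base R sketch of the same idea that returns a one-row data frame and falls back to NA when nothing matches. The helper names match_known and add_new_row are hypothetical, introduced only for this illustration. Matching uses grepl with fixed = TRUE, so it is case-sensitive and treats the known values as literal strings rather than regular expressions.

```r
fruit.region <- data.frame(
  full    = c("US red apple", "bombay Asia mango", "gold kiwi New Zealand"),
  name    = c("apple", "mango", "kiwi"),
  country = c("US", "Asia", "New Zealand"),
  type    = c("red", "bombay", "gold"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first known value that occurs
# literally inside `entry`, or NA if none does
match_known <- function(entry, known) {
  known <- unique(known)
  hits <- known[vapply(known, grepl, logical(1), x = entry, fixed = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Hypothetical helper: build a complete row for a new `full` string
add_new_row <- function(entry, known = fruit.region) {
  data.frame(
    full    = entry,
    name    = match_known(entry, known$name),
    country = match_known(entry, known$country),
    type    = match_known(entry, known$type),
    stringsAsFactors = FALSE
  )
}

updated <- rbind(fruit.region, add_new_row("bombay US mango"))
updated
```

Because the helper always returns exactly one value per column, the rbind cannot fail on an entry with zero or several matches, at the cost of silently keeping only the first match.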

Identify the format of a character column in R

I'm dealing with a huge dataset having 500 columns and a huge number of rows out of which I can take a significantly big sample (e.g. 1 million).
All the columns are in character format, although they can represent different data types: numeric, date, and so on. I need to build a function that, given a column as input, recognizes its format, taking NA values into account as well.
For instance, given a column col, I recognise if it is numeric in this way.
col <- c(as.character(runif(10000)), rep('NaN', 10))
maxPercNa <- 0.10
nNa <- sum(is.na(as.numeric(col)))
percNa <- nNa / length(col)
isNumeric <- percNa < maxPercNa
In a similar way, I need to recognise dates, integers, ... I was thinking about using regular expressions. A challenge is that the dataset is very big, so the technique should be efficient.
If anyone comes up with a brilliant idea, it'll be really appreciated :) Thanks in advance!
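One way to extend the numeric check above to integers and dates, as a rough sketch only: the function name guess_type and the single date format "%Y-%m-%d" are assumptions of mine, and real data would likely need several candidate formats. For speed, it can be run on a sample of each column rather than on all rows.

```r
# Hypothetical helper: guess the type of a character column using the
# same "share of failed conversions" idea as the numeric check above
guess_type <- function(col, maxPercNa = 0.10, date_format = "%Y-%m-%d") {
  n <- length(col)
  if (n == 0) return("character")

  # Numeric (and integer) check via vectorized coercion
  num <- suppressWarnings(as.numeric(col))
  if (sum(is.na(num)) / n < maxPercNa) {
    ok <- num[!is.na(num)]
    if (all(ok == round(ok))) return("integer")
    return("numeric")
  }

  # Date check: with an explicit format, unparseable strings become NA
  dates <- as.Date(col, format = date_format)
  if (sum(is.na(dates)) / n < maxPercNa) return("date")

  "character"
}

guess_type(c(as.character(runif(10000)), rep("NaN", 10)))
guess_type(c(as.character(1:20), NA))
guess_type(c("2020-01-15", "2021-12-31"))
guess_type(c("apple", "pear"))
```

Coercion functions such as as.numeric and as.Date are vectorized, so they tend to scale better than row-by-row regular expression checks, though that is worth benchmarking on your own data.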

Column of lists inside a dataframe in R

Consider the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random number of a probability distribution and the second column stores a list with its parameters and name.
The dataframe df looks like:
      sample       params
1 0.85102972 0, 1, Normal
2 0.67313218  5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there an alternative way of storing data like this, something like a multi-dimensional data frame? I do not want to add a column for each parameter because it would scatter missing values throughout the data frame.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(
  normal  = list(sample = rnorm(1, 0, 1),   params = list(mean = 0, sd = 1, dist = "Normal")),
  gamma   = list(sample = rgamma(1, 5, 5),  params = list(shape = 5, rate = 5, dist = "Gamma")),
  binom   = list(sample = rbinom(1, 7, 0.7), params = list(size = 7, prob = 0.7, dist = "Binomial")),
  normal2 = list(sample = rnorm(1, 2, 3),   params = list(mean = 2, sd = 3, dist = "Normal")),
  tdist   = list(sample = rt(1, 3),         params = list(df = 3, dist = "Student-T"))
)
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
     normal       gamma       binom     normal2       tdist
   "Normal"     "Gamma"  "Binomial"    "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have values in both df$param1 and df$param2, whereas a row with df$distribution = 'Student' should have a value in df$param1 and df$param2 = NA. Something like the following:
dg <- data.frame(sample = rnorm(1, 0, 1), distribution = 'Normal',
                 param1 = 0, param2 = 1)
dg <- rbind(dg, data.frame(sample = rgamma(1, 5, 5),
                           distribution = 'Gamma', param1 = 5, param2 = 5))
dg <- rbind(dg, data.frame(sample = rt(1, 3), distribution = 'Student',
                           param1 = 3, param2 = NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.
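As a small sketch of that last suggestion, assuming the params list column is kept as it is: the distribution name can be pulled into an ordinary character column with vapply, after which rows can be filtered on it directly. The column name dist is my own choice.

```r
# A shortened version of the question's data frame, built in one call
params <- list(
  list(mean = 0, sd = 1, dist = "Normal"),
  list(shape = 5, rate = 5, dist = "Gamma"),
  list(df = 3, dist = "Student-T")
)
df <- data.frame(
  sample = c(rnorm(1, 0, 1), rgamma(1, 5, 5), rt(1, 3)),
  params = I(params)   # I() keeps the list as a single list column
)

# Extract the "dist" element of every parameter list into its own column;
# vapply checks that each result is exactly one string
df$dist <- vapply(df$params, `[[`, character(1), "dist")

# The variable-length parameter lists stay untouched in df$params
df[df$dist == "Normal", ]
```

Using vapply instead of sapply makes the call fail early if some record lacks a dist entry, rather than quietly returning a list.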