Exclude a few columns from a grouped selection by `dplyr::contains` - regex

Suppose a data frame with several groups of columns (linked by their names, here Bla and D):
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# A BlaTata BlaTato BlaTota BlaToto C D1 D2 D3 D4
# 1 2 3 4 5 6 7 8 9 10
How can I easily drop all columns containing Bla (i.e., select(-contains('Bla'))) except for a few of them that I would explicitly "protect" from the (de)selection procedure?
Supposing I want to "protect" BlaTato and BlaToto:
df %>% mutate(saveBlaToto=BlaToto, saveBlaTato=BlaTato) %>%
select(-starts_with('Bla')) %>%
mutate(BlaToto=saveBlaToto, BlaTato=saveBlaTato) %>%
select(-contains('save')) %>%
select(order(colnames(.)))
# A BlaTato BlaToto C D1 D2 D3 D4
# 1 3 5 6 7 8 9 10
There must be an easier and more elegant way ;-)
Supposing it is not handy to select by column index etc.
Something like select(-contains('Bla' but keep c('BlaTato','BlaToto'))) possibly for several columns to be preserved...
EDIT
This question is answered in Frank's "New Question" below.
The original question, simpler and answered in his "First Question", was "How to drop all columns containing B except from B2 in the following data frame":
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)

First question. If you look at ?select, you'll see that you can enter a regular expression, like
# example
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop B, protect B2
df %>% select(-matches('^B[^2]$'))
A B2 C D1 D2 D3 D4
1 1 3 6 7 8 9 10
Reading the regex:
^ and $ indicate start and end of the string.
[^x] means any character except x.
New question. It looks like dplyr doesn't support Perl-style regexes yet, so...
# example
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop Bla, protect BlaTato, BlaToto
df %>% select(-grep('^Bla(?!Tato|Toto)', names(.), perl=TRUE))
A BlaTato BlaToto C D1 D2 D3 D4
1 1 3 5 6 7 8 9 10
Reading the regex:
(?!xyz) means "don't be followed by xyz"
x|y means x or y
For more info on regular expressions and the base R functions for using them, read ?regex and ?grep. Really, though, you shouldn't name your columns like this. If you find yourself in a position where you need to parse column names, you probably made a mistake earlier on.
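As a possible alternative that avoids the lookahead entirely, here is a minimal base R sketch that builds the list of columns to drop with set operations. The names protect, bla and to_drop are made up for illustration. Note that, unlike the regex, this protects exact column names only.

```r
df <- data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
                 C=6, D1=7, D2=8, D3=9, D4=10)

# columns to protect (exact names; hypothetical variable name)
protect <- c("BlaTato", "BlaToto")

# every column whose name starts with Bla
bla <- grep("^Bla", names(df), value = TRUE)

# drop the Bla columns that are not protected
to_drop <- setdiff(bla, protect)
res <- df[, !(names(df) %in% to_drop), drop = FALSE]
names(res)
```

The column order of df is preserved, so no reordering step is needed afterwards.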

Related

Use a regular expression to extract a substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
> head(foo)
Old.Clone.Name New.Clone.Name File
1 A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.
Hello. If you read the data in as a table, then foo[3] is a list (a one-column data frame), and str_extract does not accept lists, only character vectors. You should therefore use lapply to extract the match from every element:
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"
str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.
The following code will repeat what you asked (just copy and paste to your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NA is subtle. R stores matrix entries by column, so for a matrix foo[3] is the single element in the 3rd row and 1st column; for a data frame, foo[3] is a one-column data frame rather than a character vector. To refer to the third column as a vector, use foo[,3], or foo<-data.frame(foo); foo[[3]].
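To make the difference between foo[3] and foo[[3]] concrete, here is a minimal base R sketch using regexpr and regmatches instead of stringr. The two-row foo below is a cut-down copy of the question's data, built only for illustration.

```r
# cut-down version of the question's data frame (illustration only)
foo <- data.frame(
  Old.Clone.Name = c("A", "B"),
  File = c("A_mask_MF_final_IS2_SAEE7-1_02.nrrd",
           "B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd"),
  stringsAsFactors = FALSE
)

is.data.frame(foo["File"])   # single brackets keep the data frame wrapper
is.character(foo[["File"]])  # double brackets give the character vector

pattern <- "S[A-Z]{3}[0-9]{1,2}-[0-9]_02"
files <- foo[["File"]]

# regexpr finds the first match in each string, regmatches extracts it
codes <- regmatches(files, regexpr(pattern, files))
codes
```

One difference to keep in mind: regmatches silently drops strings that have no match, whereas str_extract returns NA for them.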

R: How to group and aggregate list elements using regex?

I want to aggregate (sum up) the following product list by groups (see below):
prods <- list("101.2000"=data.frame(1,2,3),
"102.2000"=data.frame(4,5,6),
"103.2000"=data.frame(7,8,9),
"104.2000"=data.frame(1,2,3),
"105.2000"=data.frame(4,5,6),
"106.2000"=data.frame(7,8,9),
"101.2001"=data.frame(1,2,3),
"102.2001"=data.frame(4,5,6),
"103.2001"=data.frame(7,8,9),
"104.2001"=data.frame(1,2,3),
"105.2001"=data.frame(4,5,6),
"106.2001"=data.frame(7,8,9))
test= list("100.2000"=data.frame(2,3,5),
"100.2001"=data.frame(4,5,6))
names <- c("A", "B", "C")
prods <- lapply(prods, function (x) {colnames(x) <- names; return(x)})
Each element of the product list (prods) has a name combination of the product number and the year (e.g. 101.2000 --> 101 = prod nr. and 2000 = year). And the groups only contain product numbers for the aggregation.
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
My expected result shows the aggregated product groups by year:
$group1.2000
A B C
1 8 10 12
$group2.2000
A B C
1 5 7 9
$group3.2000
A B C
1 11 13 15
$group1.2001
A B C
1 8 10 12
$group2.2001
A B C
1 5 7 9
$group3.2001
A B C
1 11 13 15
So far, I tried this way: First I decomposed the names of prods into product numbers:
prodnames <- names(prods)
prodnames_sub <- gsub("\\..*.","", prodnames)
And then I tried to aggregate using lapply:
lapply(prods, function(x) aggregate( ... , FUN = sum))
However, I didn't find how to implement the previous product numbers in the aggregation function. Ideas? Thanks
Here are two approaches. No packages are used in either one.
1) Using lists. Create a two-column data.frame S from the groups, whose columns are the products (values column) and the associated groups (ind column). Then create By, the list to split by. In the code that produces By, sub("\\..*", "", names(prods)) extracts the product number, and match is then used to find the associated group; sub(".*\\.", "", names(prods)) extracts the year. Next, perform the split and lapply over it to run the summations. The two components of By (group and year) can be reversed to change the order of the output, if desired.
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
By <- list(group = S$ind[match(sub("\\..*", "", names(prods)), S$values)],
year = sub(".*\\.", "", names(prods)))
lapply(split(prods, By), function(x) colSums(do.call(rbind, x)))
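The name-splitting and group lookup used to build By can be tried on their own. This is a small sketch on three made-up element names; nm, prod, year and grp are illustrative variable names, not part of the answer.

```r
# three example element names in the "product.year" form used by prods
nm <- c("101.2000", "104.2000", "106.2001")

prod <- sub("\\..*", "", nm)  # strip everything from the first dot on
year <- sub(".*\\.", "", nm)  # strip everything up to the last dot

# lookup table: one row per product, with its group in the ind column
S <- stack(list(group1 = c(101, 106), group2 = c(102, 104), group3 = c(105, 103)))

# match() compares the character product codes against the numeric values
grp <- as.character(S$ind[match(prod, S$values)])

paste(grp, year, sep = ".")
```

The pasted labels have the same group.year form as the names that split produces for the result list.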
2) Using data.frames. Convert the groups and prods each to a data frame, merge them, perform an aggregate and split back into a list. The output is the same as requested except for order. (Reverse the two right-hand variables in the aggregate formula to get the order shown in the question, but that will also reverse the two parts of each component name in the output list.)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
DF0 <- do.call(rbind, prods)
DF <- cbind(do.call(rbind, strsplit(rownames(DF0), ".", fixed = TRUE)), DF0)
M <- merge(DF, S, all.x = TRUE, by = 1)
Ag <- aggregate(cbind(A, B, C) ~ ind + `2`, M, sum)
lapply(split(Ag, paste(Ag[[1]], Ag[[2]], sep = ".")), "[", 3:5)
giving:
$group1.2000
A B C
1 8 10 12
$group1.2001
A B C
4 8 10 12
$group2.2000
A B C
2 5 7 9
$group2.2001
A B C
5 5 7 9
$group3.2000
A B C
3 11 13 15
$group3.2001
A B C
6 11 13 15

R separating out number and units from a column in a dataframe

I have a dataframe which contains a column that has numbers as well as variable units:
num <- c(1:5)
val <- c("5%","10K", "100.2mv","1.4g","1.007kbars")
df <- data.frame(num,val)
df
How can I create two new columns from df$val, one that contains just the number and one the units?
Thank you for your help.
Here's a solution using stringr:
library(stringr)
df$extr_nums <- str_extract(val, "\\d+\\.?\\d*")
# remove the extracted number from each value, leaving the unit
df$extr_units <- str_replace(val, df$extr_nums, "")
df
num val extr_nums extr_units
1 1 5% 5 %
2 2 10K 10 K
3 3 100.2mv 100.2 mv
4 4 1.4g 1.4 g
5 5 1.007kbars 1.007 kbars
The regexp is translated as: "at least 1 digit, followed by optional dot, followed by optional digits".
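For comparison, the same split can be sketched in base R with one pattern and two capture groups. This is an illustrative variant, not the answer's method: the first group captures the number and the second captures whatever follows it.

```r
val <- c("5%", "10K", "100.2mv", "1.4g", "1.007kbars")

# group 1: digits, optional dot, optional digits; group 2: the rest (the unit)
pattern <- "^([0-9]+\\.?[0-9]*)(.*)$"

nums  <- as.numeric(sub(pattern, "\\1", val))
units <- sub(pattern, "\\2", val)

data.frame(val, nums, units)
```

Here nums is numeric rather than character, which may be more convenient if the values are used in calculations later.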

R function(): how to pass parameters which contain characters and regular expression

my data as follows:
> df2
id calmonth product
1 101 01 apple
2 102 01 apple&nokia&htc
3 103 01 htc
4 104 01 apple&htc
5 104 02 nokia
Now I want to count the ids whose product contains both 'apple' and 'htc' when calmonth='01'. I need this not only for 'apple' and 'htc' but also for other pairs, such as 'apple' and 'nokia'.
So I want to do this with a function like this:
xandy=function(a,b) data.frame(product=paste(a,b,sep='&'),
csum=length(grep('a.*b',x=df2$product))
)
I also made a parameter list like this:
para=c('apple','htc','nokia')
But here is the problem. When I pass parameters like
xandy(para[1],para[2])
the result is as follows:
product csum
1 apple&htc 0
The result I expect is:
product csum calmonth
1 apple&htc 2 01
2 apple&htc 0 02
So what is wrong with the way I pass the parameters?
And how can I add calmonth to the function xandy correctly?
FYI: this question stems from an earlier question of mine:
What's the R statement responding to SQL's 'in' statement
EDIT AFTER COMMENT
My expected result is:
product csum calmonth
1 apple&htc 2 01
2 apple&htc 0 02
My answer is another way to tackle your problem.
library(stringr)
library(plyr)  # provides ddply, used in get_data below
The function contains will split up the elements of a string vector according to the split character and evaluate if all target words are contained.
contains <- function(x, target, split="&") {
l <- str_split(x, split)
sapply(l, function(x, y) all(y %in% x), y=target)
}
contains(d$product, c("apple", "htc"))
[1] FALSE TRUE FALSE TRUE FALSE
The rest is just subsetting and summarizing
get_data <- function(a, b) {
e <- subset(d, contains(product, c(a, b)))
e$product2 <- paste(a, b, sep="&")
ddply(e, .(calmonth, product2), summarise, csum=length(id))
}
Using the data below, the order of the two products no longer matters (see the comment below).
get_data("apple", "htc")
calmonth product2 csum
1 1 apple&htc 1
2 2 apple&htc 2
get_data("htc", "apple")
calmonth product2 csum
1 1 htc&apple 1
2 2 htc&apple 2
I know this is not a direct answer to your question but I find this approach quite clean.
EDIT AFTER COMMENT
The reason you get csum=0 is simply that you are searching for the wrong regex pattern: 'a.*b' is taken literally, i.e. the letter a, then anything, then the letter b, not apple ... htc. You need to construct the pattern from the arguments, i.e. paste0(a, ".*", b).
Here is a complete solution. I would not call it beautiful code, but anyway (note that I changed the data to show that it generalizes across months).
library(plyr)
df2 <- read.table(text="
id calmonth product
101 01 apple
102 01 apple&nokia&htc
103 01 htc
104 02 apple&htc
104 02 apple&htc", header=T)
xandy <- function(a, b) {
pattern <- paste0(a, ".*", b)
d1 <- df2[grep(pattern, df2$product), ]
d1$product <- paste0(a,"&", b)
ddply(d1, .(calmonth), summarise,
csum=length(calmonth),
product=unique(product))
}
xandy("apple", "htc")
calmonth csum product
1 1 1 apple&htc
2 2 2 apple&htc
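The point about the literal pattern can be checked in isolation. This is a small base R sketch on the product column from the question; the last line is an order-independent variant added for illustration, not something taken from the answer.

```r
product <- c("apple", "apple&nokia&htc", "htc", "apple&htc", "nokia")
a <- "apple"
b <- "htc"

# literal pattern: the letter a, anything, then the letter b
n_literal <- length(grep("a.*b", product))

# pattern built from the arguments: apple, anything, then htc
n_built <- length(grep(paste0(a, ".*", b), product))

# order-independent check: both words appear somewhere in the string
n_both <- sum(grepl(a, product, fixed = TRUE) & grepl(b, product, fixed = TRUE))

c(n_literal, n_built, n_both)
```

None of the products contains the letter b, which is why the literal pattern counts nothing.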

R- regex index of start position and then add it to a string?

So far I have been able to merge two files and get the following dataframe (df1):
ID someLength someLongerSeq someSeq someMOD someValue
A 16 XCVBNMHGFDSTHJGF NMH T3(P) 7
A 16 XCVBNMHGFDSTHJGF NmH M3(O); S4(P); S6(P) 1
B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S9(P) 5
C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q16(D); S19(P) 7
D 19 HSEKDFGSFDKELJGFZZX KELJ S7(P); C9(C); S10(P) 1
I am looking for a way to do a regex match based on the "someSeq" column: look for that substring in the "someLongerSeq" column, get the start location of the match, and then add that to the whole numbers that are attached to the characters, such as T3(P).
Example:
For the second row, "ID":"A", "someSeq":"NmH" matches starting at location 4 of someLongerSeq (after converting NmH to upper case). So I want to add that number 4 to the someMOD fields M3(O);S4(P);S6(P) so that I get M7(O);S8(P);S10(P), and then overwrite the someMOD column with the new value.
And do that for each row. The regex match is on a per-row basis.
Any help is really appreciated. Thanks.
First of all, I should mention that your data is hard to read in as posted. I modified it slightly (I removed the spaces from the someMOD column) so that it can be read. This is not a problem for you, since you already have your data in a data.frame. So I read the data like this:
dat <- read.table(text='ID someLength someLongerSeq someSeq someMOD someValue
A 16 XCVBNMHGFDSTHJGF NMH T3(P) 7
A 16 XCVBNMHGFDSTHJGF NmH M3(O);S4(P);S6(P) 1
B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S9(P) 5
C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q16(D);S19(P) 7
D 19 HSEKDFGSFDKELJGFZZX KELJ S7(P);C9(C);S10(P) 1',header=TRUE)
Then the idea is:
process the data row by row using apply
use gregexpr to get the index of someSeq in someLongerSeq
use gsubfn to add that index to each number in someMOD
Here is the whole solution:
library(gsubfn)
res <- t(apply(dat,1,function(x){
idx <- gregexpr(x['someSeq'],x['someLongerSeq'],
ignore.case = TRUE)[[1]][1]
x[['someMOD']] <- gsubfn("[[:digit:]]+",
function(x) as.numeric(x)+idx,
x[['someMOD']])
x
}))
as.data.frame(res)
ID someLength someLongerSeq someSeq someMOD someValue
1 A 16 XCVBNMHGFDSTHJGF NMH T8(P) 7
2 A 16 XCVBNMHGFDSTHJGF NmH M8(O);S9(P);S11(P) 1
3 B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S18(P) 5
4 C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q23(D);S26(P) 7
5 D 19 HSEKDFGSFDKELJGFZZX KELJ S18(P);C20(C);S21(P) 1
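For readers without gsubfn, the same idea can be sketched in base R with regexpr and regmatches. shift_positions is a hypothetical helper name introduced here. One thing to watch: regexpr reports 1-based positions, so NmH is found at 5, while the question's example counts it as 4; subtract 1 from the index if the question's M7(O);S8(P);S10(P) is the result you want.

```r
# hypothetical helper: add `offset` to every run of digits in `mod`
shift_positions <- function(mod, offset) {
  m <- gregexpr("[0-9]+", mod)
  digits <- regmatches(mod, m)
  regmatches(mod, m) <- lapply(digits, function(d) as.character(as.integer(d) + offset))
  mod
}

seq_long  <- "XCVBNMHGFDSTHJGF"
seq_short <- "NmH"
mod       <- "M3(O);S4(P);S6(P)"

# 1-based start of the case-insensitive match
idx <- regexpr(seq_short, seq_long, ignore.case = TRUE)[1]

shift_positions(mod, idx)      # same offset as the answer above
shift_positions(mod, idx - 1)  # offset as counted in the question
```

To process a whole data frame, the helper could be called once per row, for example with mapply over the three columns.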