Use a regular expression to extract a substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
> head(foo)
  Old.Clone.Name New.Clone.Name File
1 A              Aa             A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B              Bb             B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C              Cc             C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D              Dd             D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E              Ee             F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F              Ff             F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.

Hello. If you read the data in as a table, foo[3] is a list (a one-column data frame), and str_extract does not accept lists, only character vectors. Use lapply to extract the match from every element:
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"

str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.

The following code reproduces what you asked for (just copy and paste it into your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd

# the blank line above ends the scan() input
foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NA is subtle. In a data frame, foo[3] is a one-column data frame rather than a character vector; in a matrix, R stores entries by column, so foo[3] is the element in the 3rd row and 1st column. To get the third column as a vector, use foo[,3], or foo <- data.frame(foo); foo[[3]].
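For reference, the same extraction can be sketched in base R without stringr. This swaps str_extract for regexpr and regmatches, and the two-row foo below is a cut-down stand-in for the data in the question, so treat it as an illustration rather than the asker's exact setup:

```r
# Cut-down stand-in for the data frame in the question (assumption)
foo <- data.frame(
  File = c("A_mask_MF_final_IS2_SAEE7-1_02.nrrd",
           "B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd"),
  stringsAsFactors = FALSE
)

pattern <- "S[A-Z]{3}[0-9]{1,2}-[0-9]_02"

# foo$File (or foo[[3]] in the original data) is a character vector,
# unlike foo[3], which is a one-column data frame
m <- regexpr(pattern, foo$File)

# regmatches() only returns the rows that matched, so fill the rest with NA
hit <- !is.na(m) & m > 0
codes <- rep(NA_character_, nrow(foo))
codes[hit] <- regmatches(foo$File, m)
codes
```

Rows without a match stay NA, which mirrors what str_extract does for non-matching strings.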

Related

Excel | Get all column/row names in which a specific text is as a list

It is difficult for me to describe the problem in the title, so excuse any misleading description.
The easiest way to describe what I need is with an example. I have a table like:
    A   B   C
1   x
2   x   x
3       x   x
What I want is a formula that, for every row and every column, lists the names of the columns (or rows) in which an x is placed. For the example:
           A     B     C
           1,2   2,3   3
1   A      x
2   A, B   x     x
3   B, C         x     x
The column and row names are not the same as Excel's own column letters and row numbers. A simple IF statement (WENN in German Excel) works for single cells (=IF(C3="x";C1)), but not for a range (=IF(C3:E3="x";C1:E1)). What should such a formula look like?

So I found the answer to my problem. Excel provides the normal CONCATENATE function. What is needed is something like a CONCATENATEIF (in German = verkettenwenn) function. By adding a module in VBA based on a thread from ransi from 2011 on the ms-office-forum.net the function verkettenwenn can be used. The code for the German module looks like:
Option Explicit
Public Function verkettenwenn(Bereich_Kriterium, Kriterium, Bereich_Verketten)
    Dim mydic As Object
    Dim L As Long
    Set mydic = CreateObject("Scripting.Dictionary")
    For L = 1 To Bereich_Kriterium.Count
        If Bereich_Kriterium(L) = Kriterium Then
            mydic(L) = Bereich_Verketten(L)
        End If
    Next
    verkettenwenn = Join(mydic.items, ", ")
End Function
With that module in place, one of the formulas for the example above looks like: =verkettenwenn(C3:E3;"x";$C$1:$K$1)
The English code for a CONCATENATEIF function should probably be:
Option Explicit
Public Function CONCATENATEIF(Criteria_Area, Criterion, Concate_Area)
    Dim mydic As Object
    Dim L As Long
    Set mydic = CreateObject("Scripting.Dictionary")
    For L = 1 To Criteria_Area.Count
        If Criteria_Area(L) = Criterion Then
            mydic(L) = Concate_Area(L)
        End If
    Next
    CONCATENATEIF = Join(mydic.items, ", ")
End Function

Need to extract 4 spaces of text before the occurrence of a word that appears in a column in a df, and may occur several times per row

I need to extract text (4 characters) before the occurrence of the word "exception" per row in a column of my dataframe. For example, see two lines of my data below:
MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)
MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014);
As you can see, "exception" occurs more than once per line, is sometimes capitalized and sometimes not, and is preceded by different text. I need to extract the "FMV", "MM", and "ial" that come before it in each case. The goal is to extract something like the following (comma separation would be fine but is not needed):
"FMVMM"
"FMVial"
I am planning on making all text lower case for simplicity, but I cannot find a regex to extract the 4 characters of text I need after that. Any recommendations?

You basically need strsplit, substr and nchar:
t1 <- "1.MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "2.MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "
f <- function(x){
  tmp <- strsplit(x, "[Ee]xception")[[1]]
  ret <- array(dim = length(tmp) - 1)
  for(i in 1:length(ret)){
    ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
f(t1) #gives "FMV , MM "
f(t2) #gives "FMV ,ial "
Avoiding the loop would be better but for now, this should work.
Edit by Qaswed: Improved the function (shorter and does not need tolower any more).
Edit by TigeronFire:
@Qaswed, thank you for your guidance. The answer, however, poses another problem: t1 and t2 are only two lines of a dataframe 10000 rows long. I attempted to add the column logic to the function you built in a few different ways, but I always received the error message:
"Error in strsplit(BOSSMWF_practice$Documents, "[Ee]xception") : non-character argument"
I tried the following with reference to dataframe column BOSSMWF_practice$Documents:
f <- function(x){
  tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
  ret <- array(dim = length(tmp) - 1)
  for(i in 1:length(ret)){
    ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
AND:
f <- function(x){
  BOSSMWF_practice$tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
  BOSSMWF_practice$ret <- array(dim = length(BOSSMWF_practice$tmp) - 1)
  for(i in 1:length(BOSSMWF_practice$ret)){
    BOSSMWF_practice$ret[i] <- substr(BOSSMWF_practice$tmp[i], start = nchar(BOSSMWF_practice$tmp[i]) - 3, stop = nchar(BOSSMWF_practice$tmp[i]))
  }
  return(paste(ret, collapse = ","))
}
I attempted to run the function on my applicable column using both function setups
BOSSMWF_practice$Funct <- f(BOSSMWF_practice$Documents)
But I always received the above error message. Can you take your advice one step further and indicate how to apply this to a dataframe and place the results in a new column?
Edit by Qaswed:
@TigeronFire, you should have added a comment to my answer or edited your question rather than editing my answer. To your comment:
#if your dataset looks something like this:
df <- data.frame(variable_name = c(t1, t2))
#...use
apply(df, 1, FUN = f)
#note: there was an error in f. You need strsplit(x, ...) and not strsplit(t1, ...).
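As a follow-up to the data frame problem, here is a rough vectorized sketch. It is not Qaswed's function: it swaps strsplit for gregexpr with a lookahead, and the two short strings and the data frame name docs are made up for illustration. The as.character call is there because a factor column is a likely cause of the "non-character argument" error, although that is an assumption about the asker's data:

```r
# Made-up stand-in for BOSSMWF_practice (assumption)
docs <- data.frame(Documents = c(
  "A: v1; FMV Exception: v2; MM Exception: v3",
  "MEI FMV: V3; Meeting Material exception: v1"
))

# strsplit() and friends need character input, not a factor
x <- as.character(docs$Documents)

# Every run of 4 characters that is directly followed by Exception/exception
hits <- regmatches(x, gregexpr(".{4}(?=[Ee]xception)", x, perl = TRUE))

# One comma-separated string per row, with surrounding spaces trimmed
docs$Funct <- vapply(hits,
                     function(h) paste(trimws(h), collapse = ","),
                     character(1))
docs$Funct
```

Rows that never mention the word end up with an empty string in the new column.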

Exclude a few columns from a grouped selection by `dplyr::contains`

Suppose a data frame with several groups of columns (linked by their names, here Bla and D):
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# A BlaTata BlaTato BlaTota BlaToto C D1 D2 D3 D4
# 1 2 3 4 5 6 7 8 9 10
How can I easily drop all columns containing Bla (i.e., select(-contains('Bla'))) except for a few of them that I would explicitly "protect" from the (de)selection procedure?
Supposing I want to "protect" BlaTato and BlaToto:
df %>% mutate(saveBlaToto=BlaToto, saveBlaTato=BlaTato) %>%
  select(-starts_with('Bla')) %>%
  mutate(BlaToto=saveBlaToto, BlaTato=saveBlaTato) %>%
  select(-contains('save')) %>%
  select(order(colnames(.)))
# A BlaTato BlaToto C D1 D2 D3 D4
# 1 3 5 6 7 8 9 10
There must be an easier and more elegant way ;-)
Supposing it is not handy to select by column index etc.
Something like select(-contains('Bla' but keep c('BlaTato','BlaToto'))) possibly for several columns to be preserved...
EDIT
This question is answered in Frank's "New Question" below.
The original question, simpler and answered in his "First Question", was "How to drop all columns containing B except from B2 in the following data frame":
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)

First question. If you look at ?select, you'll see that you can enter a regular expression, like
# example
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop B, protect B2
df %>% select(-matches('^B[^2]$'))
A B2 C D1 D2 D3 D4
1 1 3 6 7 8 9 10
Reading the regex:
^ and $ indicate start and end of the string.
[^x] means any character except x.
New question. It looks like dplyr doesn't support Perl-style regexes yet, so...
# example
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop Bla, protect BlaTato, BlaToto
df %>% select(-grep('^Bla(?!Tato|Toto)', names(.), perl=TRUE))
A BlaTato BlaToto C D1 D2 D3 D4
1 1 3 5 6 7 8 9 10
Reading the regex:
(?!xyz) means "don't be followed by xyz"
x|y means x or y
For more info on regular expressions and the base R functions for using them, read ?regex and ?grep. Really, though, you shouldn't name your columns like this. If you find yourself in a position where you need to parse column names, you probably made a mistake earlier on.
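The same negative-lookahead idea can be sketched in base R without dplyr. This is an illustration, not part of the original answer: it uses a logical index from grepl instead of negative positions from grep, because in base R a negative index built from an empty grep result selects no columns at all.

```r
df <- data.frame(A = 1, BlaTata = 2, BlaTato = 3, BlaTota = 4, BlaToto = 5,
                 C = 6, D1 = 7)

# TRUE for names that start with Bla and are not followed by Tato or Toto
drop <- grepl("^Bla(?!Tato|Toto)", names(df), perl = TRUE)

# Keep everything that is not flagged for dropping
kept <- df[, !drop, drop = FALSE]
names(kept)
```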

pattern matching in R using grepl

I have a dataframe dat like this
P pedigree cas
1 M rs2745406 T
2 M rs6939431 A
3 M SNP_DPB1_33156641 G
4 M SNP_DPB1_33156664_G P
5 M SNP_DPB1_33156664_A A
6 M SNP_DPB1_33156664_T A
I want to exclude all rows where the pedigree column starts with SNP_ and ends with either G, C, T, or A (_[GCTA]). In this case, this would be rows 4,5,6.
How can I achieve this in R? I have tried
multisnp <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree)=="TRUE")
new_dat <- dat[-multisnp,]
My multisnp vector is empty, but I can't figure out how to fix it so that it matches the pattern I want. I think it is my wildcard * usage that is wrong.

You can use the following with .*? (match everything in non greedy way):
multisnp <- which(grepl("^SNP_.*?_[GCTA]$", dat$pedigree))
                              ^^^

You can subset dat like this
new_dat <- dat[!grepl("^SNP_.*_[GCTA]$", dat$pedigree), ]
Regarding the code that you've tried, I'm not sure that grepl("^SNP_*_[GCTA]$") will complete without an error since you aren't passing in an x vector to grepl. See ?grepl for more info.
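To see why the original pattern finds nothing, here is a small sketch using the pedigree values from the question. In a regular expression, * is a quantifier on the preceding character rather than a shell-style wildcard, so _* means "zero or more underscores":

```r
pedigree <- c("rs2745406", "rs6939431", "SNP_DPB1_33156641",
              "SNP_DPB1_33156664_G", "SNP_DPB1_33156664_A",
              "SNP_DPB1_33156664_T")

# `_*` only allows underscores between "SNP" and the final "_[GCTA]"
glob_like <- grepl("^SNP_*_[GCTA]$", pedigree)

# `.*` allows any characters in between
regex_ok <- grepl("^SNP_.*_[GCTA]$", pedigree)

which(glob_like)
which(regex_ok)
```

The logical-index form dat[!grepl(...), ] is also safer than dat[-multisnp, ]: when multisnp is empty, the negative index returns zero rows instead of the full data frame.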

R Subset Dataset Using Regular Expression

Is there a way to make the R code below run quicker (i.e. vectorized to avoid use of for loops)?
My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a single column (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains any of the names in the second data frame. Sorry for the brutal explanation.
Example (Data frame 1):
x y
Doggy 1
Hello 2
Hi Dog 3
Zebra 4
Example (Data frame 2)
z
Hello
Dog
So in the above example I want to keep rows 1,2,3 but NOT 4. Since "Dog" appears in "Doggy" and "Hi Dog". And "Hello" appears in "Hello". Exclude row four since no part of "Hello" or "Dog" appears in "Zebra".
Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.
x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
dat <- as.data.frame(cbind(x,y))
names(dat) <- c("x","y")
z <- as.data.frame(c("Hello", "Dog"))
names(z) <- c("z")
dat$flag <- NA
for(j in 1:length(z$z)){
  for(i in 1:dim(dat)[1]){
    if ( is.na(dat$flag[i])==TRUE ) {
      dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
    } else {
      if (dat$flag[i]==0) {
        dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
      } else {
        if (dat$flag[i]==1) {
          dat$flag[i]==1
        }
      }
    }
  }
}
dat1 <- subset(dat, flag==1)
dat1

Try this:
dat[grep(paste(z$z, collapse = "|"), dat$x), ]
or
subset(dat, grepl(paste(z$z, collapse = "|"), x))
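Spelled out on the example data, the approach looks roughly like this. The second variant is an addition of mine, not part of the answer: it matches each name as a fixed string and combines the results with OR, which may be preferable if the names could contain regex metacharacters.

```r
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"),
                  y = 1:4, stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# One alternation pattern, "Hello|Dog", checked against every row at once
keep <- grepl(paste(z$z, collapse = "|"), dat$x)

# Alternative: one fixed-string pass per name, combined with OR
keep_fixed <- Reduce(`|`, lapply(z$z, grepl, x = dat$x, fixed = TRUE))

dat1 <- dat[keep, ]
dat1
```

Either way the loop over rows disappears, since grepl is already vectorized over dat$x.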

This question inspired a boolean text search function (%bs%) in the qdap package and thus I thought I'd share the approach to this question:
library(qdap)
dat[dat$x %bs% paste(z$z, collapse = "OR"), ]
In this case no less typing but if multiple or/and statements are involved this may be a useful approach.
