pattern matching in R using grepl - regex

I have a dataframe dat like this
P pedigree cas
1 M rs2745406 T
2 M rs6939431 A
3 M SNP_DPB1_33156641 G
4 M SNP_DPB1_33156664_G P
5 M SNP_DPB1_33156664_A A
6 M SNP_DPB1_33156664_T A
I want to exclude all rows where the pedigree column starts with SNP_ and ends with either G, C, T, or A (_[GCTA]). In this case, this would be rows 4,5,6.
How can I achieve this in R? I have tried
multisnp <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree)=="TRUE")
new_dat <- dat[-multisnp,]
My multisnp vector is empty, but I can't figure out how to fix it so that it matches the pattern I want. I think it is my wildcard * usage that is wrong.

You can use the following with .*? (match everything in non greedy way):
multisnp <- which(grepl("^SNP_.*?_[GCTA]$", dat$pedigree))
^^^

You can subset dat like this
new_dat <- dat[!grepl("^SNP_.*_[GCTA]$", dat$pedigree), ]
Regarding the code that you've tried, I'm not sure that grepl("^SNP_*_[GCTA]$") will complete without an error since you aren't passing in an x vector to grepl. See ?grepl for more info.

Related

Need to extract 4 spaces of text before the occurrence of a word that appears in a column in a df, and may occur several times per row

I need to extract text (4 characters) before the occurrence of the word "exception" per row in a column of my dataframe. For example, see two lines of my data below:
MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)
MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014);
As you can see, "exception" occurrs more than one time per line, is sometimes capitalized and sometimes not, and has different text before. I need to extract the "FMV", "MM", and "ial" that come before in each case. The goal is to extract as a version of the following (comma separating would be fine but not needed):
"FMVMM"
"FMVial"
I am planning on making all text lower case for simplicity, but I cannot find a regex to extract the 4 characters of text I need after that. Any recommendations?
You basically need strsplit, substr and nchar:
t1 <- "1.MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "2.MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "
f <- function(x){
tmp <- strsplit(x, "[Ee]xception")[[1]]
ret <- array(dim = length(tmp) - 1)
for(i in 1:length(ret)){
ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
}
return(paste(ret, collapse = ","))
}
f(t1) #gives "FMV , MM "
f(t2) #gives "FMV ,ial "
Avoiding the loop would be better but for now, this should work.
Edit by Qaswed: Improved the function (shorter and does not need tolower any more).
Edit by TigeronFire:
#Qaswed, thank you for your guidance - the answer, however, poses another problem. t1 and t2 are only two lines on a dataframe 10000 rows long. I attempted to add the column logic to the function you built a few different ways, but I always received the error message:
"Error in strsplit(BOSSMWF_practice$Documents, "[Ee]xception") : non-character argument"
I tried the following with reference to dataframe column BOSSMWF_practice$Documents:
f <- function(x){
tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
ret <- array(dim = length(tmp) - 1)
for(i in 1:length(ret)){
ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
}
return(paste(ret, collapse = ","))
}
AND:
f <- function(x){
BOSSMWF_practice$tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
BOSSMWF_practice$ret <- array(dim = length(BOSSMWF_practice$tmp) - 1)
for(i in 1:length(BOSSMWF_practice$ret)){
BOSSMWF_practice$ret[i] <- substr(BOSSMWF_practice$tmp[i], start = nchar(BOSSMWF_practice$tmp[i]) - 3, stop = nchar(BOSSMWF_practice$tmp[i]))
}
return(paste(ret, collapse = ","))
}
I attempted to run the function on my applicable column using both function setups
BOSSMWF_practice$Funct <- f(BOSSMWF_practice$Documents)
But I always received the above error message. Can you take your advice one step further and indicate how to apply this to a dataframe and place the results in a new column?
Edit by Qaswed:
#TigeronFire you should have added a comment to my answer or editing your question, but not editing my question. To your comment:
#if your dataset looks something like this:
df <- data.frame(variable_name = c(t1, t2))
#...use
apply(df, 1, FUN = f)
#note: there was an error in f. You need strsplit(x, ...) and not strsplit(t1, ...).

Use a regular expression extract substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
< head(foo)
Old.Clone.Name New.Clone.Name File
1 A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.
Hello if you are reading the data as a table file then foo[3] is a list and str_extract does not accept lists, only strings, then you should use lapply to extract the match of every element.
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"
str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.
The following code will repeat what you asked (just copy and paste to your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NULL is hidden: R stores entries by column, hence foo[3] is the 3rd row and 1st column of foo matrix/data frame. To quote the third column, you may need to use foo[,3]. or foo<-data.frame(foo); foo[[3]].

R: replacing values in string all at once

I have a data frame that looks like this:
USequence
# 1 GATCAGATC
# 2 ATCAGAC
I'm trying to create a function that would replace all the G's with C's, A's with T's, C's with G's, and T's with A's:
USequence
# 1 CTAGTCTAG
# 2 TAGTCTG
This is what I have right now, the function accepts k, a data frame with a column named USequence.
conjugator <- function(k) {
k$USequence <- str_replace_all(k$USequence,"A","T")
k$USequence <- str_replace_all(k$USequence,"T","A")
k$USequence <- str_replace_all(k$USequence,"G","C")
k$USequence <- str_replace_all(k$USequence,"C","G")
}
However the obvious problem would be that this is doesn't replace the characters at once, but rather in steps which would not return the desired result. Any suggestions? Thanks
You could use chartr
df1$USequence <- chartr('GATC', 'CTAG', df1$USequence)
df1$USequence
#[1] "CTAGTCTAG" "TAGTCTG"
Or
library(gsubfn)
gsubfn('[GATC]', list(G='C', A='T', T='A', C='G'), df1$USequence)
#[1] "CTAGTCTAG" "TAGTCTG"

R agrep: how to match with more than 1 substitution

I'm trying to match a string to a vector of strings:
a <- c('abcde', 'abcdf', 'abcdg')
agrep('abcdh', a, max.distance=list(substitutions=1))
# [1] 1 2 3
agrep('abchh', a, max.distance=list(substitutions=2))
# character(0)
I didn't expect the latter result as substituting two characters from
the pattern makes the pattern identical to the vector elements. This does, however, work with all instead of substitutions:
agrep('abchh', a, max.distance=list(all=2))
# [1] 1 2 3
What do I need to change to match with more than 1 substitution allowed? Is substitution just a broken option? Thanks.
Note: this question is essentially the same as this one: https://stat.ethz.ch/pipermail/r-help/2011-June/281731.html, but that was never answered.
I did not realize that the questions were that old, anyway:
The function needs cost to be appropiate. As ping said, you must set the maximum number of match cost; in your example:
a <- c('abcde', 'abcdf', 'abcdg')
agrep('abcdh', a, max.distance = list(cost = 1))
[1] 1 2 3
agrep('abchh', a, max.distance = 2)
[1] 1 2 3
Now, if you set cost the program can do insertions, deletions and substitutions. If you want only evaluate substitutions, then:
agrep('abhhh', a,
max.distance=list(cost=3, substitutions=3,
deletions=0, insertions=0))
[1] 1 2 3

Generating a vector of the number of items in each list item

I have a list containing 98 items. But each item contains 0, 1, 2, 3, 4 or 5 character strings.
I know how to get the length of the list and in fact someone has asked the question before and got voted down for presumably asking such an easy question.
But I want a vector that is 98 elements long with each element being an integer from 0 to 5 telling me how many character strings there are in each list item.
I was expecting the following to work but it did not.
lapply(name.of.list,length())
From my question you will see that I do not really know the nomeclature of lists and items. Feel free to straighten me out.
Farrel, I do not exactly follow as 'item' is not an R type. Maybe you have a list of length 98 where each element is a vector of character string?
In that case, consider this:
R> fl <- list(A=c("un", "deux"), B=c("one"), C=c("eins", "zwei", "drei"))
R> lapply(fl, function(x) length(x))
$A
[1] 2
$B
[1] 1
$C
[1] 3
R> do.call(rbind, lapply(fl, function(x) length(x)))
[,1]
A 2
B 1
C 3
R>
So there is you vector of the length of your list, telling you how many strings each list element has. Note the last do.call(rbind, someList) as we got a list back from lapply.
If, on the other hand, you want to count the length of all the strings at each list position, replace the simple length(x) with a new function counting the characters:
R> lapply(fl, function(x) { sapply(x, function(y) nchar(y)) } )
$A
un deux
2 4
$B
one
3
$C
eins zwei drei
4 4 4
R>
If that is not want you want, maybe you could mock up some example input data?
Edit:: In response to your comments, what you wanted is probably:
R> do.call(rbind, lapply(fl, length))
[,1]
A 2
B 1
C 3
R>
Note that I pass in length, the name of a function, and not length(), the (displayed) body of a function. Because that is easy to mix up, I simply apply almost always wrap an anonymous function around as in my first answer.
And yes, this can also be done with just sapply or even some of the **ply functions:
R> sapply(fl, length)
A B C
2 1 3
R> lapply(fl, length)
[1] 2 1 3
R>
All this seems very complicated - there is a function specifically doing what you were asking for:
lengths #note the plural "s"
Using Dirks sample data:
fl <- list(A=c("un", "deux"), B=c("one"), C=c("eins", "zwei", "drei"))
lengths(fl)
will return a named integer vector:
A B C
2 1 3
The code below accepts a list and returns a vector of lengths:
x = c("vectors", "matrices", "arrays", "factors", "dataframes", "formulas",
"shingles", "datesandtimes", "connections", "lists")
xl = list(x)
fnx = function(xl){length(unlist(strsplit(x, "")))}
lv = sapply(x, fnx)