regular expression string between two [ ] in R - regex

I am stuck on regular expressions yet again but this time in R.
The problem I am facing is that I have a vector, and I would like to extract the string between two [ ] for each row in the vector. However, sometimes there is more than one series of [ ] in a statement, so I end up recovering every bracketed string in each row. In all cases I just need the first instance of the string in the [ ], not the second or later instances. The example dataframe I have is:
comp541_c0_seq1 gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1 gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]
The code I have been using, which recovers the string and the index and makes a vector in the new dataframe, is:
pattern <- "\\[\\w*\\s\\w*]"
match<- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)
The structure of the dataframe that I am using is:
'data.frame': 67911 obs. of 6 variables:
$ Column1 : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
$ Description : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...
So the problem with my pattern match is that it returns a vector (Species) where some of the rows have:
[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned
What I would like is:
[Glycine max]
[Solanum lycopersicum]
I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?
Thanks in advance.

I think this example should illuminate your problem:
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"
mat <- regexpr(pattern,txt)
#[1] 1 1 -1
#attr(,"match.length")
#[1] 14 15 -1
txt[mat != -1] <- regmatches(txt, mat)
txt
#[1] "[Bracket text]" "[Bracket text1]" "No brackets in here"
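Or, if you want to do it all in one go and return NA values for non-matches, fill a vector of NAs. (ifelse(mat != -1, regmatches(txt, mat), NA) gives the same result here, but only because the non-match comes last: regmatches drops non-matches, so its result is shorter than txt and is recycled by ifelse.)
out <- rep(NA_character_, length(txt))
out[mat != -1] <- regmatches(txt, mat)
out
#[1] "[Bracket text]" "[Bracket text1]" NA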
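Applied to the descriptions in the question, the same first-match idea could look like the sketch below. Note the assumptions: desc is a hypothetical stand-in for data$Description (abridged from the two rows shown, plus an NA, stored as a factor as in the str() output), and the pattern "\\[[^]]+]" is my own choice, not the asker's. It accepts anything between the brackets rather than exactly two words, in case some species names are longer.

```r
# Hypothetical stand-in for data$Description (a factor in the question).
desc <- factor(c(
  "PREDICTED: uncharacterized protein LOC100809655 [Glycine max]",
  "[Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]",
  NA
))
desc_chr <- as.character(desc)

# "[", then one or more characters that are not "]", then "]".
# regexpr() (unlike gregexpr()) reports only the first match per string.
m <- regexpr("\\[[^]]+]", desc_chr)

# Start from NA and fill in only the rows that matched.
species <- rep(NA_character_, length(desc_chr))
hit <- !is.na(m) & m != -1
species[hit] <- regmatches(desc_chr, m)
species
```

The result has one entry per row, so it can be assigned straight to a column such as data$Species.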

Using the base-R facilities for string manipulation is just making life hard for yourself. Use rebus to create the regular expression, and stringi (or stringr) to get the matches.
library(rebus)
library(stringi)
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here") # thanks, thelatemail
pattern <- OPEN_BRACKET %R%
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf) %R%
"]"
stri_extract_first_regex(txt, pattern)
## [1] "[Bracket text]" "[Bracket text1]" NA
I suspect that you probably don't want to keep those square brackets. Try this variant:
pattern <- OPEN_BRACKET %R%
capture(
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf)
) %R%
"]"
stri_match_first_regex(txt, pattern)[, 2]
## [1] "Bracket text" "Bracket text1" NA
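For comparison, here is a base R sketch of the same capture, in case installing packages is not an option. This is not the method from the answer above; it assumes regexec() with a pattern equivalent to the two-word one used in this thread.

```r
txt <- c("[Bracket text]", "[Bracket text1] and [Bracket text2]", "No brackets in here")

# regexec() reports the first match together with its capture groups.
m <- regexec("\\[(\\w+\\s+\\w+)]", txt)
groups <- regmatches(txt, m)

# Element 1 of each result is the whole match and element 2 the first group;
# non-matching strings give character(0), which is mapped to NA here.
inner <- vapply(
  groups,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
inner
```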

Related

Remove everything except period and numbers from string regex in R

I know there are many questions on stack overflow regarding regex but I cannot accomplish this one easy task with the available help I've seen. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
                a                    b
1 Los Angeles, CA c(34.0522, 118.2437)
2    New York, NY  c(40.7128, 74.0059)
3    San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the period (i.e. remove "c", ")" and "("). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.
If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
                a                 b
1 Los Angeles, CA 34.0522, 118.2437
2    New York, NY  40.7128, 74.0059
3    San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
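One reason the attempts in the question stop short is that str_replace replaces only the first match in each string, while str_replace_all replaces every match. The base R pair sub and gsub behaves the same way, so a small sketch on the question's data can show the difference:

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# Anything that is not a digit, a period, a comma or a space.
pattern <- "[^0-9., ]"

# sub() removes only the first offending character (the leading "c").
first_only <- sub(pattern, "", b)

# gsub() removes every offending character.
all_matches <- gsub(pattern, "", b)

first_only
all_matches
```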
Try this
gsub("[c()]", "", df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parsing it, then creating the string you want.
vapply(
b,
function(bi)
{
toString(eval(parse(text = bi)))
},
character(1)
)
Here is another option with str_extract_all from stringr. Extract the numeric parts into a list with str_extract_all, convert them to numeric, rbind the list elements, and cbind the result with the first column of 'df':
library(stringr)
cbind(df[1], do.call(rbind,
lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))
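If stringr is not available, roughly the same thing can be sketched in base R with gregexpr and regmatches. This is an assumed equivalent rather than the code from the answer; it builds a numeric matrix with one row per element of b.

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# gregexpr() finds every run of digits and periods; regmatches() returns
# them as a list with one character vector per input string.
nums <- regmatches(b, gregexpr("[0-9.]+", b))

# Convert each piece to numeric and stack the pieces row by row.
coords <- do.call(rbind, lapply(nums, as.numeric))
coords
```

This relies on every string containing the same number of values (two here); otherwise rbind would recycle the shorter rows.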

Split words in R Dataframe column

I have a data frame with words in a column separated by single space. I want to split it into three types as below. Data frame looks as below.
Text
one of the
i want to
I want to split it into as below.
Text        split1  split2  split3
one of the  one     one of  of the
I am able to achieve 1st. Not able to figure out the other two.
my code to get split1:
new_data$split1<-sub(" .*","",new_data$Text)
Figured out the split2:
df$split2 <- gsub(" [^ ]*$", "", df$Text)
We can try with gsub. Capture one or more non-white space (\\S+) as a group (in this case there are 3 words), then in the replacement, we rearrange the backreference and insert a delimiter (,) which we use for converting to different columns with read.table.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
"\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to
data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
There might be more elegant solutions. Here are two options:
Using ngrams:
library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]),
split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>%
select(-splits)
        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to
Extract the two grams manually:
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, `[`, 1:2),
split3 = lapply(splits, `[`, 2:3)) %>%
select(-splits)
        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to
Update:
With regular expression, we can use back reference of gsub.
Split2:
gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"
Split3:
gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the" "want to"
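The dplyr versions above produce list columns. If plain character columns are wanted, a base R sketch along the same lines might look like this. It assumes every row has exactly three words; with fewer words, paste would produce the literal text "NA".

```r
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Split each row into its words once, then rebuild the pieces.
words <- strsplit(df$Text, "\\s+")

df$split1 <- vapply(words, function(w) w[1], character(1))
df$split2 <- vapply(words, function(w) paste(w[1:2], collapse = " "), character(1))
df$split3 <- vapply(words, function(w) paste(w[2:3], collapse = " "), character(1))
df
```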
This is a bit of hackish solution.
Assumption :- you are not concerned about number of spaces between two words.
> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2  \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one" "one of" "of the"
#[[2]]
#[1] "i" "i want" "want to"

Difficulty Cleaning a Data Frame Using Regexp in R

I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even though Case 8 contains 5 consecutive digits, it should be replaced by NA as well. Furthermore, a number with more or fewer than 5 digits should also be replaced by NA.
Note that the 5 consecutive digits can be any digits [0-9], but the prefixes 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' should be replaced by NA as well).
This is the code that I currently have:
# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
RightSubstring <- function(String, n) {
  substr(String, nchar(String)-n+1, nchar(String))
}
# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
  CompleteVector <- complete.cases(DataFrame, Column)
  return(DataFrame[CompleteVector, ])
}
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Remove the rows that contain NA elements.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
Thanks in advance. Hope my explanations were clear!
Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AO|VQ)-)?(\\d{5})$
Strings which match begin with an optional AO- or VQ-, followed by 5 consecutive digits, after which the string must terminate.
The following code substitutes all matching patterns with the 5 digits found, and stores NA into all non-matching patterns.
ind <- grep("^((AO|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL, value = FALSE)
Contrat$NUMERO_APPEL <- gsub("^(((AO|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[-ind] <- NA
For more reading see this SO post.
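A variation on the same idea, sketched on a plain vector: using a logical mask from grepl instead of negative indices avoids one edge case, because x[-ind] selects nothing when ind is empty (that is, when no value matches at all). The vector x below holds the sample values from the question plus one hypothetical "AO-" entry, and the non-capturing group (?: ) is my own simplification, which needs perl = TRUE.

```r
# Sample values taken from the question, plus one hypothetical "AO-" entry.
x <- c("NNA", "VQ-40989", "41993", "41993", "42597",
       "VQ-42597", "DER8", "40001-2010", "AO-12345")

# Optional fixed prefix, then exactly five digits, then end of string.
pattern <- "^(?:AO-|VQ-)?(\\d{5})$"

# TRUE where the whole string has one of the three accepted forms.
ok <- grepl(pattern, x, perl = TRUE)

# Keep the five digits where the string matched, NA everywhere else.
digits <- ifelse(ok, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
as.integer(digits)
# [1]    NA 40989 41993 41993 42597 42597    NA    NA 12345
```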
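library(dplyr)
library(stringi)
# The pattern is anchored so that values such as "40001-2010" become NA;
# column 2 of the match matrix holds the captured five digits.
df %>%
  mutate(NUMERO_APPEL.fix =
    stri_match_first_regex(NUMERO_APPEL, "^(?:AO-|VQ-)?([0-9]{5})$")[, 2] %>%
    as.numeric)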

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by an optional capital letter [A-Z]?
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
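One caveat with the sub approach: when a string does not match, sub returns it unchanged instead of NA, unlike str_extract. A sketch of one way to guard against that, using a hypothetical second element that contains no code:

```r
# The second element is a made-up example with nothing to extract.
vec <- c("_A00_A1234B_", "no_code_here")
pattern <- ".*([A-Z]\\d{4,5}[A-Z]?).*"

# Non-matching strings pass through sub() untouched.
res <- sub(pattern, "\\1", vec)

# Blank out the entries where the pattern did not match at all.
guarded <- ifelse(grepl(pattern, vec), res, NA_character_)

res
guarded
```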
Using rex to construct the regular expression may make it more understandable.
library(rex)
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes always is between the second underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=TRUE), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=TRUE))

How to prevent regmatches from dropping non-matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
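If this comes up often, the three lines above can be wrapped in a small helper. The name extract_first is hypothetical (it is not a base R function), and extra arguments such as perl = TRUE are simply passed through to regexpr.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  out <- rep(NA_character_, length(x))
  # regexpr() gives -1 for "no match" and NA for NA input.
  hit <- !is.na(m) & m != -1
  out[hit] <- regmatches(x, m)
  out
}

x <- c("abc", "def", "cba a", "aa")
res <- extract_first("a+", x)
res
# [1] "a"  NA   "a"  "aa"
```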
Use regexec instead, since it returns a list, which allows you to catch the character(0)'s before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In the typical case of interest (a single match per string, as with regexpr), each element has length 3 when a match is found and length 1 when there is no match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match. It also has a simpler function interface than regmatches().