Substring content between quotation marks - regex

In a DF I have column entries of different length as the following:
tmp_ezg.\"dr_HE_10691\" , tmp_ezg.\"dr_MV_0110200016\" , tmp_ezg.\"dr_MV_0111290017\" etc.
How can I best substring what's in between the quotation marks?
My idea:
substring(DF$name, 10)
Since the content of the quotation marks has different lengths I cannot provide substring() a value where to stop.
Is there a possibility to substring only between certain symbols (i.e. quotation marks)?

To separate the content between the quotation marks (assuming there are exactly two in each entry), you can split the string on the quotation mark. In R source the pattern is written as "\\\"", which is the regex \" (a backslash-escaped quotation mark):
y <- strsplit(x, split = "\\\"")
If all entries end with a quotation mark, this will give you a list of entries with two values, and the second value in each entry is your string.
[[1]]
[1] "tmp_ezg." "dr_HE_10691"
[[2]]
[1] "tmp_ezg." "dr_MV_0110200016"
[[3]]
[1] "tmp_ezg." "dr_MV_0111290017"
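To get from that list to a plain character vector, one option is to pick the second piece of each entry. This is only a sketch: the names parts and ids are made up for illustration, and it assumes every entry really has the prefix."content" shape shown in the question.

```r
# Sample values copied from the question.
x <- c('tmp_ezg."dr_HE_10691"',
       'tmp_ezg."dr_MV_0110200016"',
       'tmp_ezg."dr_MV_0111290017"')

# Split on the literal quotation mark (fixed = TRUE avoids regex escaping).
parts <- strsplit(x, split = "\"", fixed = TRUE)

# Take the second piece of every entry; vapply guarantees a character vector.
ids <- vapply(parts, function(p) p[2], character(1))
print(ids)
```

If an entry has no quotation marks at all, p[2] is NA for that entry, which is probably the safest thing to get back.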

For example
x <- c('tmp_ezg.\"dr_HE_10691\"' ,
'tmp_ezg.\"dr_MV_0110200016\"' ,
'tmp_ezg.\"dr_MV_0111290017\"')
res <- sub('.*?"([^"]+)"', "\\1", x)
print(res, quote = FALSE)
# [1] dr_HE_10691      dr_MV_0110200016 dr_MV_0111290017
... if I'm not mistaken.
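Since the question is about a data frame column, the same sub() idea can be applied to the column directly. A minimal sketch, assuming the column is called name as in the question; the new column id and the slightly stricter, anchored pattern are my own choices, not part of the answer above.

```r
# A small stand-in for the data frame in the question.
DF <- data.frame(
  name = c('tmp_ezg."dr_HE_10691"',
           'tmp_ezg."dr_MV_0110200016"',
           'tmp_ezg."dr_MV_0111290017"'),
  stringsAsFactors = FALSE
)

# Keep only what sits between the first pair of quotation marks.
# sub() is vectorised, so no loop over rows is needed.
DF$id <- sub('^[^"]*"([^"]*)".*$', "\\1", DF$name)
print(DF$id)
```

Rows without a pair of quotation marks are returned unchanged by sub(), so it may be worth checking for those with grepl() first.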

Related

Positive look-behind in R that includes non-ascii characters

I am trying to extract the first group of non-whitespace characters that follows an Arabic string for each text in a set of about 2,100 total texts. Some of these texts contain the string, while others do not. This would be a very easy task, using str_extract from the stringr package, if the string were in English. However, for some reason this function doesn't work when using an Arabic string within the look-behind pattern:
library(stringr)
test_texts <- c(
"My text كلمة containing some Arabic",
"My text كلمة again containing some Arabic",
"My text that doesn't contain any Arabic"
)
str_extract(test_texts, "(?<=text )\\S+")
# [1] "كلمة" "كلمة" "that"
str_extract(test_texts, "(?<=containing )\\S+")
# [1] "some" "some" NA
str_extract(test_texts, "(?<=كلمة )\\S+") #returns NAs even though string is there
# [1] NA NA NA
Note that this works if I'm not using a look-behind pattern:
str_extract(test_texts, "كلمة \\S+")
# [1] "كلمة containing" "كلمة again" NA
Why does the Arabic mess things up only when using a look-behind pattern?
I am using R version 3.2.3, on OS X 10.11.3, and stringr version 1.0.0.
It seems there is some issue with how str_extract processes the right-to-left text inside the positive lookbehind. As a workaround, you may use str_match with a regex that has a capturing group around the subpattern whose value you need:
> res <- str_match(test_texts, "كلمة +(\\S+)")
> res[,2]
[1] "containing" "again" NA
This solution allows matching the non-whitespace chunk even if there are more than 1 space after the Arabic word.
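The same capture-group idea can be tried without stringr. This is a base R sketch using regexec() and regmatches(); the helper name res is arbitrary, and I have not checked it against the R and OS versions mentioned in the question.

```r
test_texts <- c(
  "My text كلمة containing some Arabic",
  "My text كلمة again containing some Arabic",
  "My text that doesn't contain any Arabic"
)

# regexec() returns the full match plus each capture group per element.
m <- regmatches(test_texts, regexec("كلمة +(\\S+)", test_texts))

# Element 1 is the whole match, element 2 the captured word;
# texts without a match come back as character(0), so map those to NA.
res <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
              character(1))
print(res)
```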
You can grep for non-ascii characters like this:
str_extract(test_texts, "[^\001-\177]+")
[1] "كلمة" "كلمة" NA
str_extract(test_texts, "(?<=[^\001-\177] )\\S+")
[1] "containing" "again" NA
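And this seems to work... just adding brackets to what you had. This may not be sufficient either: a bracket expression matches a single character from the set, so the lookbehind only checks that the character before the space is one of those letters, in any order and from any word.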
str_extract(test_texts, "(?<=[كلمة] )\\S+")
[1] "containing" "again" NA
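To see why the bracket version is looser than it looks, here is an ASCII stand-in (the word "word" and the sample text are invented for the demonstration). In base R with perl = TRUE, the bracketed lookbehind fires after any single letter from the set:

```r
x <- "bad dog"

# Literal lookbehind: requires the full text "word " before the match.
strict <- regmatches(x, regexpr("(?<=word )\\S+", x, perl = TRUE))

# Bracketed lookbehind: requires only ONE of w, o, r, d before the space.
loose <- regmatches(x, regexpr("(?<=[word] )\\S+", x, perl = TRUE))

print(strict)  # no match, because "word " never occurs
print(loose)   # matches, because "bad" ends in "d"
```

So the bracket trick may return words that follow some unrelated word ending in one of those letters.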

strsplit by parentheses [duplicate]

Suppose I have a string like "A B C (123-456-789)", I'm wondering what's the best way to retrieve "123-456-789" from it.
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"
If we want to extract the digits with - between the parentheses, one option is str_extract. If there are multiple patterns within a string, use str_extract_all.
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
The code above uses regex lookarounds to extract the digits and the -. The positive lookbehind in (?<=\\()[0-9-]+ matches a run of digits and - only when it is immediately preceded by (, so it matches in (123-456-789 but not in 123-456-789. Similarly, the lookahead in [0-9-]+(?=\\)) matches the run only when it is immediately followed by ), so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only when both conditions hold, as in (123-456-789), and it returns just the text between the lookarounds; cases like (123-456-789 or 123-456-789) do not match.
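A quick way to convince yourself of the "both conditions" behaviour is to run the pattern on a complete and two incomplete inputs. This sketch uses base R (regexpr with perl = TRUE) rather than stringr, and the three test strings are made up:

```r
pat <- "(?<=\\()[0-9-]+(?=\\))"

x <- c("A B C (123-456-789)",   # both parentheses present
       "A B C (123-456-789",    # closing parenthesis missing
       "A B C 123-456-789)")    # opening parenthesis missing

# Which inputs match at all?
has_match <- grepl(pat, x, perl = TRUE)

# regmatches() returns only the matched pieces, without the parentheses.
extracted <- regmatches(x, regexpr(pat, x, perl = TRUE))

print(has_match)
print(extracted)
```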
With strsplit you can specify the split as [()]. Placing ( and ) inside square brackets makes them literal characters; otherwise we would have to escape the parentheses ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
Or we can use stri_split from stringi which has the option to remove the empty strings as well (omit_empty=TRUE).
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the brackets.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
data
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")
or with sub from base R:
sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"
Explanation:
[^(]+ : matches one or more characters other than an opening parenthesis
\\( : matches an opening parenthesis, which is just before what you want
([^)]+) : matches the pattern you want to capture (which is then retrieved in replacement="\\1"), which is one or more characters other than a closing parenthesis
\\).* : matches a closing parenthesis followed by anything, 0 or more times
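Two edge cases of this pattern are worth checking, sketched below with invented inputs. First, sub() returns the input unchanged when nothing matches. Second, [^(]+ needs at least one character before the opening parenthesis, so a string that starts with ( is not matched; relaxing it to [^(]* is one possible fix (my variation, not part of the answer).

```r
pat_plus <- "[^(]+\\(([^)]+)\\).*"   # pattern from the answer
pat_star <- "[^(]*\\(([^)]+)\\).*"   # relaxed variant (assumption)

# No parentheses: sub() hands the input back untouched.
unchanged <- sub(pat_plus, "\\1", "no parentheses here")

# String starting with "(": the original pattern does not match it.
leading_plus <- sub(pat_plus, "\\1", "(123-432-423)")
leading_star <- sub(pat_star, "\\1", "(123-432-423)")

# Returning NA for non-matching input instead of the original string.
x <- c("A B C (123-456-789)", "no parentheses here")
safe <- ifelse(grepl(pat_star, x), sub(pat_star, "\\1", x), NA_character_)

print(unchanged); print(leading_plus); print(leading_star); print(safe)
```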
Another option with look-ahead and look-behind
sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
The capture groups in sub will target your desired output:
sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"
Extra check to make sure I pass @akrun's extended example:
sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
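Note that the last element of str2 contains two parenthesised groups and the greedy .* keeps only the last one. If that is not what you want, here is a sketch of two alternatives in base R: a lazy version that keeps the first group, and gregexpr() to get all of them. The variable names are arbitrary.

```r
s <- "ABC (123-423-389) GR (124-233-848) AK"

# Greedy: the leading .* swallows as much as possible, so the LAST group wins.
last_one <- sub(".*\\((.*)\\).*", "\\1", s)

# Lazy quantifiers (perl = TRUE): the FIRST group wins.
first_one <- sub(".*?\\((.*?)\\).*", "\\1", s, perl = TRUE)

# All groups: match everything between "(" and ")" with lookarounds.
all_groups <- regmatches(s, gregexpr("(?<=\\()[^)]+(?=\\))", s, perl = TRUE))[[1]]

print(last_one); print(first_one); print(all_groups)
```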
You may try these gsub functions.
> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"
Few more...
> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"
Try this also:
k<-"A B C (123-456-789)"
regmatches(k,gregexpr("*.(\\d+).*",k))[[1]]
[1] "(123-456-789)"
With suggestion from @Arun:
regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]
With suggestion from @akrun:
regmatches(k, gregexpr('[0-9-]+', k))[[1]]

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one or more. I also added the ".*" to match the beginning part of the string so it gets replaced as well; otherwise it would be left untouched.
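The two changes can be seen separately by applying them one at a time. A small sketch on the vector from the question (the names step0 to step2 are only for illustration):

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: [[:alpha:]] matches exactly ONE letter, so nothing matches.
step0 <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# Adding "+" makes the group match, but the leading number is still kept.
step1 <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# Adding ".*" in front also consumes the text before the bracket.
step2 <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

print(step0); print(step1); print(step2)
```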
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See the demo. Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.
Here's a solution that doesn't fit my parameters of single regex that works with strsplit
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
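For completeness, the gregexpr/substr approach above can be wrapped in a small function so that n and the delimiter are parameters. This is only a sketch: the name split_at_nth is made up, it treats the delimiter as a fixed string, and it returns the input unchanged when there are fewer than n delimiters (a choice of mine, not something the question specifies).

```r
split_at_nth <- function(x, delim, n) {
  # Positions of every occurrence of the delimiter (literal match).
  locs <- gregexpr(delim, x, fixed = TRUE)[[1]]
  # No match at all, or fewer than n occurrences: nothing to split on.
  if (locs[1] == -1 || n > length(locs)) return(x)
  loc <- locs[n]
  len <- attr(locs, "match.length")[n]
  # Everything before the nth delimiter, and everything after it.
  c(substr(x, 1, loc - 1), substr(x, loc + len, nchar(x)))
}

x <- "I like_to see_how_too"
print(split_at_nth(x, "_", 1))
print(split_at_nth(x, "_", 2))
```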
Here is another solution using the gsubfn package and some regex-fu. To split on the nth occurrence of the delimiter, change the number inside the range quantifier, {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R is using PCRE, you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
Demo on regex101
That was the plan. However, strsplit seems to think differently.
Actual execution
Demo on ideone.com
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work on a stronger assertion \A
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior hints that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder.
This discards all state from the previous matches, so the regex is applied to the remainder from a clean state. As a result, it is impossible to make strsplit stop at the first match while still splitting at the right place. There is not even a parameter in strsplit to limit the number of splits.
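The \K pattern is not wasted, though: sub() replaces only the first match, so one possible workaround is to swap the nth underscore for a sentinel and then split on that. A sketch, assuming the control character "\001" never occurs in the data; the variable names are arbitrary.

```r
x <- "I like_to see_how_too but_it_seems to_be_impossible"
n <- 3  # split at the 3rd underscore

# Same regex as above, with (n - 1) inserted into the quantifier.
pat <- sprintf("^[^_]*(?:_[^_]*){%d}\\K_", n - 1)

# sub() touches only the first match, i.e. exactly the nth underscore.
marked <- sub(pat, "\001", x, perl = TRUE)

# Now a plain fixed-string split does the job.
pieces <- strsplit(marked, "\001", fixed = TRUE)[[1]]
print(pieces)
```

This is two calls rather than the single strsplit regex the question asks for, so it does not strictly meet the stated requirement.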
Rather than splitting, you can match to get the two parts of the string.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
RegEx Demo
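In R, the two capture groups of that regex can be pulled out with regexec() and regmatches(). A sketch, assuming an R version where regexec() accepts perl = TRUE; n and the variable names are only for illustration.

```r
x <- "I like_to see_how_too"
n <- 2  # split on the 2nd underscore

# Build the pattern with (n - 1) in the quantifier, as described above.
pat <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", n - 1)

# regexec() returns the full match followed by the capture groups;
# dropping the first element leaves the two halves.
halves <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]][-1]
print(halves)
```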
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that the regex above cannot use a variable-length lookbehind.
RegEx Demo2
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
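If gsubfn is not available, roughly the same idea can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is not the gsubfn approach itself, just an approximation of it, and it again assumes "\001" does not occur in the input.

```r
s <- "aa_bb_cc_dd_ee_ff"
k <- c(2, 4)  # split at the 2nd and 4th underscore

# Locate every underscore.
m <- gregexpr("_", s, fixed = TRUE)

# Keep most underscores as they are; swap the chosen ones for a sentinel.
repl <- rep("_", length(m[[1]]))
repl[k] <- "\001"
regmatches(s, m) <- list(repl)

# Split on the sentinel only.
pieces <- strsplit(s, "\001", fixed = TRUE)[[1]]
print(pieces)
```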

Splitting strings with unescaped separator in R

I have to read a file with R, where a variable number of columns is separated by the | character. However, if it is preceded by a \ it should not be considered a separator.
I first thought something like strsplit(x, "[^\\][|]") would work, but the problem here is that the character before each pipe is "consumed":
> strsplit("word1|word2|word3\\|aha!|word4", "[^\\][|]")
[[1]]
[1] "word" "word" "word3\\|aha" "word4"
Can anyone suggest a way to do this? Ideally it should be vectorized since the files in question are very large.
I believe this works; using Anirudh's downvoted answer (not sure why the downvote, it doesn't work but the regex was correct)
strsplit(x, "(?<!\\\\)[|]", perl=TRUE)
## > strsplit(x, "(?<!\\\\)[|]", perl=TRUE)
## [[1]]
## [1] "word1" "word2" "word3\\|aha!" "word4"
You need to use a zero-width assertion (a negative lookbehind):
(?<!\\\\)[|]
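After splitting you will probably also want to turn the escaped \| back into a plain |. A vectorised sketch; the second sample line is invented, and it assumes a backslash is only ever used to escape a pipe (an escaped backslash directly before a separator would need a more careful parser).

```r
# In R source, "\\|" is the two characters backslash and pipe.
x <- c("word1|word2|word3\\|aha!|word4",
       "a\\|b|c")

# Split on pipes that are NOT preceded by a backslash.
fields <- strsplit(x, "(?<!\\\\)[|]", perl = TRUE)

# Unescape: replace the literal two-character sequence \| with |.
fields <- lapply(fields, function(f) gsub("\\|", "|", f, fixed = TRUE))
print(fields)
```

Because strsplit() is vectorised over x, whole files read with readLines() can be passed in one call.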