How do I replace brackets using regular expressions in R?

I'm sure this is a really easy question. I'm fairly familiar with regex in R by now, but I just can't get my head around this one.
Suppose we have this string:
a <- c("a b . ) ] \"")
Now, all I want to do is delete the quotes, the dot, the closing parenthesis, and the closing bracket.
So, I want: "a b".
I tried:
gsub("[.\\)\"\\]]", "", a)
It doesn't work. It returns: "a b . ) ]" So nothing gets removed.
As soon as I exclude the \\] from the search pattern, it works...
gsub("[.\\)\"]", "", a)
But, of course, it doesn't remove the closing brackets!
What have I done wrong?!?
Thanks for your help!

a <- c('a b . ) ] "');
gsub('\\s*[].)"]\\s*','',a);
## [1] "a b"
When you want to include the close bracket character in a bracket expression you should always include it first within the brackets; that causes it to be taken as a character within the bracket expression, rather than as the closing delimiter of the bracket expression.
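As a minimal sketch of that rule (the variable names `stripped`, `cleaned`, and `kept` are illustrative, not from the answer above), the following removes the punctuation with `]` listed first, and shows that the same placement applies after a negating `^`:

```r
a <- 'a b . ) ] "'

# "]" listed first is a literal member of the bracket expression
stripped <- gsub('[].)"]', "", a)   # punctuation gone, spaces remain
cleaned  <- trimws(stripped)        # drop the leftover trailing spaces
print(cleaned)

# The same rule holds right after a negating "^":
# keep only "]" and lowercase letters, delete everything else
kept <- gsub("[^]a-z]", "", "a]b)c")
print(kept)
```

Here `trimws()` does the job that the `\\s*` parts of the pattern do in the answer above.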

Building on @akrun's comment:
library(stringr)
str_trim(gsub('[.]|[[:punct:]]', '\\1', a))
Replace the period in the first set of brackets with whichever punctuation marks you want to keep.

You may try this.
> gsub("\\b\\W\\b(*SKIP)(*F)|\\W", "", a, perl=T)
[1] "a b"
> gsub("\\b(\\W)\\b|\\W", "\\1", a, perl=T)
[1] "a b"

Related

R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and returns the original string unchanged. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?
You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=T)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 symbol, you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
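On the "one go or separate lines" part of the question, here is a small sketch (variable names `one_pass` and `two_pass` are illustrative) showing that a single character class and two chained calls give the same result. It makes no claim about timing, only about equivalence:

```r
text <- "[Peanut M&M\u0092s]"

# One pass: "]" first, then "[", then the unwanted code point
one_pass <- gsub("[][\u0092]", "", text)

# Two passes: brackets first, then the code point on its own
two_pass <- gsub("\u0092", "", gsub("[][]", "", text))

print(one_pass)
print(identical(one_pass, two_pass))
```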

strsplit by parentheses [duplicate]

Suppose I have a string like "A B C (123-456-789)", I'm wondering what's the best way to retrieve "123-456-789" from it.
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"
If we want to extract the digits and hyphens between the parentheses, one option is str_extract. If a string contains more than one match, use str_extract_all.
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
In the code above, regex lookarounds restrict the match to the digits and hyphens. The positive lookbehind in (?<=\\()[0-9-]+ matches a run of digits and hyphens ([0-9-]+) only when it is immediately preceded by (, so it matches in (123-456-789 but not in 123-456-789. Similarly, the positive lookahead in [0-9-]+(?=\\)) matches the run only when it is immediately followed by ), so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only when both conditions hold, as in (123-456-789), and extracts what lies between the lookarounds; it does not match cases like (123-456-789 or 123-456-789).
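The effect of combining both lookarounds can be checked in base R as well. This is a sketch that swaps stringr for regmatches and gregexpr with perl=TRUE; the vector `cases` is a made-up test input, not part of the original answer:

```r
# Hypothetical inputs: balanced, missing ")", missing "("
cases <- c("A (123-456-789)", "A (123-456-789", "A 123-456-789)")

pat <- "(?<=\\()[0-9-]+(?=\\))"

# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE
hits <- regmatches(cases, gregexpr(pat, cases, perl = TRUE))
print(hits)
```

Only the first element yields a match; the other two come back as empty character vectors.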
With strsplit you can specify the split as [()]. Placing ( and ) inside square brackets makes them literal characters; otherwise we would have to escape the parentheses ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
Or we can use stri_split from stringi which has the option to remove the empty strings as well (omit_empty=TRUE).
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the parentheses.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
data
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")
Or with sub from base R:
sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"
Explanation:
[^(]+ : matches one or more characters that are not an opening parenthesis
\\( : matches an opening parenthesis, which is just before what you want
([^)]+) : matches the part you want to capture (which is then retrieved with replacement="\\1"), namely one or more characters that are not a closing parenthesis
\\).* : matches a closing parenthesis followed by any characters, 0 or more times
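One consequence of the leading [^(]+ is that at least one character must precede the opening parenthesis. The sketch below (the names `pat_plus`, `pat_star`, and `inputs` are illustrative) compares the pattern as written with a variant using [^(]* on a string that starts with a parenthesis:

```r
pat_plus <- "[^(]+\\(([^)]+)\\).*"   # pattern from the explanation above
pat_star <- "[^(]*\\(([^)]+)\\).*"   # variant: the prefix may be empty

inputs <- c("A B C (123-456-789)", "(123-432-423)")

res_plus <- sub(pat_plus, "\\1", inputs)   # second input has no prefix, so no match
res_star <- sub(pat_star, "\\1", inputs)

print(res_plus)
print(res_star)
```

With `pat_plus` the second string is returned unchanged because sub leaves non-matching input as it is.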
Another option with look-ahead and look-behind
sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
The capture groups in sub will target your desired output:
sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"
Extra check to make sure I pass @akrun's extended example:
sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
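Note that for the last element the output above holds only the final group, because sub returns one string per input and the leading .* is greedy. If every parenthesized group is wanted, one possible approach (a sketch using a different technique, regmatches with gregexpr; the names `chunks` and `inside` are illustrative) is:

```r
str2 <- c("A B C (123-425-478) A",
          "ABC (123-423-389) GR (124-233-848) AK")

# Match each "(...)" chunk, then strip the surrounding parentheses
chunks <- regmatches(str2, gregexpr("\\([^)]*\\)", str2))
inside <- lapply(chunks, function(m) gsub("[()]", "", m))
print(inside)
```

The result is a list with one character vector per input string.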
You may try these gsub functions.
> x <- "A B C (123-456-789)"
> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"
A few more:
> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"
Try this also:
k<-"A B C (123-456-789)"
regmatches(k, gregexpr(".(\\d+).*", k))[[1]]
[1] "(123-456-789)"
With a suggestion from @Arun:
regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]
With a suggestion from @akrun:
regmatches(k, gregexpr('[0-9-]+', k))[[1]]

R Regex: Parenthesis Not Acting as Metacharacter

I am trying to split a string by the group "%in%" and the character "#". All the documentation I can find says that parentheses are metacharacters used for grouping in R regex. So the code
> strsplit('example%in%aa(bbb)aa#cdef', '[(%in%)#]', perl=TRUE)
SHOULD give me
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
That is, it should leave the parentheses in "aa(bbb)aa" alone, because the parentheses in the matching expression are not escaped. But instead it ACTUALLY gives me
[[1]]
[1] "example" "" "" "" "aa" "bbb" "aa" "cdef"
as if the parentheses were not metacharacters! What is up with this and how can I fix it? Thanks!
This is true with and without the argument perl=TRUE in strsplit.
Not sure what documentation you're reading, but the Extended Regular Expressions section in ?regex says:
Most metacharacters lose their special meaning inside a character class. ...
(Only '^ - \ ]' are special inside character classes.)
You don't need to create a character class. Just use "or" | (you likely don't need to group "%in%" either, but it shouldn't hurt anything):
> strsplit('example%in%aa(bbb)aa#cdef', '(%in%)|#', perl=TRUE)
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
There is no need to use [ or ( here, just this:
strsplit('example%in%aa(bbb)aa#cdef', '%in%|#')
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
Inside character class [], most of the characters lose their special meaning, including ().
You might want this regex instead:
'%in%|#'
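To make the difference concrete, here is a small sketch (variable names are illustrative). Inside [...] the characters form a set, so the question's pattern is the same as a class listing each character once, while alternation treats %in% as a sequence:

```r
s <- "example%in%aa(bbb)aa#cdef"

# Inside [...] order and repeats do not matter: both classes hold ( ) % i n #
as_written <- strsplit(s, "[(%in%)#]")
as_set     <- strsplit(s, "[()%in#]")
same_split <- identical(as_written, as_set)
print(same_split)

# Alternation splits on the whole sequence "%in%" or on a single "#"
by_sequence <- strsplit(s, "%in%|#")[[1]]
print(by_sequence)
```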

String Editing in R - Trouble with Parentheses

I'm editing some strings in R, and I would like to delete everything that is in parentheses from a string. The problem is, I'm not very savvy with regular expressions, and it seems that any time I want to use gsub to mess with parentheses, it doesn't work or doesn't yield the correct result.
Any hints? I have a feeling it's a solvable problem. Might there be a function that I can use that isn't gsub?
Ex. Strings: "abc def (foo) abc (def)" should be stripped to "abc def abc"
If the only way to do this is to specify what's in the parentheses, that would be fine as well.
Just another way:
x <- "abc def (foo) abc (def)"
gsub(" *\\(.*?)", "", x)
You need to escape the ( with a \ in regular expressions, and in an R string the backslash itself must be escaped, giving \\(. The pattern then matches anything after the ( in a non-greedy manner (.*?, that is, .* followed by ?), up to the next ) (which you don't have to escape).
Parentheses are usually special characters in regular expressions, and also in those used by R. You have to escape them with the backslash \. The trouble is that the backslash needs to be escaped in R strings as well, with a second backslash, which leads to the following rather clumsy construction:
gsub(" *\\([^)]*\\) *", " ", "abc def (foo) abc (def)")
Careful with spaces, these are not handled correctly by my gsub call.
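One way to deal with the leftover spaces, as a sketch rather than a general solution (it assumes the parentheses are not nested; `no_parens` and `cleaned` are illustrative names), is to remove the whitespace in front of each group and trim the result:

```r
x <- "abc def (foo) abc (def)"

# Remove each "(...)" together with any whitespace just before it
no_parens <- gsub("\\s*\\([^)]*\\)", "", x)

# Trim in case a group sat at the very start of the string
cleaned <- trimws(no_parens)
print(cleaned)
```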
The bracketX function in the qdap package was designed for this problem:
library(qdap)
x <- "abc def (foo) abc (def)"
bracketX(x, "round")
## [1] "abc def abc"

Separate a sentence into words and endmarks

I want to break a sentence apart into words and end marks (assume all other punctuation has been removed). I've written a working function to break string(s) apart as described but I think the part:
unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = T), substring(x, nchar(x), nchar(x))))
is a kludge that could be done better without the substring call, by splitting on spaces and just before the end mark using some kind of alternation (|), but I don't know how to achieve this. Any direction would be appreciated.
breaker <- function(string) {
  FUN <- function(x) {
    unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = TRUE),
             substring(x, nchar(x), nchar(x))))
  }
  lapply(string, FUN)
}
#EXAMPLES
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
Here is a regex pattern that'll do the whole job on its own. It will match (and thus allow strsplit() to split the string) either at a space or right before one of the sentence-ending punctuation marks.
pat <- "[[:space:]]|(?=[.!?])"
The first half of the pattern matches space characters, and any match will cause strsplit() to 'eat up' the matched characters when it splits the string. The second half of the pattern (the part inside of the (?=...)) matches sentence-ending punctuation. It is an example of a 'zero-width positive lookahead assertion' (see ?regexp for details), and as such, will not lead strsplit() to 'eat up' the matching punctuation.
For your example vectors, you don't even need the call to lapply():
breaker <- function(X) {
  strsplit(X, "[[:space:]]|(?=[.!?])", perl=TRUE)
}
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
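A different route to the same tokens, sketched here as an alternative rather than as part of the answer above, is to match the pieces instead of splitting between them. The function name `tokens` is hypothetical; it uses regmatches with gregexpr and needs no lookahead:

```r
# Hypothetical helper: a token is a run of word characters
# (letters, digits, quotes) or a single end mark
tokens <- function(x) {
  regmatches(x, gregexpr("[[:alnum:]'\"]+|[.!?]", x))
}

y <- c("I'm liking it!", "How much do you like it?")
out <- tokens(y)
print(out)
```

Like the strsplit version, this returns a list with one character vector per sentence.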
You can also use scan_tokenizer() and MC_tokenizer() from the tm package:
> library(tm)
> ?MC_tokenizer
> MC_tokenizer("what are the number of words in this sentence?")
[1] "what" "are" "the" "number" "of" "words" "in"
[8] "this" "sentence"