R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and returns the same string as the original. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?

You can use the [^[:ascii:]] construct with a Perl-compatible regex to remove the non-ASCII characters from your input, and you can add [][] as an alternative to also match the square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=TRUE)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 symbol, you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches one symbol: here, either ], [, or \u0092. If you place ] at the start of the character class, it does not need escaping, and [ never needs escaping inside a character class (in R regex and in some other flavors, too).
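As an aside, the original attempt fails because the pattern "\\\\u0092" means a literal backslash followed by the text u0092, whereas the string contains the single character U+0092. A minimal check in base R:
text <- "[Peanut M&M\u0092s]"
grepl("\\\\u0092", text) # FALSE: looks for a backslash followed by "u0092"
grepl("\u0092", text)    # TRUE: the pattern string contains the actual U+0092 character
gsub("\u0092", "", text) # [1] "[Peanut M&Ms]"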


Regex with stringr:: how to find first instance of pattern

Behind this question is an effort to extract all references created by knitr and LaTeX. Not finding another way, my thought was to read the .Rnw script into R and use a regular expression to find references, where the LaTeX syntax is \ref{label}. My script has 250+ references, and some are very close to each other.
The text.1 example below works, but not the text example. I think it has to do with R chugging along to the final closing brace. How do I stop at the first closing brace and extract what preceded it to the opening brace?
library(stringr)
text.1 <- c(" \\ref{test}", "abc", "\\ref{test2}", " \\section{test3}", "{test3")
# in the regex below, look behind for "ref{", then grab everything up to the lookahead for } at the end of the string
# braces are special characters and require escaping with double backslashes for the regex to treat them as literal braces
# unlist converts the list returned by str_extract_all to a vector
unlist(str_extract_all(string = text.1, pattern = "(?<=ref\\{).*(?=\\}$)"))
[1] "test" "test2"
# a more complicated string, with more than one set of braces in an element
text <- c("text \ref{?bar labels precision} and more text \ref{?table column alignment}", "text \ref{?table space} }")
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{).*(?=\\}$)"))
character(0)
The problem with text is that the backslash in front of "ref" is interpreted by R's parser as the escape sequence \r, a carriage return; so you are trying to match "ref" but the string really contains (CR + "ef") ...
Also * is greedy by default, meaning it will match as much as it can and still allow the remainder of the regular expression to match. Use *? or a negated character class to prevent greediness.
unlist(str_extract_all(text, '(?<=\ref\\{)[^}]*'))
# [1] "?bar labels precision" "?table column alignment" "?table space"
As you can see, you can use the character class [\rr] to match either a carriage return or a literal r, so both spellings of "ref" are handled ...
x <- c(' \\ref{test}', 'abc', '\\ref{test2}', ' \\section{test3}', '{test3',
'text \ref{?bar labels precision} and more text \ref{?table column alignment}',
'text \ref{?table space} }')
unlist(str_extract_all(x, '(?<=[\rr]ef\\{)[^}]*'))
# [1] "test" "test2" "?bar labels precision"
# [4] "?table column alignment" "?table space"
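To isolate the greediness point, here is a minimal comparison on a hypothetical string containing two references:
library(stringr)
s <- "\\ref{a} and \\ref{b}"
unlist(str_extract_all(s, "(?<=ref\\{).*(?=\\})"))  # greedy: "a} and \\ref{b"
unlist(str_extract_all(s, "(?<=ref\\{).*?(?=\\})")) # lazy: "a" "b"
unlist(str_extract_all(s, "(?<=ref\\{)[^}]*"))      # negated class: "a" "b"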
EDITED
The reason it didn't capture what precedes the closing brace } is that you added the end-of-line anchor $. Remove the $ and it will work.
Therefore, your new code should look like this:
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{)[^}]*(?=\\})"))

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the captured value is "abc" or "ab". (The (?: makes the second group non-capturing.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers, using base R only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in R) :-)
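A sketch of that idea in base R (the symbol vector here is hypothetical):
vec <- c("ZNH3", "ZN", "CLZ24")
m <- regexec("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", vec)
sapply(regmatches(vec, m), `[[`, 2) # first captured group
# [1] "ZN" "ZN" "CL"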

equivalent regular expression to remove all punctuations

In R, to remove punctuation from a string, I can do this:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]]','',x)
[1] "agstudy"
This is smart, but I don't have tight control over which punctuation characters are removed (imagine I want to keep some symbols in my string). How can I rewrite this gsub more explicitly, without forgetting any symbol, something like this:
gsub('[#,:?!*$/{}\\&]','',x,perl=FALSE)
EDIT
The difficulty I encountered is how to write the regular expression (preferably in R) that removes all punctuation characters from x but keeps, for example, #:
"a#gstudy"
Using a negative lookahead assertion:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('(?!#)[[:punct:]]','',x, perl=TRUE)
# [1] "a#gstudy"
This in essence tests each character twice: once, from the position before the character, asking whether the next character is something other than #, and once more asking whether the character itself is a punctuation symbol. If both tests are true, a match is registered and the character is removed.
You can use a negated character class, for example:
\pP is the Unicode character class for punctuation characters.
\PP is everything that is not a punctuation character.
[^\PP] is everything that is a punctuation character.
[^\PP~] is every punctuation character except the tilde.
Note: you can stay in the ASCII range by using \p{PosixPunct}:
[^\P{PosixPunct}~]
or use Unicode punctuation characters, with the exception still in the ASCII range, via \p{XPosixPunct}:
[^\P{XPosixPunct}~]
Reading the documentation indicates that the [[:punct:]] characters should include:
[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
From the R ?regex page, we also get this as verification:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
Thus, you can possibly use that as your basis for creating your own pattern, excluding the characters you want to keep.
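For instance, a sketch that rebuilds the class by hand with # left out (written for perl=TRUE so the escaped metacharacters behave as shown):
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
pat <- '[!"$%&\'()*+,\\-./:;<=>?@\\[\\\\\\]^_`{|}~]' # [[:punct:]] minus "#"
gsub(pat, '', x, perl = TRUE)
# [1] "a#gstudy"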
This is messy as heck, especially with two much nicer answers, but I just wanted to show the silliness I had in mind:
Create a function that looks something like this:
newPunks <- function(CHARS) {
  punks <- c("!", "\\\"", "#", "\\$", "%", "&", "'", "\\(", "\\)",
             "\\*", "\\+", ",", "-", "\\.", "/", ":", ";", "<",
             "=", ">", "\\?", "@", "\\[", "\\\\", "\\]", "\\^", "_",
             "`", "\\{", "\\|", "\\}", "~")
  keepers <- strsplit(CHARS, "")[[1]]
  # escape keepers that are regex metacharacters so they match the entries in punks
  keepers <- ifelse(keepers %in% c("\"", "$", "{", "}", "(", ")",
                                   "*", "+", ".", "?", "[", "]",
                                   "^", "|", "\\"),
                    paste0("\\", keepers), keepers)
  # everything in punks that is not kept, as one alternation pattern
  paste(setdiff(punks, keepers), collapse = "|")
}
Usage:
gsub(newPunks("#"), "", x)
# [1] "a#gstudy"
gsub(newPunks(""), "", x)
# [1] "agstudy"
gsub(newPunks("&#{"), "", x)
# [1] "a#gst{ud&y"
Bleah. Time for me to go to bed....
It works exactly the same in Perl; [:punct:] is a POSIX character class that simply maps to:
[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
The equivalent Perl version would be:
my $x = 'a#,g:?s!*$t/{u}\d\&y';
$x =~ s/[[:punct:]]//g;
print $x;
__END__
agstudy
The straightforward approach is to use a lookahead or a lookbehind to match the same character twice: once to make sure it's punctuation, and once to make sure it's not "#".
(?=[^#])[[:punct:]]
or
(?!#)[[:punct:]]
Lookaheads and lookbehinds are a little expensive, though. Rather than using a lookaround at every position, it's more efficient to use one only when we have found a punctuation character.
[[:punct:]](?<!#)
Of course, it's even more efficient to get rid of lookarounds completely. This can be achieved through double-negation.
[^[:^punct:]#]
I haven't tested these with R, but they should at least work with perl=TRUE.
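For what it's worth, a quick sketch of the last two patterns in R (both rely on PCRE features, hence perl=TRUE):
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]](?<!#)', '', x, perl = TRUE) # [1] "a#gstudy"
gsub('[^[:^punct:]#]', '', x, perl = TRUE)    # [1] "a#gstudy"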

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw Twitter snippets using R but keep hitting issues with non-standard alphanumeric characters such as "🏄".
I would like to remove all characters outside [abcdefghijklmnopqrstuvwxyz0123456789] using gsub.
Can you use gsub to specify a replacement for those characters NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that the class [:alnum:] matches all of your given special characters, which is why gsub("[^[:alnum:]]", "", x) doesn't work.

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
  toRead <- readLines("abc.txt")
  n <- as.character(num)
  x <- grep("{"n"} number", toRead, value = TRUE)
}
While grep-ing, I want the num passed to the function to dynamically create the pattern to be searched. How can this be done in R? The text file has a number and text on every line.
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""), toRead, value = TRUE)
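An equivalent sketch with sprintf, escaping the braces (more on that right below):
pattern <- sprintf("\\{%s\\} number", n)
grep(pattern, toRead, value = TRUE)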
In order to build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums..
NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (the stringr/stringi regex functions) have proved to handle these scenarios better; it is recommended to use those engines rather than the default TRE library used by base R regex functions.
Oftentimes, you will want to build a pattern from a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and on whether the words can contain special regex metacharacters or whitespace.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend a \ to each metacharacter:
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own, using a lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticated patterns (see the sketch after the example below).
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
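And for the third option, a sketch of custom boundaries that treats only commas, whitespace and string edges as break points (the boundary choice here is just an illustration):
# Custom boundaries: comma, whitespace, or string edge
regex <- paste0('(?<![^,\\s])(', paste(regex.escape(words), collapse='|'), ')(?![^,\\s])')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and C++/CLI."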