equivalent regular expression to remove all punctuations - regex

In R, to remove punctuation from a string, I can do this:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]]','',x)
[1] "agstudy"
This is smart, but it gives me no tight control over which punctuation characters are removed (imagine I want to keep some symbols in my string). How can I rewrite this gsub in a more explicit way without forgetting any symbol, something like this:
gsub('[#,:?!*$/{}\\&]','',x,perl=FALSE)
EDIT
The difficulty is how to write a regular expression (preferably in R) that removes all punctuation characters from x but keeps, for example, only #:
"a#gstudy"

Using a negative lookahead assertion:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('(?!#)[[:punct:]]','',x, perl=TRUE)
# [1] "a#gstudy"
This in essence tests each character twice, asking once from the preceding intercharacter space whether the next character is something other than a "#" and then, from the character itself, whether it is a punctuation symbol. If both tests are true, a match is registered and the character is removed.
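The same idea extends to keeping several symbols at once: list them in a bracket expression inside the lookahead. A minimal sketch, assuming x as defined above (the choice of #, & and { as keepers is only for illustration):

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# Keep '#', '&' and '{'; remove every other punctuation character.
# The lookahead rejects the keepers before [[:punct:]] is tried.
kept_several <- gsub('(?![#&{])[[:punct:]]', '', x, perl = TRUE)
kept_several
# [1] "a#gst{ud&y"
```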

You can use a negated character class. For example:
\pP is the Unicode character class for punctuation characters.
\PP is all that is not a punctuation character.
[^\PP] is all that is a punctuation character.
[^\PP~] is all that is a punctuation character except tilde.
Note: you can stay in the ASCII range by using \p{PosixPunct}:
[^\P{PosixPunct}~]
or match Unicode punctuation characters plus the ASCII-range symbols that POSIX treats as punctuation, with \p{XPosixPunct}:
[^\P{XPosixPunct}~]

According to this page, the [[:punct:]] characters should include:
[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
From the R ?regex page, we also get this as verification:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
Thus, you can possibly use that as your basis for creating your own pattern, excluding the characters you want to keep.
This is messy as heck especially with two much nicer answers, but I just wanted to show the silliness I had in mind:
Create a function that looks something like this:
newPunks <- function(CHARS) {
  # All [[:punct:]] characters, regex-escaped where needed
  punks <- c("!", "\\\"", "#", "\\$", "%", "&", "'", "\\(", "\\)",
             "\\*", "\\+", ",", "-", "\\.", "/", ":", ";", "<",
             "=", ">", "\\?", "@", "\\[", "\\\\", "\\]", "\\^", "_",
             "`", "\\{", "\\|", "\\}", "~")
  # Split the characters to keep and escape them the same way as in punks
  keepers <- strsplit(CHARS, "")[[1]]
  keepers <- ifelse(keepers %in% c("\"", "$", "{", "}", "(", ")",
                                   "*", "+", ".", "?", "[", "]",
                                   "^", "|", "\\"), paste0("\\", keepers), keepers)
  # Alternation of everything that is not a keeper
  paste(setdiff(punks, keepers), collapse="|")
}
Usage:
gsub(newPunks("#"), "", x)
# [1] "a#gstudy"
gsub(newPunks(""), "", x)
# [1] "agstudy"
gsub(newPunks("&#{"), "", x)
# [1] "a#gst{ud&y"
Bleah. Time for me to go to bed....
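For comparison, the same idea can probably be written more compactly by building one bracket expression instead of an alternation. This is only a sketch: remove_punct_except is a made-up name, and it relies on perl=TRUE, where a backslash in front of any non-alphanumeric character makes it literal, so every remaining punctuation character can be escaped blindly.

```r
# Hypothetical helper (not from the answers above): remove all ASCII
# punctuation from x except the characters listed in keep.
remove_punct_except <- function(x, keep = "") {
  # The 32 ASCII punctuation characters, one per element
  punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]
  drop <- setdiff(punct, strsplit(keep, "")[[1]])
  # Escape every character to drop and wrap them in one bracket expression
  pattern <- paste0("[", paste0("\\", drop, collapse = ""), "]")
  gsub(pattern, "", x, perl = TRUE)
}

x <- 'a#,g:?s!*$t/{u}\\d\\&y'
remove_punct_except(x, "#")
# [1] "a#gstudy"
remove_punct_except(x, "&#{")
# [1] "a#gst{ud&y"
```

Note that the sketch does not guard against keep containing all 32 characters, which would leave an empty (invalid) bracket expression.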

It works exactly the same in Perl: [:punct:] is a POSIX character class that simply maps to:
[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
The equivalent Perl version would be:
my $x = 'a#,g:?s!*$t/{u}\d\&y';
$x =~ s/[[:punct:]]//g;
print $x;
__END__
agstudy

The straightforward approach is to use a lookahead or a lookbehind to match the same character twice, once to make sure it's a punctuation character, and once to make sure it's not "#".
(?=[^#])[[:punct:]]
or
(?!#)[[:punct:]]
Lookaheads and lookbehinds are a little expensive, though. Rather than using a lookaround at every position, it's more efficient to only use one once we have found a punctuation character.
[[:punct:]](?<!#)
Of course, it's even more efficient to get rid of lookarounds completely. This can be achieved through double-negation.
[^[:^punct:]#]
I haven't tested these with R, but they should at least work with perl=TRUE.
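A quick way to check the four patterns in R is to run them all against the string from the question with perl=TRUE. This is only a sketch of such a check; the names in the patterns vector are labels made up here.

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# The four patterns from above, labelled for readability
patterns <- c(
  lookahead_positive = '(?=[^#])[[:punct:]]',
  lookahead_negative = '(?!#)[[:punct:]]',
  lookbehind         = '[[:punct:]](?<!#)',
  double_negation    = '[^[:^punct:]#]'
)

# Apply each pattern with the PCRE engine and collect the results
results <- vapply(patterns, function(p) gsub(p, '', x, perl = TRUE), character(1))
results
```

If all four behave as described, every element of results should be "a#gstudy".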

Related

R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and returns the original string unchanged. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?
You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=T)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 symbol, you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
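As for why the original attempt changes nothing: the pattern "\\\\u0092" is the regex \\u0092, i.e. a literal backslash followed by the text u0092, while the string holds the single character U+0092 and no backslash at all. A small sketch of the difference (variable names are illustrative):

```r
text <- "[Peanut M&M\u0092s]"

# The original pattern looks for a backslash plus "u0092"; the string has
# no backslash in it, so there is nothing to match.
original_matches <- grepl("\\\\u0092", text)
original_matches
# [1] FALSE

# Writing the escape directly in the pattern yields the character itself.
# fixed = TRUE treats the pattern as a plain string, no regex involved.
no_apostrophe <- gsub("\u0092", "", text, fixed = TRUE)

# One pass for both brackets and the apostrophe, as in the answer above.
cleaned <- gsub("[][\u0092]", "", text)
```

On efficiency, a single gsub call over one bracket expression scans the string once, which is unlikely to be slower than separate calls, though for short strings the difference should be negligible.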

Regex with stringr:: how to find first instance of pattern

Behind this question is an effort to extract all references created by knitr and latex. Not finding another way, my thought was to read into R the .Rnw script and use a regular expression to find references -- where the latex syntax is \ref{caption referenced to}. My script has 250+ references, and some are very close to each other.
The text.1 example below works, but not the text example. I think it has to do with R chugging along to the final closing brace. How do I stop at the first closing brace and extract what preceded it to the opening brace?
library(stringr)
text.1 <- c(" \\ref{test}", "abc", "\\ref{test2}", " \\section{test3}", "{test3")
# The regular expression below looks behind for "ref{" and then grabs everything up to a closing } at the end of the string
# braces are special characters and need escaping with double backslashes for R to recognize them as literal braces
# unlist converts the list returned by str_extract_all to a vector
unlist(str_extract_all(string = text.1, pattern = "(?<=ref\\{).*(?=\\}$)"))
[1] "test" "test2"
# a more complicated string, with more than one set of braces in an element
text <- c("text \ref{?bar labels precision} and more text \ref{?table column alignment}", "text \ref{?table space} }")
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{).*(?=\\}$)"))
character(0)
The problem with text is the backslash in front of "ref" is being interpreted as a carriage return \r by the engine and R's parser; so you're trying to match "ref" but it's really (CR + "ef") ...
Also * is greedy by default, meaning it will match as much as it can and still allow the remainder of the regular expression to match. Use *? or a negated character class to prevent greediness.
unlist(str_extract_all(text, '(?<=\ref\\{)[^}]*'))
# [1] "?bar labels precision" "?table column alignment" "?table space"
As you can see, you can use a character class to match either (\r or r + "ef") ...
x <- c(' \\ref{test}', 'abc', '\\ref{test2}', ' \\section{test3}', '{test3',
'text \ref{?bar labels precision} and more text \ref{?table column alignment}',
'text \ref{?table space} }')
unlist(str_extract_all(x, '(?<=[\rr]ef\\{)[^}]*'))
# [1] "test" "test2" "?bar labels precision"
# [4] "?table column alignment" "?table space"
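The greedy-versus-lazy point can be seen in isolation with base R (regmatches/gregexpr with perl=TRUE) instead of stringr. A minimal sketch on a made-up string with correctly escaped backslashes:

```r
s <- "see \\ref{fig:a} and \\ref{tab:b}"

# Helper made up for this sketch: extract all matches of a PCRE pattern
extract_all <- function(pattern, string) {
  regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
}

# Greedy .* runs to the last closing brace
greedy  <- extract_all("(?<=ref\\{).*(?=\\})", s)
# Lazy .*? stops at the first closing brace
lazy    <- extract_all("(?<=ref\\{).*?(?=\\})", s)
# A negated character class cannot cross a closing brace at all
negated <- extract_all("(?<=ref\\{)[^}]*", s)

greedy
# [1] "fig:a} and \\ref{tab:b"
lazy
# [1] "fig:a" "tab:b"
```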
EDITED
The reason it didn't capture what is before the closing brace } is that you added an end-of-line anchor $. Remove $ and it will work.
Therefore, your new code should look like this:
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{)[^}]*(?=\\})"))

Function to turn a string/regexp into a save filename [duplicate]

I have a list of regexps which are used to produce some graphs. I'd like to save each graph with its corresponding regexp in the filename. Example:
re <- 'foo\\w{3}bar'
# ... produce a graph here and now need a filename
not_save <- paste0("pefix ", re, ".suffix")
But re needs to be cleaned of everything not allowed in filenames. I know this depends on the OS and filesystem, but I think that if it's a valid filename on Windows it's valid everywhere.
I can substitute bad characters with gsub():
not_save_enough <- gsub('[$^*|{}()/:]', '_', re, perl=TRUE)
But I don't know all the bad characters, and I don't know how to replace \ and/or [ and ]. Substituting all bad characters with _ would be sufficient. Unfortunately, things like
gsub('\Q\\E', '_', "Numbers are \d", perl=TRUE)
aren't working even with perl = TRUE and produce
Error: '\Q' is an unrecognized escape in character string starting "'/\Q"
Is there a function like make_string_save_to_use_it_as_filename()?
How to substitute \, [ and ] and other regex metacharacters in strings?
If the idea here is to replace bad characters, you may consider the POSIX class [[:punct:]]. This POSIX named class in the ASCII range matches all non-controls, non-alphanumeric, non-space characters.
!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~
So if you're wanting to replace each instance with an underscore you could do ...
fn <- gsub('[[:punct:]]', '_', 'foo\\w{3}bar')
# [1] "foo_w_3_bar"
The use of \Q and \E ensures that any character between will be matched literally and not interpreted as a metacharacter by the regular expression engine. Also in R the delimiter /.../ and g (global) mode modifier syntax is invalid. Below is an example demonstrating the correct use:
x <- '[[[(((123]'
gsub('\\Q[[[(((\\E', '[', x, perl=T)
# [1] "[123]"
If you need to use modifiers, ensure perl=TRUE is turned on and use inline modifiers i.e. (?ismx)
I think you want something like this,
> re <- 'foo\\w{3}bar'
> not_save_enough <- gsub('[$^*|{}\\[\\]()/:\\\\]', '_', re, perl=TRUE)
> not_save_enough
[1] "foo_w_3_bar"
> re <- 'foo\\w{3}bar[foo]foo(buz)kj^jkj$jhh*foo|bar/hjh'
> not_save_enough <- gsub('[$^*|{}\\[\\]()/:\\\\]', '_', re, perl=TRUE)
> not_save_enough
[1] "foo_w_3_bar_foo_foo_buz_kj_jkj_jhh_foo_bar_hjh"
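In R regex, you need four backslashes in the pattern string ("\\\\") to match one literal backslash: R's string parser reduces them to \\, which the regex engine then reads as an escaped backslash.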
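To make the backslash counting concrete, here is a small sketch, followed by a whitelist variant that sidesteps the question of which characters are bad. safe_filename is a hypothetical helper, not an existing function, and the set of allowed characters is an assumption chosen for illustration.

```r
re <- 'foo\\w{3}bar'

# "\\\\" is two characters once R has parsed the string (\\), which the
# regex engine then reads as one literal backslash.
nchar("\\\\")
# [1] 2
only_backslash <- gsub("\\\\", "_", re)

# With fixed = TRUE no regex escaping is involved, so one R-escaped
# backslash is enough.
only_backslash_fixed <- gsub("\\", "_", re, fixed = TRUE)

# Hypothetical helper: replace everything except ASCII letters, digits,
# '.', '_' and '-' with an underscore.
safe_filename <- function(s) gsub("[^A-Za-z0-9._-]", "_", s)
safe_filename(re)
# [1] "foo_w_3_bar"
```

The whitelist does not handle reserved names on Windows, and two different regexps can collapse to the same filename, so it may need a counter or hash appended in practice.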

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw twitter snippets using R but keep getting issues where there are non standard Alphanumeric chars such as the following "🏄".
I would like to take out all non [abcdefghijklmnopqrstuvwxyz0123456789] characters using gsub.
Can you use gsub to specify a replace for those items NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that the class [:alnum:] matches all your given special characters. That's why gsub("[^[:alnum:]]", "", x) doesn't work.

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
To build a regular expression from variables in R, in this scenario you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
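Put together, the function from the question might look like the sketch below. It takes the lines as an argument instead of reading abc.txt so that it can be run as is; grep_numbered and the sample lines are made up for illustration.

```r
# Hypothetical helper: return the lines that contain "{<num>} number".
grep_numbered <- function(lines, num) {
  # Escape the braces so they are matched literally
  pattern <- paste0("\\{", num, "\\} number")
  grep(pattern, lines, value = TRUE)
}

lines <- c("{1} number one", "{2} number two", "{12} number twelve")
grep_numbered(lines, 1)
# [1] "{1} number one"
grep_numbered(lines, 12)
# [1] "{12} number twelve"

# If no regex features are needed, fixed = TRUE avoids escaping altogether
fixed_hits <- grep(paste0("{", 2, "} number"), lines, value = TRUE, fixed = TRUE)
```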
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums..
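A quick check of that alternation pattern with grepl, using a few candidate sentences invented for this sketch:

```r
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse = '|'), ')\\.')

# Candidate strings are illustrative only
candidates <- c('Ben likes bananas.', 'Ben likes plums.',
                'Ben likes pears.', 'Ben likes plums')
hits <- grepl(regex, candidates)
hits
# [1]  TRUE  TRUE FALSE FALSE
```

The last candidate fails because the pattern requires the literal trailing period.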
NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (used by the stringr/stringi regex functions) have proved to handle these scenarios better; it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
examples <- 'Ben likes bananas, mangoes and plums, but not a banana.'  # illustrative input, assumed here
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend \ to each metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
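To see what regex.escape actually produces and why it matters, here is a small sketch using the same function; the a.b example is made up for illustration.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# cat() shows the characters the regex engine will receive
escaped <- regex.escape(c('cm+km', 'a.b'))
cat(escaped, "\n")
## => cm\+km a\.b

# Unescaped, the dot matches any character; escaped, only a literal dot
unescaped_hit <- grepl('a.b', 'axb')                 # TRUE
escaped_hit   <- grepl(regex.escape('a.b'), 'axb')   # FALSE
literal_hit   <- grepl(regex.escape('a.b'), 'a.b')   # TRUE
```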
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use one of the following:
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Your own boundaries, built from a lookbehind/lookahead combination and a custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
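The third approach, a custom boundary, can be sketched the same way. Here the boundary class [\w/] is an assumption picked for illustration: a match is rejected when it touches a word character or a slash, so C++/CLI is no longer matched.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')

# Custom boundaries: no word character and no '/' directly before or after
regex <- paste0('(?<![\\w/])(', paste(regex.escape(words), collapse = '|'), ')(?![\\w/])')
custom <- gsub(regex, "<<\\1>>", examples, perl = TRUE)
custom
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and C++/CLI."
```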