Function to turn a string/regexp into a save filename [duplicate] - regex

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 8 years ago.
I have a list of regexps which are used to produce some graphs. I'd like to save each graph with its corresponding regexp in the filename. Example:
re <- 'foo\\w{3}bar'
# ... produce a graph here and now need a filename
not_save <- paste0("prefix ", re, ".suffix")
But re needs to be cleaned of everything not allowed in filenames. I know this is OS- and filesystem-dependent, but I think a filename that is valid on Windows is valid everywhere.
I can substitute bad characters with gsub():
not_save_enough <- gsub('[$^*|{}()/:]', '_', re, perl=TRUE)
But I don't know all the bad characters and don't know how to replace \ and/or [ and ]. Substituting every bad character with _ would be sufficient. Unfortunately, things like
gsub('\Q\\E', '_', "Numbers are \d", perl=TRUE)
aren't working, even with perl = TRUE, and produce
Error: '\Q' is an unrecognized escape in character string starting "'/\Q"
Is there a function like make_string_save_to_use_it_as_filename()?
How do I substitute \, [, ] and other regex metacharacters in strings?

How to substitute \, [, ] and other metacharacters in strings?
If the idea here is to replace bad characters, you may consider the POSIX class [[:punct:]]. In the ASCII range, this POSIX named class matches all non-control, non-alphanumeric, non-space characters.
!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~
So if you want to replace each instance with an underscore, you could do:
fn <- gsub('[[:punct:]]', '_', 'foo\\w{3}bar')
# [1] "foo_w_3_bar"
The use of \Q and \E ensures that any character between them is matched literally and not interpreted as a metacharacter by the regular expression engine. Also note that in R the /.../ delimiter and the g (global) mode-modifier syntax are invalid. Below is an example demonstrating the correct use:
x <- '[[[(((123]'
gsub('\\Q[[[(((\\E', '[', x, perl=T)
# [1] "[123]"
If you need to use modifiers, ensure perl=TRUE is turned on and use inline modifiers, i.e. (?ismx).
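For example, a case-insensitive search can be written with the (?i) inline modifier (a minimal sketch; the sample vector is made up):

```r
x <- c("Foo", "FOO", "bar")
# (?i) turns on case-insensitive matching for the rest of the pattern
grepl("(?i)foo", x, perl = TRUE)
# [1]  TRUE  TRUE FALSE
```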

I think you want something like this:
> re <- 'foo\\w{3}bar'
> not_save_enough <- gsub('[$^*|{}\\[\\]()/:\\\\]', '_', re, perl=TRUE)
> not_save_enough
[1] "foo_w_3_bar"
> re <- 'foo\\w{3}bar[foo]foo(buz)kj^jkj$jhh*foo|bar/hjh'
> not_save_enough <- gsub('[$^*|{}\\[\\]()/:\\\\]', '_', re, perl=TRUE)
> not_save_enough
[1] "foo_w_3_bar_foo_foo_buz_kj_jkj_jhh_foo_bar_hjh"
In an R string literal, you need four backslashes ("\\\\") to match one literal backslash: the string parser turns "\\\\" into \\, which the regex engine reads as an escaped \.
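A quick illustration of the backslash counting (the sample string is made up):

```r
s <- "a\\b"           # the string parser turns "a\\b" into 3 characters: a \ b
nchar(s)              # 3
gsub("\\\\", "_", s)  # "\\\\" reaches the regex engine as \\ -> one literal \
# [1] "a_b"
```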

Related

R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and results in the same line as the original. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?
You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=T)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 symbol, you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
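The bracket-placement rule can be verified directly (a minimal sketch):

```r
# ']' placed first in the class needs no escaping; '[' needs none inside a class
gsub("[][]", "", "[Peanut M&Ms]")
# [1] "Peanut M&Ms"
```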

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 5 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
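In base R the same lookbehind needs perl = TRUE; a minimal sketch with the cab/bed/debt example:

```r
# only the 'b' that is preceded by 'a' is replaced
gsub("(?<=a)b", "B", c("cab", "bed", "debt"), perl = TRUE)
# [1] "caB"  "bed"  "debt"
```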
There should be something more intuitive, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x) {
  a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
The syntax is:
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
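Here is a self-contained base-R sketch of the same (*SKIP)(*F) idea with a made-up string, using gregexpr instead of stringr:

```r
x <- "skipme I1-I3 keep I2-I3"
# the first branch consumes everything from 'skipme' to the first 'I3' and then
# fails; (*SKIP) prevents the engine from retrying inside that region, so only
# matches outside it survive
m <- gregexpr("skipme.*?I3(*SKIP)(*F)|I\\d-I3", x, perl = TRUE)
regmatches(x, m)
# [[1]]
# [1] "I2-I3"
```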
Here's another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"

equivalent regular expression to remove all punctuations

In R, to remove punctuation from a string, I can do this:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]]','',x)
[1] "agstudy"
This is smart, but I don't have tight control over which punctuation is removed (imagine I want to keep some symbols in my string). How can I rewrite this gsub in a more explicit way without forgetting any symbol, something like this:
gsub('[#,:?!*$/{}\\&]','',x,perl=FALSE)
EDIT
The difficulty I encountered is how to write a regular expression (preferably in R) that removes all punctuation characters from x and keeps only #, for example:
"a#gstudy"
Using a negative lookahead assertion:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('(?!#)[[:punct:]]','',x, perl=TRUE)
# [1] "a#gstudy"
This in essence tests each character twice, asking once from the preceding intercharacter space whether the next character is something other than a "#" and then, from the character itself, whether it is a punctuation symbol. If both tests are true, a match is registered and the character is removed.
You can use a negated character class, example:
\pP is the Unicode character class for punctuation characters.
\PP is everything that is not a punctuation character.
[^\PP] is everything that is a punctuation character.
[^\PP~] is any punctuation character except the tilde.
Note: you can stay in the ASCII range by using \p{PosixPunct}:
[^\P{PosixPunct}~]
or match Unicode punctuation characters, with the same ASCII-range exception, using \p{XPosixPunct}:
[^\P{XPosixPunct}~]
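These classes can be checked in base R (perl = TRUE is required for the \p / \P property syntax; the sample string is made up):

```r
x <- "a#,g~b!"
# [^\PP~] = "not (non-punctuation or ~)" = any punctuation character except ~
gsub("[^\\PP~]", "", x, perl = TRUE)
# [1] "ag~b"
```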
The documentation indicates that the [[:punct:]] class should include:
[-!"#$%&'()*+,./:;<=>?#[\\\]^_`{|}~]
From the R ?regex page, we also get this as verification:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
Thus, you can possibly use that as your basis for creating your own pattern, excluding the characters you want to keep.
This is messy as heck, especially next to two much nicer answers, but I just wanted to show the silliness I had in mind:
Create a function that looks something like this:
newPunks <- function(CHARS) {
  punks <- c("!", "\\\"", "#", "\\$", "%", "&", "'", "\\(", "\\)",
             "\\*", "\\+", ",", "-", "\\.", "/", ":", ";", "<",
             "=", ">", "\\?", "@", "\\[", "\\\\", "\\]", "\\^", "_",
             "`", "\\{", "\\|", "\\}", "~")
  keepers <- strsplit(CHARS, "")[[1]]
  keepers <- ifelse(keepers %in% c("\"", "$", "{", "}", "(", ")",
                                   "*", "+", ".", "?", "[", "]",
                                   "^", "|", "\\"),
                    paste0("\\", keepers), keepers)
  paste(setdiff(punks, keepers), collapse="|")
}
Usage:
gsub(newPunks("#"), "", x)
# [1] "a#gstudy"
gsub(newPunks(""), "", x)
# [1] "agstudy"
gsub(newPunks("&#{"), "", x)
# [1] "a#gst{ud&y"
Bleah. Time for me to go to bed....
It works exactly the same in Perl; [:punct:] is a POSIX character class that simply maps to:
[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
The equivalent Perl version would be:
my $x = 'a#,g:?s!*$t/{u}\d\&y';
$x =~ s/[[:punct:]]//g;
print $x;
__END__
agstudy
The straightforward approach is to use a lookahead or a lookbehind to match the same character twice, once to make sure it's a punctuation character, and once to make sure it's not "#".
(?=[^#])[[:punct:]]
or
(?!#)[[:punct:]]
Lookaheads and lookbehinds are a little expensive, though. Rather than using a lookaround at every position, it's more efficient to use one only when we find a punctuation character.
[[:punct:]](?<!#)
Of course, it's even more efficient to get rid of lookarounds completely. This can be achieved through double-negation.
[^[:^punct:]#]
I haven't tested these with R, but they should at least work with perl=TRUE.
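For instance, the double-negation version checks out in R:

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
# [^[:^punct:]#]: a character that is NOT (a non-punctuation character or '#'),
# i.e. any punctuation character other than '#'
gsub("[^[:^punct:]#]", "", x, perl = TRUE)
# [1] "a#gstudy"
```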

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw Twitter snippets using R but keep getting issues with characters outside the standard alphanumeric set, such as "🏄".
I would like to take out all non [abcdefghijklmnopqrstuvwxyz0123456789] characters using gsub.
Can you use gsub to specify a replace for those items NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that the class [:alnum:] matches all your given special characters. That's why gsub("[^[:alnum:]]", "", x) doesn't work.

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
  toRead <- readLines("abc.txt")
  n <- as.character(num)
  x <- grep("{"n"} number", toRead, value = TRUE)
}
While grep-ing, I want the num passed to the function to dynamically create the pattern to be searched. How can this be done in R? The text file has a number and text on every line.
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenarion, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
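sprintf is a common alternative to paste0 here (a sketch; the value of n is made up):

```r
n <- 5
pattern <- sprintf("\\{%s} number", n)  # "\\{5} number": the literal '{' is escaped
grepl(pattern, "{5} number found here")
# [1] TRUE
```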
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums..
NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (the stringr/stringi regex engine) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used by base R regex functions.
Oftentimes, you will want to build a pattern from a list of words that should be matched exactly, as whole words. Here, a lot depends on the type of boundaries and on whether the words can contain special regex metacharacters or whitespace.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. put a \ before each metacharacter:
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own, using a lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."