R regex remove all punctuation except apostrophe [duplicate] - regex

I'm trying to remove all punctuation from a string except apostrophes. Here's my example:
str2 <- "this doesn't not have an apostrophe,.!##$%^&*()"
gsub("[[:punct:,^\\']]"," ", str2 )
# [1] "this doesn't not have an apostrophe,.!##$%^&*()"
What am I doing wrong?

A "negative lookahead assertion" can be used to remove from consideration any apostrophes, before they are even tested for being punctuation characters.
gsub("(?!')[[:punct:]]", "", str2, perl=TRUE)
# [1] "this doesn't not have an apostrophe"
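As a minimal sketch of the same idea, the lookahead can be wrapped in a small helper. The name strip_punct_except and its keep argument are hypothetical, introduced here only for illustration, and keep is assumed to contain plain characters that need no escaping inside a bracket expression.

```r
# Hypothetical helper, not part of the original answer.
# Removes every punctuation character except those listed in `keep`.
strip_punct_except <- function(x, keep = "'") {
  # (?![...]) rejects the protected characters before [[:punct:]] is tried
  pattern <- paste0("(?![", keep, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

str2 <- "this doesn't not have an apostrophe,.!##$%^&*()"
strip_punct_except(str2)
# [1] "this doesn't not have an apostrophe"

# Protecting the hyphen as well (placed last so it is literal in the class)
strip_punct_except("well-known, isn't it?", keep = "'-")
# [1] "well-known isn't it"
```

The lookahead consumes nothing, so each position is first checked against the protected set and only then tested as punctuation.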

I am not sure whether you can specify "all punctuation except '" inside a single bracket expression the way you've done. I would instead negate the set of characters to keep, here lowercase letters, the apostrophe, and the space:
gsub("[^'[:lower:] ]", "", str2) # per Joshua's comment
# [1] "this doesn't not have an apostrophe"
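One caveat with the negated class is that it removes everything not listed, not just punctuation. The sketch below uses a made-up input string to show the difference between [:lower:] and [:alnum:] inside the negation; swapping in [:alnum:] is a suggestion here, not something the answer itself states.

```r
x <- "It's 2 o'clock, OK?"   # illustrative input, not from the question

# Keeps only lowercase letters, apostrophes and spaces:
# uppercase letters and digits are removed along with the punctuation
gsub("[^'[:lower:] ]", "", x)
# [1] "t's  o'clock "

# Keeps all letters and digits, apostrophes and spaces
gsub("[^'[:alnum:] ]", "", x)
# [1] "It's 2 o'clock OK"
```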

You could use:
str2 <- "this doesn't not have an apostrophe,.!##$%^&*()"
library(qdap)
strip(str2, apostrophe.remove = FALSE, lower.case = FALSE)

Related

Regex to reject only nonalphanumeric characters

If the keyword to be checked is other. It should not be preceded or followed by alphanumeric character.
spaces are allowed, \n allowed, Special characters allowed.
Not allowed - "AOther9", "noTHERX"
Allowed - "other", "\nother" , " other ", "$other/"
grepl(paste("[^a-zA-Z0-9]","other","[^a-zA-Z0-9]",sep=""),String1 , ignore.case = TRUE)
The above regex works well for all cases except the bare keyword "other", that is, when the keyword is preceded and followed by nothing.
You need to use a PCRE regex with lookarounds:
grepl(paste("(?<![a-zA-Z0-9])","other","(?![a-zA-Z0-9])",sep=""), String1, ignore.case = TRUE, perl=TRUE)
The negative lookarounds do not consume the non-alphanumeric characters, and they do not require those characters to actually be present in the string, so the keyword also matches at the very start or end of the input.
Alternatively, add a * quantifier to the negated character classes and anchor the pattern with ^ (start) and $ (end):
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")
grep('^[^a-z0-9]*other[^a-z0-9]*$', String1, ignore.case = TRUE, value = TRUE)
# [1] "other" "\nother" " other " "$other/"
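The two answers above are not interchangeable, which a short side-by-side run makes visible. This is a sketch using the question's test vector plus one extra input ("say other things") that is an assumption added here to show where the approaches differ.

```r
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")

lookaround <- "(?<![a-zA-Z0-9])other(?![a-zA-Z0-9])"
anchored   <- "^[^a-z0-9]*other[^a-z0-9]*$"

# Lookarounds: the keyword must not touch an alphanumeric on either side
grepl(lookaround, String1, ignore.case = TRUE, perl = TRUE)
# [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

# The anchored pattern additionally forbids alphanumerics anywhere else
# in the string, so a keyword inside a longer sentence is rejected
grepl(lookaround, "say other things", ignore.case = TRUE, perl = TRUE)
# [1] TRUE
grepl(anchored, "say other things", ignore.case = TRUE)
# [1] FALSE
```

Which behavior is wanted depends on whether the keyword may appear inside a longer text or must be the only word in the string.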

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering whether there is a way to do it with just one call to gsub:
trim <- function (x) {
  x <- gsub("^\\s+|\\s+$", "", x)
  gsub("\\s+", " ", x)
}
testString <- "  This  is  a   test.  "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind:
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation:
## "^ *"      matches any leading spaces
## "(?<= ) "  The general form is (?<=a)b:
##            matches a "b" (a space here)
##            that is preceded by "a" (another space here)
## " *$"      matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString <- "  This  is  a   test.  "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub calls, though this is less elegant than the previous answers:
> gsub("\\s+"," ",gsub("^\\s+|\\s+$","",testString))
[1] "This is a test."
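The single-call lookahead pattern and the two-call version agree on runs of plain spaces but can differ on mixed whitespace, because the lookahead keeps the last character of each run rather than replacing the run with a space. The sketch below illustrates this; the function names trim_one_call and trim_two_calls and both input strings are assumptions made for the example.

```r
# Hypothetical wrappers around the patterns discussed above
trim_one_call  <- function(x) gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl = TRUE)
trim_two_calls <- function(x) gsub("\\s+", " ", gsub("^\\s+|\\s+$", "", x))

s <- "  This   is  a test.  "
trim_one_call(s)
# [1] "This is a test."
trim_two_calls(s)
# [1] "This is a test."

# A run that ends in a tab: the single call keeps the tab,
# the two-call version normalizes it to a space
trim_one_call("a \tb")
# [1] "a\tb"
trim_two_calls("a \tb")
# [1] "a b"
```

If the input may contain tabs or newlines and a single space is wanted between words, the two-step version is the safer choice.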

strsplit inconsistent with gregexpr

A regex suggested in a comment on my answer to this question should give the desired result with strsplit, but it does not, even though it seems to correctly match only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
In other words, at each iteration ^ matches the beginning of the remaining string, from which the previous match and everything to its left have already been removed.
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
The same behavior has consequences when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
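To make the documented loop concrete, here is a rough re-implementation in plain R. It is only a sketch: split_like_doc is a hypothetical name, it handles a single string, and it does not reproduce strsplit's special treatment of zero-length matches (it stops with an error instead).

```r
# Sketch of the algorithm quoted above; not the actual strsplit source
split_like_doc <- function(x, pattern) {
  out <- character(0)
  repeat {
    if (!nzchar(x)) break
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start != -1) {
      if (len == 0) stop("zero-length matches are not handled in this sketch")
      out <- c(out, substr(x, 1, start - 1))      # text left of the match
      x <- substr(x, start + len, nchar(x))       # drop match and all to its left
    } else {
      out <- c(out, x)
      break
    }
  }
  out
}

split_like_doc("12345", "^.")
# [1] "" "" "" "" ""

x <- "123,34,56,78,90"
split_like_doc(x, "^\\w+\\K,|,(?=\\w+$)")
# [1] "123" "34"  "56"  "78"  "90"

# One way to split only at the positions gregexpr reports
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)
regmatches(x, m, invert = TRUE)[[1]]
# [1] "123"      "34,56,78" "90"
```

Because the string is truncated after every match, the ^ branch succeeds again on each remainder, which is why every comma ends up being a split point. The regmatches call with invert = TRUE works on the untruncated string and therefore respects the two positions found by gregexpr.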

Removing multiple commas and trailing commas using gsub

This question is very similar to Removing multiple spaces and trailing spaces using gsub, except that I'd like to apply it to commas instead of spaces.
For example, I'd like a function TrimCommas to turn x into y:
x <- c("a,b,c", ",a,b,,c", ",,,a,,,b,c,,,")
# y <- TrimCommas(x) # presumably
y <- c("a,b,c", "a,b,c", "a,b,c")
The solution for spaces was gsub("^ *|(?<= ) | *$", "", x, perl=T), so I'm hoping comparing the solution for this will help explain some regex fundamentals as well.
Isn't the solution pretty similar?
x <- c("a,b,c", ",a,b,,c", ",,,a,,,b,c,,,")
gsub("^,*|(?<=,),|,*$", "", x, perl=T)
# [1] "a,b,c" "a,b,c" "a,b,c"
There are three parts to the regex ^,*|(?<=,),|,*$:
^,* -- this matches 0 or more commas at the beginning of the string
(?<=,), -- this is a positive lookbehind that checks whether there is a comma behind a comma, so it matches the second , in ,,
,*$ -- this matches 0 or more commas at the end of the string
As you can see all of the above are substituted with nothing.
You can make this generic to any character (" ", ",", etc.) with this function:
TrimMult <- function(x, char = " ") {
  return(gsub(paste0("^", char, "*|(?<=", char, ")", char, "|", char, "*$"),
              "", x, perl = TRUE))
}
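A short usage sketch for the generic function follows. Since char is pasted directly into the pattern, it is assumed here that a character with a special regex meaning must be passed already escaped; the dotted input string is made up for the example.

```r
TrimMult <- function(x, char = " ") {
  gsub(paste0("^", char, "*|(?<=", char, ")", char, "|", char, "*$"),
       "", x, perl = TRUE)
}

x <- c("a,b,c", ",a,b,,c", ",,,a,,,b,c,,,")
TrimMult(x, char = ",")
# [1] "a,b,c" "a,b,c" "a,b,c"

# A regex metacharacter has to be escaped by the caller,
# otherwise "." would be treated as "any character"
TrimMult("..a..b..", char = "\\.")
# [1] "a.b"
```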

R : regular expression for 'not followed by' not working

I needed to retain the words enclosed in brackets and delete the others in the following string.
(a(b(c)d)(e)f)
So what I expected would be (((c))(e)).
To delete a, b, d, f, I tried the 'not followed by' regex.
str <- "(a(b(c)d)(e)f)"
gsub("([a-z]+)(?!\\))", "", str) #(sub. anything that isn't followed by a ")" )
The error message says my regex is invalid. As far as I can see, the brackets in the second part of the regex, "(?!\))", don't match properly. In my editor, the first "(" is paired with the immediately following ")", which is not meant to be the closing bracket (the one to its right is). That is the only error I could make out in my regex. Can you please tell me what is actually wrong? Is there any other way to do this?
In two steps, and using positive lookaheads:
str1 <- gsub("\\([a-z](?=\\()", "\\(", str, perl=TRUE)
str1
# [1] "(((c)d)(e)f)"
str2 <- gsub("\\)[a-z](?=\\))", "\\)", str1, perl=TRUE)
str2
# [1] "(((c))(e))"
Edit: it turns out you can even do it in one:
gsub("([\\(\\)])[a-z](?=\\1)", "\\1", str, perl=TRUE)
# [1] "(((c))(e))"
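On the "invalid regex" message itself: lookaheads are Perl-style (PCRE) syntax, so the pattern only compiles with perl = TRUE. The sketch below shows that the original pattern then runs but still keeps d and f, and adds one possible alternative that combines a lookbehind with a lookahead. That second pattern is a suggestion made here, not one of the posted answers.

```r
s <- "(a(b(c)d)(e)f)"

# The question's idea with perl = TRUE: letters followed by ")" survive,
# which includes d and f as well as c and e
gsub("[a-z]+(?!\\))", "", s, perl = TRUE)
# [1] "(((c)d)(e)f)"

# Remove a letter unless it is both preceded by "(" and followed by ")"
gsub("(?<!\\()[a-z]|[a-z](?!\\))", "", s, perl = TRUE)
# [1] "(((c))(e))"
```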
I agree with @Dason's comment:
st <- "(a(b(c)d)(e)f)"
while (grepl("\\([a-z]+\\(", st)) {
  st <- sub("\\([a-z]+(\\(.+\\))[a-z]+\\)", "\\1", st)
}
> st
[1] "(c)(e)"
Written on my iPad :-)