Convert punctuation to space - regex

I have a bunch of strings with punctuation in them that I'd like to convert to spaces:
"This is a string. In addition, this is a string (with one more)."
would become:
"This is a string In addition this is a string with one more "
I can go through and do this manually with the stringr package (str_replace_all()), one punctuation symbol at a time (, / . / ! / ( / ) / etc.), but I'm curious whether there's a faster way, presumably using a regex.
Any suggestions?

x <- "This is a string. In addition, this is a string (with one more)."
gsub("[[:punct:]]", " ", x)
[1] "This is a string  In addition  this is a string  with one more  "
See ?gsub for doing quick substitutions like this, and ?regex for details on the [[:punct:]] class, i.e.
‘[:punct:]’ Punctuation characters:
‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { |
} ~’.
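
A one-for-one replacement leaves two spaces wherever a punctuation mark sat next to a space. If single spaces are wanted instead, one possible variation (a sketch, not part of the original answer) is to treat runs of punctuation and whitespace as a unit and trim the ends with trimws():

```r
x <- "This is a string. In addition, this is a string (with one more)."

# Replace each run of punctuation and/or whitespace with a single space,
# then drop the leading/trailing space that may be left over.
cleaned <- trimws(gsub("[[:punct:][:space:]]+", " ", x))
print(cleaned)
```

Whether the trailing space should be kept depends on what the cleaned strings are used for; drop the trimws() call to keep it.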

Have a look at ?regex:
library(stringr)
str_replace_all(x, "[[:punct:]]", " ")
[1] "This is a string  In addition  this is a string  with one more  "

Related

VB.NET - Regex.Replace error with [ character

I want to remove some characters from a textbox. It works, but when I try to replace the "[" character it gives an error. Why?
Return Regex.Replace(html, "[", "").Replace(",", " ").Replace("]", "").Replace(Chr(34), " ")
When I delete the "[", "").Replace( part, it works fine:
Return Regex.Replace(html, ",", " ").Replace("]", "").Replace(Chr(34), " ")
The problem is that the [ character has a special meaning in regex (it opens a character class), so it must be escaped to be matched literally. To escape it, add a \ before the character.
Therefore this would be your proper regex code: Return Regex.Replace(html, "\[", "").Replace(",", " ").Replace("]", "").Replace(Chr(34), " ")
Because [ is a reserved character in regex patterns. When a pattern is meant to match literal text, escape it with Regex.Escape(). This finds all reserved characters and escapes them with a backslash.
Dim searchPattern = Regex.Escape("[")
Return Regex.Replace(html, searchPattern, ""). 'etc...
But why do you need to use regex anyway? Here's a better way of doing it, I think, using StringBuilder:
Dim sb = New StringBuilder(html) _
.Replace("[", "") _
.Replace(",", " ") _
.Replace("]", "") _
.Replace(Chr(34), " ")
Return sb.ToString()
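
For comparison with the R questions on this page, the same pitfall exists in R, where an unescaped "[" is also an invalid pattern. The sketch below is not from the thread; the input string is a made-up example. It shows the two usual ways out: escaping the metacharacter, or asking for a literal match with fixed = TRUE.

```r
html <- "[\"a\",\"b\"]"   # hypothetical input: ["a","b"]

# gsub("[", "", html) would fail, because "[" opens a bracket expression.
no_open_escaped <- gsub("\\[", "", html)              # escape the metacharacter
no_open_fixed   <- gsub("[", "", html, fixed = TRUE)  # or match it literally

# The full chain of literal replacements from the question
out <- html
for (pair in list(c("[", ""), c(",", " "), c("]", ""), c("\"", " "))) {
  out <- gsub(pair[1], pair[2], out, fixed = TRUE)
}
print(out)
```

As in the StringBuilder answer, literal replacement avoids the escaping question altogether when no pattern matching is actually needed.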

RegEx for computer name validation (cannot be more than 15 characters long, be entirely numeric, or contain the following characters...)

I have these requirements to follow:
Windows computer name cannot be more than 15 characters long, be
entirely numeric, or contain the following characters: ` ~ ! @ # $ % ^
& * ( ) = + _ [ ] { } \ | ; : . ' " , < > / ?.
I want to create a RegEx to validate a given computer name.
I can see that the only permitted character is - and so far I have this:
/^[a-zA-Z0-9-]{1,15}$/
which matches almost all constraints except the "not entirely numeric" part.
How can I add this last constraint to my RegEx?
You could use a negative lookahead:
^(?![0-9]{1,15}$)[a-zA-Z0-9-]{1,15}$
Or simply use two regular expressions:
^[a-zA-Z0-9-]{1,15}$
AND NOT
^[0-9]{1,15}$
Here is a live example:
var regex1 = /^(?![0-9]{1,15}$)[a-zA-Z0-9-]{1,15}$/;
var regex2 = /^[a-zA-Z0-9-]{1,15}$/;
var regex3 = /^[0-9]{1,15}$/;
var text1 = "lklndlsdsvlk323";
var text2 = "4214124";
console.log(text1 + ":", !!text1.match(regex1));
console.log(text1 + ":", text1.match(regex2) && !text1.match(regex3));
console.log(text2 + ":", !!text2.match(regex1));
console.log(text2 + ":", text2.match(regex2) && !text2.match(regex3));
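
The same two approaches can be sketched in R, the language used elsewhere on this page. The function names and the extra test names below are illustrative, not from the answer. Lookaheads need perl = TRUE in R, while the two-pattern version works with the default engine.

```r
# Lookahead version: reject all-digit names, then require 1-15 allowed characters
is_valid_lookahead <- function(x) {
  grepl("^(?![0-9]{1,15}$)[a-zA-Z0-9-]{1,15}$", x, perl = TRUE)
}

# Two-pattern version: allowed characters AND NOT entirely numeric
is_valid_two_step <- function(x) {
  grepl("^[a-zA-Z0-9-]{1,15}$", x) & !grepl("^[0-9]{1,15}$", x)
}

candidates <- c("lklndlsdsvlk323", "4214124", "my-pc", "bad_name", "abcdefghijklmnop")
valid <- is_valid_lookahead(candidates)
print(data.frame(candidates, valid))
```

The two-pattern form is easier to read and to extend if more rules are added later; the lookahead form is convenient when only a single pattern can be supplied.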

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, as below, but I was wondering whether it can be done with a single call to gsub.
trim <- function (x) {
  x <- gsub("^\\s+|\\s+$", "", x)
  gsub("\\s+", " ", x)
}
testString <- "  This   is  a   test.  "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind:
gsub("^ *|(?<= ) | *$", "", testString, perl=TRUE)
# "This is a test."
Explanation:
## "^ *"     matches any leading spaces
## "(?<= ) " has the general form (?<=a)b: it matches "b" (a space here)
##           only when it is preceded by "a" (another space here)
## " *$"     matches any trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=TRUE)
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString <- "  This   is  a   test.  "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub calls, though this is less elegant than the previous answers:
> gsub("\\s+"," ",gsub("^\\s+|\\s+$","",testString))
[1] "This is a test."
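
To check that the base-R answers in this thread agree, they can be run side by side on one input. This is only a sketch: the input string is an assumed example with leading, trailing, and repeated internal spaces, and perl = TRUE is added to the capturing-group pattern so that the alternatives are tried in the order written.

```r
# Assumed test input (spaces only, no tabs or newlines)
x <- "  This   is  a    test.   "

results <- c(
  capture    = gsub("^ +| +$|( ) +", "\\1", x, perl = TRUE),
  lookbehind = gsub("^ *|(?<= ) | *$", "", x, perl = TRUE),
  lookahead  = gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl = TRUE),
  nested     = gsub("\\s+", " ", gsub("^\\s+|\\s+$", "", x)),
  scan_paste = paste(scan(text = x, what = "", quiet = TRUE), collapse = " ")
)
print(results)
```

With tabs or newlines in the input the variants can differ: the patterns written with a literal space ignore other whitespace, and the lookahead pattern keeps the last character of each run, which is not necessarily a space.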

regex remove punct removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that
gsub("[[:punct:]]", "", txt)
actually removes a relevant character. The character is "ק" and it is located in the "E" spot on the keyboard. Interestingly, the gsub function in R removes the "ק" character and then all words get messed up. Does anyone have an idea why?
According to Regular Expressions as used in R:
Certain named classes of characters are predefined. Their
interpretation depends on the locale (see locales); the interpretation
below is that of the POSIX locale.
According to the POSIX locale, [[:punct:]] should capture ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. So, you might need to adjust your regex to remove only the characters you want:
txt <- "!\"#$%&'()*+,\\-./:;<=>?@[\\\\^\\]_`{|}~"
gsub("[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]", "", txt, perl = TRUE)
Sample program output:
[1] ""
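
A more compact way to write the same explicit set is with ASCII ranges, which cover the same 32 characters without relying on the locale's definition of punctuation. This is a sketch that assumes the text is UTF-8 and that only ASCII punctuation should be removed; the Hebrew sample is written with \u escapes so that it does not depend on the file encoding.

```r
# Sample text: two Hebrew words (the first starts with qof, U+05E7),
# separated by a comma and ending with an exclamation mark
txt <- "\u05e7\u05d5\u05e8\u05d0, \u05e9\u05dc\u05d5\u05dd!"

# ASCII punctuation as four code-point ranges: ! to /, : to @, [ to `, { to ~
cleaned <- gsub("[!-/:-@\\[-`{-~]", "", txt, perl = TRUE)
print(cleaned)
```

Non-ASCII punctuation, such as the Hebrew maqaf, is left in place by this pattern and would need to be added to the class explicitly.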

Look for specific character in string and place it at different positions after a defined separator in the same string

Let's define the following string s:
s <- "$ A; B; C;"
I need to translate s into:
"$ A; $B; $C;"
The semicolon is the separator. However, $ is only one of 3 special characters that can appear in the string. The data frame m holds all 3 special characters:
m <- data.frame(sp = c("$", "%", "&"))
I first used strsplit to split the string using the semicolon as the separator
> strsplit(s, ";")
[[1]]
[1] "$ A" " B" " C"
I think the next step would be to use grep or match to check whether the first substring contains any of the 3 special characters defined in data frame m. If so, maybe use gsub to insert the matched special character into the remaining substrings. Then simply use paste with collapse = "" to merge the substrings together again. Does that make sense?
Cheers
What about something like this:
getmeout = gsub("[$%& ]", "", unlist(strsplit(s, ";")))
whatspecial = unique(gsub("[^$%&]", "", s))
whatspecial
# [1] "$"
getmeout
# [1] "A" "B" "C"
paste0(whatspecial, getmeout, ";", collapse="")
# [1] "$A;$B;$C;"
Here is one method:
library(stringr)
separator <- '; '
# extract the first part
first.part <- str_split(s, separator)[[1]][1]
first.part
# [1] "$ A"
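# try to identify your special character; fixed() makes the match literal,
# since "$" read as a regex is the end-of-string anchor and would always match
special <- m$sp[str_detect(first.part, fixed(as.character(m$sp)))]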
special
# [1] $
# Levels: $ & %
# make sure you only matched one of them
stopifnot(length(special) == 1)
# search and replace
gsub(separator, paste(separator, special, sep=""), s)
# [1] "$ A; $B; $C;"
Let me know if I missed some of your assumptions.
Back-referencing turns it into a one-liner:
s <- c( "$ A; B; C;", "& A; B; C;", "% A; B; C;" )
ms = c("$", "%", "&")
s <- gsub( paste0("([", paste(ms,collapse="") ,"]) ([A-Z]); ([A-Z]); ([A-Z]);") , "\\1 \\2; \\1 \\3; \\1 \\4" , s)
> s
[1] "$ A; $ B; $ C" "& A; & B; & C" "% A; % B; % C"
You can then make the regular expression appropriately generic (match more than one space, more than one alphanumeric character, etc.) if you need to.
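
One possible generalisation, as a sketch: the function name prefix_fields and the extra test strings are made up for illustration. It reads the leading special character from each string and inserts it after every "; " separator, so the number of fields no longer has to be fixed. It assumes the special characters are safe inside a bracket expression (none of ], ^, - or \).

```r
prefix_fields <- function(s, specials = c("$", "%", "&")) {
  # Bracket expression built from the special characters, e.g. [$%&]
  cls <- paste0("[", paste(specials, collapse = ""), "]")
  # Leading special character of each string ("" when there is none)
  lead <- ifelse(grepl(paste0("^", cls), s), substr(s, 1, 1), "")
  # Insert that character after every "; " separator, matched literally
  mapply(function(str, sp) gsub("; ", paste0("; ", sp), str, fixed = TRUE),
         s, lead, USE.NAMES = FALSE)
}

out <- prefix_fields(c("$ A; B; C;", "% A; B; C; D;", "A; B;"))
print(out)
```

Strings without a leading special character are returned unchanged, which may or may not be the desired behaviour for real data.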