Remove all special characters from a string in R? - regex

How can I remove all special characters from a string in R and replace them with spaces?
Some of the special characters to remove are: ~!##$%^&*(){}_+:"<>?,./;'[]-=
I've tried a regex with the [:punct:] pattern, but it removes only punctuation marks.
Question 2: How can I also remove characters from foreign languages, such as â í ü Â á ą ę ś ć?
Answer: Use [^[:alnum:]] to remove ~!##$%^&*(){}_+:"<>?,./;'[]-= and use [^a-zA-Z0-9] to also remove â í ü Â á ą ę ś ć, in the regex functions (gsub, regexpr and so on).
Solution in base R :
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-="
gsub("[[:punct:]]", "", x) # no libraries needed
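As a minimal sketch of the ASCII-only pattern from the answer above: the sample string and the variable names are made up for illustration, and perl = TRUE is an added assumption so that the a-z ranges are read by code point whatever the locale.

```r
# Sample input "café naïve 100%"; the accents are written as \u escapes so
# the string is marked as UTF-8 regardless of the session locale.
x <- "caf\u00e9 na\u00efve 100%"

# ASCII-only whitelist: accented letters count as special and become spaces.
ascii_only <- gsub("[^a-zA-Z0-9]", " ", x, perl = TRUE)

# Collapse the runs of spaces left behind and trim the ends.
tidy <- trimws(gsub(" +", " ", ascii_only))

# [^[:alnum:]] usually keeps accented letters, but the result is locale-dependent.
locale_dep <- gsub("[^[:alnum:]]", " ", x)

print(ascii_only)  # "caf  na ve 100 "
print(tidy)        # "caf na ve 100"
```

Replacing with a space rather than with "" keeps word boundaries intact, at the cost of the extra clean-up step.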

You need to use regular expressions to identify the unwanted characters. For the most easily readable code, use str_replace_all from the stringr package, though gsub from base R works just as well.
The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.
library(stringr)
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-=" # or whatever
str_replace_all(x, "[[:punct:]]", " ")
(The base R equivalent is gsub("[[:punct:]]", " ", x).)
An alternative is to swap out all non-alphanumeric characters.
str_replace_all(x, "[^[:alnum:]]", " ")
Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
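If the locale dependence is a concern, one possible alternative (not part of the answer above) is to use Unicode property classes with perl = TRUE. The sketch below uses a made-up sample string and assumes R's PCRE build has Unicode property support, which is the usual case.

```r
# Sample input "święto, 2024!" written with \u escapes (marked as UTF-8).
x <- "\u015bwi\u0119to, 2024!"

# \p{L} matches any Unicode letter and \p{N} any Unicode number, so accented
# letters are kept and everything else is replaced with a space.
kept <- gsub("[^\\p{L}\\p{N}]", " ", x, perl = TRUE)

print(kept)  # "święto  2024 "
```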

Instead of using regex to remove those "crazy" characters, just convert them to ASCII, which will remove accents, but will keep the letters.
astr <- "Ábcdêãçoàúü"
iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')
which results in (the exact transliteration depends on the platform's iconv implementation)
[1] "Abcdeacoauu"
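Because the output of //TRANSLIT varies between platforms (some insert quote marks or "?" for accents), a cautious sketch is to strip whatever non-alphanumeric residue the transliteration leaves behind. The helper name to_ascii is hypothetical, and no expected output is shown because it depends on the system's iconv.

```r
# Hypothetical helper: transliterate to ASCII, then clean up the residue.
to_ascii <- function(s) {
  # sub = "?" keeps iconv from returning NA when a byte cannot be converted.
  out <- iconv(s, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "?")
  # Drop anything that is not an ASCII letter, digit or space, including
  # quote or "?" marks that the transliteration may have introduced.
  gsub("[^A-Za-z0-9 ]", "", out, perl = TRUE)
}

res <- to_ascii("\u00c1bcd\u00ea\u00e3\u00e7o\u00e0\u00fa\u00fc 42")
print(res)
```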

Convert the special characters to apostrophes:
Data <- gsub("[^0-9A-Za-z///' ]","'" , Data ,ignore.case = TRUE)
The code below then removes the doubled apostrophes ('') that result:
Data <- gsub("''","" , Data ,ignore.case = TRUE)
Use the gsub(..) function to replace the special characters with apostrophes.

Related

regex for specific pattern with special characters

I have the following in a data.frame in r:
example <- "Inmuebles24_|.|_Casa_|.|_Renta_|.|_NuevoLeon"
I would like to simply use stringr's str_count and some basic gregexpr functions on the string, but I'm stuck on the regex.
The delimiter is clearly (and confusingly): _|.|_
How would this be expressed with regex?
Currently trying to escape everything to no success:
str_count(string = example, pattern = "[\\_\\|\\.\\|\\_]")
Your regex does not work because you placed it inside a character class (where, by the way, you do not need to escape _). See my answer to "Regex expression not working with once or none" for an explanation of the issue: the characters are treated as separate symbols rather than as a sequence of symbols, and all the special symbols are treated as literals, too.
You can achieve what you want in two steps:
Trim the string from the delimiters with gsub
Use str_count + 1 to get the count (as the number of parts = number of delimiters inside the string + 1)
R code:
example <- "_|.|_Inmuebles24_|.|_Casa_|.|_Renta_|.|_NuevoLeon_|.|_"
str_count(string = gsub("^(_[|][.][|]_)+|(_[|][.][|]_)+$", "", example), pattern = "_\\|\\.\\|_") + 1
## => 4
Or, in case you have multiple consecutive delimiters, you need another intermediate step to "contract" them into one:
example <- "_|.|_Inmuebles24_|.|_Casa_|.|__|.|_Renta_|.|__|.|_NuevoLeon_|.|_"
example <- gsub("((_[|][.][|]_)+)", "_|.|_", example)
str_count(string = gsub("^(_[|][.][|]_)+|(_[|][.][|]_)+$", "", example), pattern = "_\\|\\.\\|_") + 1
## => 4
Notes on the regexps: _[|][.][|]_ matches _|.|_ literally, because symbols inside [...] character classes lose their special meaning. ((_[|][.][|]_)+), used in the second snippet, matches 1 or more (+) consecutive delimiters. The ^(_[|][.][|]_)+|(_[|][.][|]_)+$ pattern matches 1 or more delimiters at the start (^) and end ($) of the string.
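This gives you what you want for the specific example you've given, because \w matches letters, digits and underscores but not | or ., so each name (with its adjacent underscores) counts as one match: str_count(example, "\\w+")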
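Since the delimiter is a fixed string, another option is to skip regex escaping altogether and match it literally. The following base R sketch uses fixed = TRUE; the variable names delim, parts and n_delims are illustrative.

```r
example <- "Inmuebles24_|.|_Casa_|.|_Renta_|.|_NuevoLeon"
delim <- "_|.|_"

# fixed = TRUE treats the delimiter as a literal string, not as a regex.
parts <- strsplit(example, delim, fixed = TRUE)[[1]]

# Count the literal occurrences of the delimiter.
n_delims <- lengths(regmatches(example, gregexpr(delim, example, fixed = TRUE)))

print(parts)     # "Inmuebles24" "Casa" "Renta" "NuevoLeon"
print(n_delims)  # 3
```

In stringr, the counterpart is to wrap the pattern in fixed(), as in str_count(example, fixed("_|.|_")).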

How to replace a symbol by a backslash in R?

Could you help me to replace a char by a backslash in R? My trial:
gsub("D","\\","1D2")
Thanks in advance
You need to re-escape the backslash because it needs to be escaped once as part of a normal R string (hence '\\' instead of '\'), and in addition it’s handled differently by gsub in a replacement pattern, so it needs to be escaped again. The following works:
gsub('D', '\\\\', '1D2')
# "1\\2"
The reason the result looks different from the desired output is that R doesn’t actually print the result, it prints an interpretable R string (note the surrounding quotation marks!). But if you use cat or message it’s printed correctly:
cat(gsub('D', '\\\\', '1D2'), '\n')
# 1\2
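To check that the result really holds a single backslash and that only the printed form is escaped, one can inspect its length and character codes. A small sketch (the variable name res is illustrative):

```r
res <- gsub("D", "\\\\", "1D2")

print(res)             # [1] "1\\2"  (escaped display)
cat(res, "\n")         # 1\2
print(nchar(res))      # 3
print(utf8ToInt(res))  # 49 92 50, where 92 is the code of the backslash
```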
When inputting backslashes from the keyboard, always escape them:
gsub("D","\\\\","1D2")
#[1] "1\\2"
or,
gsub("D","\\","1D2", fixed=TRUE)
#[1] "1\\2"
or,
library(stringr)
str_replace("1D2","D","\\\\")
#[1] "1\\2"
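Note: If you want print() to show something like "1\2", I'm afraid you can't do that in R (as far as I know): a backslash inside a printed string is always displayed escaped, even though the string itself holds a single backslash. You can use forward slashes in path names to avoid this.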
For more information, refer to this issue raised in R help: How to replace double backslash with single backslash in R.
gsub("\\p{S}", "\\\\", text, perl=TRUE);
\p{S} ... Match a character from the Unicode category symbol.
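As a rough illustration of what \p{S} covers: the sample string below is made up, and the sketch assumes PCRE with Unicode property support. Math and currency signs are symbols, whereas "!" belongs to the punctuation category and is left alone.

```r
# Sample input "a+b=c€!" (the euro sign is written as a \u escape).
text <- "a+b=c\u20ac!"

# Each symbol (+, = and the euro sign) is replaced with one backslash.
res <- gsub("\\p{S}", "\\\\", text, perl = TRUE)

cat(res, "\n")  # a\b\c\!
```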

Replace inside matched string with Notepad++ and regex

I have some lines in a text file:
Joëlle;Dupont;123456
Alex;Léger;134234
And I want to replace them with:
Joëlle;Dupont;123456;joelle.dupont@mail.com
Alex;Léger;134234;alex.leger@mail.com
I want to replace all accented characters (é, ë…) with their unaccented counterparts (e, e…), but only in the mail address, that is, only in part of the line.
I know I can use \L\E to change uppercase letters into lowercase letters, but that's not the only thing I have to do.
I used:
(.*?);(.*?);(\d*?)\n
and replaced it with:
$1;$2;$3;\L$1.$2@mail.com\E\n
But this doesn't replace the accented characters:
Joëlle;Dupont;123456;joëlle.dupont@mail.com
Alex;Léger;134234;alex.léger@mail.com
If you have any idea how I could do this with Notepad++, even with more than one replacement, please let me know.
I don't know the full set of characters in your data, but you could use the pattern below to replace the variations of e with an e:
[\xE8-\xEB](?!.*;)
And replace with e.
[I got the range above from this webpage, taking the column names]
This regex matches any è, é, ê or ë and replaces them with an e, if there is no ; on the same line after it.
For variations of o:
[\xF2-\xF6](?!.*;)
For c (there's only one, so you can also put in ç directly):
\xE7(?!.*;)
For a:
[\xE0-\xE5](?!.*;)
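For comparison, the same transformation can be sketched outside Notepad++, in R, the language used elsewhere on this page. The helper name add_email and its suffix parameter (defaulting to "@mail.com") are assumptions made for illustration; the accent table covers only the a, c, e and o ranges listed above.

```r
# Hypothetical helper: append an unaccented, lowercase address to one line.
add_email <- function(line, suffix = "@mail.com") {
  parts <- strsplit(line, ";", fixed = TRUE)[[1]]
  address <- tolower(paste(parts[1], parts[2], sep = "."))
  # Map à-å to a, ç to c, è-ë to e and ò-ö to o (16 characters each side).
  accented <- paste0("\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5\u00e7",
                     "\u00e8\u00e9\u00ea\u00eb\u00f2\u00f3\u00f4\u00f5\u00f6")
  address <- chartr(accented, "aaaaaaceeeeooooo", address)
  paste0(line, ";", address, suffix)
}

print(add_email("Jo\u00eblle;Dupont;123456"))
print(add_email("Alex;L\u00e9ger;134234"))
```

Only the appended address is changed, so the accents in the original name fields are preserved.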

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw Twitter snippets using R, but I keep running into problems with non-standard characters such as "🏄".
I would like to take out all non [abcdefghijklmnopqrstuvwxyz0123456789] characters using gsub.
Can you use gsub to specify a replace for those items NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that the class [:alnum:] is locale-dependent and can match more than ASCII letters and digits, including the special characters you gave. That's why gsub("[^[:alnum:]]", "", x) doesn't work here.
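Another way to get the same result, shown here only as a sketch, is to drop every non-ASCII byte with iconv first and then apply the lowercase whitelist from the question. The emoji is written as a \U escape so the example does not depend on the file encoding; the variable names are illustrative.

```r
x <- "abcde\U0001F3C4fgh"

# sub = "" deletes every byte that cannot be represented in ASCII.
ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Keep only a-z and 0-9, after lowercasing.
clean <- gsub("[^a-z0-9]", "", tolower(ascii), perl = TRUE)

print(clean)  # "abcdefgh"
```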

Replacing non alphabetical characters & numbers with other special characters

I am using the following code to take out anything other than alphabetical characters, numbers, question marks, exclamation points, periods, parentheses, commas and hyphens:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9]", ""))
I come up with this: hellotoyousMy#is4425235584
It should read like so: hello to yous! My # is (442) 523-5584.?,
Simply add all characters to your negated character class (take note of the space character!):
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 ?!.(),#-]+", ""))
(I also added a repeating + to your regex, so it can replace consecutive disallowed characters in one go)
Add a space and other symbols in the regex:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 \(\)\!\.,\-\?]", ""))
Regex.Replace("your text", "[^A-Za-z0-9 ?!.(),-]+", "")
The pattern [^A-Za-z0-9 ?!.(),-]+ matches each run of consecutive unwanted characters and replaces it with "".
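The whitelist approach is not specific to .NET. As a sketch, here is the pattern that includes # run through R's gsub, the language used elsewhere on this page. perl = TRUE is an added assumption to keep the ranges locale-independent, and the hyphen sits last in the class so that it is read literally.

```r
# The backslash in the sample text has to be doubled inside an R string.
msg <- "hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\\|/?,+-=:;`~"

# Remove every run of characters that is not on the whitelist.
cleaned <- gsub("[^A-Za-z0-9 ?!.(),#-]+", "", msg, perl = TRUE)

print(cleaned)  # "hello to yous! My # is (442) 523-5584. #?,-"
```

Every whitelisted character survives wherever it occurs, so the second # and the trailing hyphen are kept as well.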