Replace special characters (dash) - regex

I was attempting to replace what I thought was a standard dash using gsub. The code I was testing was:
gsub("-", "ABC", "reported – estimate")
This does nothing, though. I copied and pasted the dash into http://unicodelookup.com/#–/1 and it seems to be an en dash. That site provides the hex, decimal and other codes for an en dash, and I've been trying to replace the en dash but am not having any luck. Suggestions?
(As a bonus, if you can tell me if there is a function to identify special characters that would be helpful).
I'm not sure if SO's code formatting will change the dash format so here is the dash I'm using (–).

You can replace the en-dash by just specifying it in the regex pattern.
gsub("–", "ABC", "reported – estimate")
You can match all hyphens, en- and em-dashes with
gsub("[-–—]", "ABC", "reported – estimate — more - text")
See IDEONE demo
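For readers outside R, the same character-class idea can be sketched in JavaScript. This is only an illustration and not part of the original answer; the helper name replaceDashes is hypothetical, and the dashes are written as escapes so the code points are unambiguous.

```javascript
// Replace hyphen-minus (U+002D), en dash (U+2013) and em dash (U+2014).
// replaceDashes is a hypothetical helper name used only for illustration.
function replaceDashes(text, replacement) {
  return text.replace(/[-\u2013\u2014]/g, replacement);
}

const dashed = "reported \u2013 estimate \u2014 more - text";
console.log(replaceDashes(dashed, "ABC"));
```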
To check if there are non-ascii characters in a string, use
> s = "plus ça change, plus c'est la même chose"
> gsub("[[:ascii:]]+", "", s, perl=T)
[1] "çê"
See this IDEONE demo
You will either get an empty result (if the string consists only of ASCII characters) or, as here, the non-ASCII "special" characters.
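As a rough cross-check of the "identify special characters" bonus question, the following JavaScript sketch strips the ASCII range and also reports the code point of each non-ASCII character. The function name findNonAscii is an assumption made for this example, not an existing API.

```javascript
// List every non-ASCII character in a string together with its code point.
// findNonAscii is a hypothetical helper name used only for illustration.
function findNonAscii(text) {
  const found = [];
  for (const ch of text) {                      // iterates by code point
    const cp = ch.codePointAt(0);
    if (cp > 0x7f) {
      const hex = cp.toString(16).toUpperCase().padStart(4, "0");
      found.push({ char: ch, codePoint: "U+" + hex });
    }
  }
  return found;
}

const sample = "plus \u00e7a change, plus c'est la m\u00eame chose";
const onlyNonAscii = sample.replace(/[\x00-\x7F]+/g, "");  // drop all ASCII runs
console.log(onlyNonAscii);
console.log(findNonAscii("reported \u2013 estimate"));
```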

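For special character replacement you can use a negated character class.
gsub('[^\\w]+', 'ABC', 'reported - estimate', perl = TRUE) will replace each run of non-word characters (here the " - ", spaces included) with ABC. The [^\w] is a pattern that matches anything that isn't a word character (letter, digit or underscore). Use + rather than *, because * also matches the empty string and would insert ABC at every position between characters.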
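The choice of quantifier matters for a negated class like this one. The JavaScript sketch below (an illustration only, with made-up variable names) shows that + replaces each run of non-word characters once, whereas * also matches the empty string at every position.

```javascript
// "+" needs at least one non-word character, so only " - " is replaced.
const withPlus = "reported - estimate".replace(/[^\w]+/g, "ABC");

// "*" also matches the empty string, so a replacement lands at every position.
const withStar = "ab".replace(/[^\w]*/g, "X");

console.log(withPlus);
console.log(withStar);
```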

Related

In Elixir, how can I split a string with non-word characters as separators, but also allow for math operators like +, -, etc.?

In Elixir, I would like to split a string, treating all the non-word characters as separators, including the "Ogham Space Mark ( )" (which should not be confused for a minus (-) sign).
So, if I split the string:
"1\x002\x013\n4\r5 6\t7 + asda - 3434"
The result should be:
["1","2","3","4","5","6","7","+","asda","-","3434"]
I'm trying to figure out how to do this with Regex, but the best I've been able to accomplish so far is:
Regex.split(~r/[\W| ]+/, input_string)
.... but this drops the + and - sign as these are not considered word characters.
or
Regex.split(~r/[^[:punct:]|^[:alnum:]| ]+/, input_string)
but this fails to split on the Ogham Space Mark.
This will actually work correctly, but it is inelegant for the extra transformation:
Regex.split(~r/[^[:punct:]|^[:alnum:]]+/, String.replace(input_string, " ", " "))
Is there any way to split this with a single Regex invocation?
Elixir regular expressions are handled by the PCRE regex engine, and your input string contains characters from the whole Unicode character table, not just the ASCII part.
You may enable Unicode mode with the help of two PCRE verbs, (*UTF)(*UCP):
Regex.split(~r/(*UTF)(*UCP)[^\w\/*+-]+/, "1\x002\x013\n4\r5 6\t7 + asda - 3434")
It will output:
["1", "2", "3", "4", "5", "6", "7", "+", "asda", "-", "3434"]
See the Elixir demo online.
NOTE: ~r/[^\w\/*+-]+/u and ~r/(*UTF)(*UCP)[^\w\/*+-]+/ are equivalent; u is shorthand for the two PCRE verbs.
The regex matches:
(*UTF)(*UCP) - (*UTF) treats the input string as a Unicode code point sequence and (*UCP) makes \w Unicode-aware (so that it matches [\p{L}\p{N}_] characters)
[^\w\/*+-]+ - 1 or more characters other than word characters (letters, digits, _), /, *, + and -.
Note that - in the meaning of a literal - char does not have to be escaped when placed at the end of the character class.
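The same split can be approximated in JavaScript, where \w is never Unicode-aware, so the class is spelled out with \p{L}\p{N}_ and the u flag. This is a sketch under the assumption that the space between 5 and 6 in the input is the Ogham Space Mark (U+1680), written here as an escape.

```javascript
// Split on runs of anything that is not a letter, digit, underscore, /, *, + or -.
// The u flag is required for \p{...} property escapes.
const input = "1\x002\x013\n4\r5\u16806\t7 + asda - 3434";
const tokens = input.split(/[^\p{L}\p{N}_\/*+-]+/u);

console.log(tokens);
```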

Ruby - split on nonalphanumeric characters excluding international characters?

This is my regex so far which will split on non-alphanumeric characters, including international characters (ie Korean, Japanese, Chinese characters).
title = '[MV] SUNMI(선미) _ 누아르(Noir)'
title.split(/[^a-zA-Z0-9 ']/)
this is the regex to match any international character:
[^\x00-\x7F]+
Which I got from: Regular expression to match non-English characters? Let's assume this is 100% correct (no debating!)
How do I combine these 2 so I can split on non-alphanumeric characters, excluding international characters? The easy part is done. I just need to combine these regex's somehow.
My expected output would be something like this
["MV", "SUNMI", "선미", "누아르", "Noir"]
TLDR: I want to split on non-alphanumeric characters only (english letters, foreign characters should not be split on)
(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+
https://regex101.com/r/EDyluc/1
What is not matched (remains from split) is what you want to keep.
Explained:
(?:
[^a-zA-Z0-9] # Not Ascii AlphaNum
(?<! [^\x00-\x7F] ) # Behind, not not Ascii range (Ascii boundary)
)+
Let me know if you need a more detailed explanation.
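To see the lookbehind version in action outside Ruby, here is a small JavaScript sketch (JavaScript also supports lookbehind in current engines). Variable names are illustrative; empty strings produced by leading and trailing separators are filtered out.

```javascript
// A separator is a run of characters that are not ASCII alphanumerics AND,
// checked via the lookbehind on the character just consumed, are ASCII.
const separator = /(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+/;
const title = "[MV] SUNMI(선미) _ 누아르(Noir)";

const words = title.split(separator).filter((part) => part.length > 0);
console.log(words);
```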
So basically you want to split on all ASCII characters that are not letters. You can use this regex, which matches those characters within the ASCII range.
[ -@[-`{-~]+
This regex covers the range from space to @, skips the uppercase letters, covers [ to backtick, skips the lowercase letters, and then covers { to ~, as can be seen in an ASCII table. Note that the first range also contains the digits 0-9, so digits act as separators too.
In case you also want to exclude the extended ASCII characters, you can replace ~ in the regex with ÿ and use the [ -@[-`{-ÿ]+ regex.
Demo
Check out this Ruby code:
s = '[MV] SUNMI(선미) _ 누아르(Noir)'
puts s.split(/[ -@\[-`{-~]+/)
Prints,
MV
SUNMI
선미
누아르
Noir
Online Ruby Demo
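The range-based class can be tried in JavaScript as well. This sketch assumes the first range runs from space to @ (which is what makes parentheses act as separators), and it also shows the side effect that digits are split on.

```javascript
// ASCII characters that are not letters: space..@, [..`, {..~
const asciiNonLetters = /[ -@\[-`{-~]+/;

const pieces = "[MV] SUNMI(선미) _ 누아르(Noir)"
  .split(asciiNonLetters)
  .filter(Boolean);                       // drop empty leading/trailing pieces
const digitsSplit = "abc123def".split(asciiNonLetters);

console.log(pieces);
console.log(digitsSplit);
```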

Remove letters matching pattern before and after the required string

I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
See the regex demo
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
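For comparison, both ideas (replace the whole string with the captured group, or pull the group out of a match) can be sketched in JavaScript. The variable names are illustrative and not taken from the answer.

```javascript
const myvec = [
  "output.chr10.recalibrated",
  "output.chr11.recalibrated",
  "output.chrY.recalibrated",
];

// Like sub(): match the entire string and keep only capture group 1.
const viaReplace = myvec.map((s) => s.replace(/.*?chr(.*?)\.recalibrated.*/, "$1"));

// Like str_match(...)[,2]: read capture group 1 from the match object.
const viaMatch = myvec.map((s) => {
  const m = s.match(/chr(.*?)\.recalibrated/);
  return m ? m[1] : null;
});

console.log(viaReplace, viaMatch);
```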
Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does is:
.*[.]chr match as much as possible until finding '.chr' literally
([^.]*) capture everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, requiring at least one digit to be present)
[.].* match the rest of the line after a literal dot
I prefer the character class escape of dots ([.]) over the backslash escape (\\.) as it's usually easier to read when you come back to the regex; that's my opinion and not covered by any best practice I know of.
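The difference between the two patterns on the longer input can be checked with a short JavaScript sketch (variable names are made up for the example): the lazy capture runs up to .recalibrated, while the dot-bounded capture stops at the next dot.

```javascript
const tricky = "whatever.chr10.whateverelse.recalibrated";

// Lazy capture up to ".recalibrated" also swallows ".whateverelse".
const lazyCapture = tricky.replace(/.*?chr(.*?)\.recalibrated.*/, "$1");

// Capturing "everything that is not a dot" stops right after the chromosome id.
const dotBounded = tricky.replace(/.*[.]chr([^.]*)[.].*/, "$1");

console.log(lazyCapture, dotBounded);
```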
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and come before the .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match the characters until chr or (|) that starts from .recalibrated to the end ($) of the string and replace it with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
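Both variants translate directly to JavaScript, which may help when testing the patterns interactively. This is a sketch with illustrative names; the lookaround version returns the match itself, and the second version deletes everything around it.

```javascript
const names = [
  "output.chr10.recalibrated",
  "output.chr11.recalibrated",
  "output.chrY.recalibrated",
];

// Lookbehind/lookahead: the match is exactly the text between the two markers.
const viaLookaround = names.map((s) => {
  const m = s.match(/(?<=chr).*(?=\.recalibrated)/);
  return m ? m[0] : null;
});

// Alternation: remove the prefix up to ".chr" or the suffix from ".recalibrated".
const viaStrip = names.map((s) => s.replace(/.*\.chr|\.recalibrated.*$/g, ""));

console.log(viaLookaround, viaStrip);
```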
Looks like an XY problem. Why extract? If this is needed in further analysis steps, we could for example do this instead:
for(chrN in c(1:22, "X", "Y")) {
myVar <- paste0("output.chr", chrN, ".recalibrated")
#do some fun stuff with myVar
print(myVar)
}

What's a good regex to include accented characters in a simple way?

Right now my regex is something like this:
[a-zA-Z0-9] but it does not include accented characters like I would want to. I would also like - ' , to be included.
Accented Characters: DIY Character Range Subtraction
If your regex engine allows it (and many will), this will work:
(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$
Please see the demo (you can add characters to test).
Explanation
(?i) sets case-insensitive mode
The ^ anchor asserts that we are at the beginning of the string
(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
[-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which we need to subtract
The + matches that one or more times
The $ anchor asserts that we are at the end of the string
Reference
Extended ASCII Table
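A quick way to experiment with the subtraction idea is the JavaScript sketch below. It assumes the i flag in place of the inline (?i), and writes the characters as \u escapes: the lookahead lists × Þ ß ÷ þ ø and the range is À-ÿ. The constant name is illustrative.

```javascript
// Lookahead "subtracts" U+00D7, U+00DE, U+00DF, U+00F7, U+00FE, U+00F8
// from the allowed set: dash, apostrophe, digits, a-z and U+00C0..U+00FF.
const accentedWord =
  /^(?:(?![\u00D7\u00DE\u00DF\u00F7\u00FE\u00F8])[-'0-9a-z\u00C0-\u00FF])+$/i;

console.log(accentedWord.test("Cr\u00EApes"));       // accented word
console.log(accentedWord.test("d\u00E9j\u00E0-vu")); // accents plus a dash
console.log(accentedWord.test("3\u00D74"));          // contains the multiplication sign
```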
You put in your expression:
\p{L}\p{M}
This in Unicode will match:
any letter character (L) from any language
and marks (M)(i.e, a character that is to be combined with another: accent, etc.)
A version without the exclusion rules:
^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$
Explanation
The ^ anchor asserts that we are at the beginning of the string
[...] allows dash, apostrophe, letters, and chars in a wide accented range
(the ranges À-Ö, Ø-ö and ø-ÿ skip × and ÷; digits are not included),
The + matches that one or more times
The $ anchor asserts that we are at the end of the string
Reference
Extended ASCII Table
Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):
[-'[:alpha:]0-9] or [-'[:alnum:]]
The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale.
@NightCoder's answer works perfectly:
\p{L}\p{M}
and with no brittle whitelists. Note that to get it working in javascript you need to add the unicode u flag. Useful to have a working example in javascript...
[..."Crêpes are øh-so déclassée".matchAll( /[-'’\p{L}\p{M}\p{N}]+/giu )]
will return something like...
[
{
"0": "Crêpes",
"index": 0
},
{
"0": "are",
"index": 7
},
{
"0": "øh-so",
"index": 11
},
{
"0": "déclassée",
"index": 17
}
]
Here it is in a playground... https://regex101.com/r/ifgH4H/1/
And also some detail on those regex unicode categories... https://javascript.info/regexp-unicode
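Why \p{M} belongs in the class becomes visible when the accents are stored as separate combining marks (NFD form). The following sketch is an illustration with made-up variable names: it builds such a string explicitly and compares the two patterns.

```javascript
// "déclassée" written with combining acute accents (U+0301) after each plain e.
const decomposed = "de\u0301classe\u0301e";

// Letters only: every combining mark ends the current match.
const lettersOnly = decomposed.match(/\p{L}+/gu);

// Letters and marks: the whole word is matched in one piece.
const lettersAndMarks = decomposed.match(/[\p{L}\p{M}]+/gu);

console.log(lettersOnly, lettersAndMarks);
```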

Regex, every non-alphanumeric character except white space or colon

How can I do this one anywhere?
Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.
[^a-zA-Z\d\s:]
\d - numeric class
\s - whitespace
a-zA-Z - matches all the letters
^ - negates them all, so you get characters that are not letters, digits, whitespace or colons
This should do it:
[^a-zA-Z\d\s:]
If you want to treat accented latin characters (eg. à Ñ) as normal letters (ie. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:
/[^a-zA-Z\d\s:\u00C0-\u00FF]/g
^ negates what follows
a-zA-Z matches upper and lower case letters
\d matches digits
\s matches white space (if you only want to match spaces, replace this with a space)
: matches a colon
\u00C0-\u00FF matches the Unicode range for accented latin characters.
nb. Unicode range matching might not work for all regex engines, but the above certainly works in Javascript (as seen in this pen on Codepen).
nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
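A small JavaScript check of the pattern with the accented range is sketched below (variable names are illustrative). One caveat worth knowing: the block U+00C0 to U+00FF also contains × (U+00D7) and ÷ (U+00F7), so those two symbols are treated as letters by this class.

```javascript
const special = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

// Accented letters, digits, whitespace and the colon are left alone.
const found = "\u00E0 \u00D1: $5 & co;".match(special);

// The multiplication sign sits inside the range, so nothing matches here.
const timesSign = "3 \u00D7 4".match(special);

console.log(found, timesSign);
```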
Try this:
[^a-zA-Z0-9 :]
JavaScript example:
"!##$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")
See an online example:
http://jsfiddle.net/vhMy8/
In JavaScript:
/[^\w_]/g
^ negation, i.e. select anything not in the following set
\w any word character (i.e. any alphanumeric character, plus underscore)
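_ the underscore; listing it here is redundant, because \w already includes it, so this pattern does not match underscores. To match underscores as well, use /[\W_]/g instead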
Usage example - const nonAlphaNumericChars = /[^\w_]/g;
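How the underscore is treated is easy to verify. The sketch below (illustrative names only) compares the pattern from this answer with /[\W_]/g, which matches non-word characters plus the underscore.

```javascript
const text = "snake_case-name!";

// \w already contains "_", so adding it to the negated class changes nothing.
const underscoreKept = text.match(/[^\w_]/g);

// \W is "not a word character"; listing "_" next to it adds the underscore.
const underscoreMatched = text.match(/[\W_]/g);

console.log(underscoreKept, underscoreMatched);
```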
Matches anything that is not alphanumeric or white space, and also matches '_':
var reg = /[^\w\s]|[_]/g;
If you mean "non-alphanumeric characters", try to use this:
var reg =/[^a-zA-Z0-9]/g //[^abc]
This regex works for C#, PCRE and Go to name a few.
It doesn't work for JavaScript on Chrome from what RegexBuddy says. But there's already an example for that here.
The main part of this is:
\p{L}
which (also written \p{Letter}) represents any kind of letter from any language.
The full regex itself: [^\w\d\s:\p{L}]
Example: https://regex101.com/r/K59PrA/2
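In current JavaScript engines the same class works once the u flag is set. The sketch below is an illustration with made-up names; the test string mixes accented Latin letters and CJK characters, written as escapes.

```javascript
// The u flag enables \p{L}; \w, \d and \s keep their ASCII-oriented meaning.
const nonLetterSpecials = /[^\w\d\s:\p{L}]/gu;

// "naïve: café & 日本語!"
const mixed = "na\u00EFve: caf\u00E9 & \u65E5\u672C\u8A9E!";
const specials = mixed.match(nonLetterSpecials);

console.log(specials);
```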
Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.
According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:
U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF
So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):
\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF
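Thus, to answer the OP's question ("every non-alphanumeric character except white space or colon"), prepend a caret ^ to exclude the above characters, add the colon, and surround the whole thing with [ and ] so that it means 'any character not in this set'. As written, the class below still matches letters and digits; where the engine supports Unicode properties, adding \p{L}\p{N} inside the brackets excludes them as well.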
"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"
Debuggex Demo
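The class can be assembled programmatically rather than typed out. The JavaScript sketch below builds it from the code points listed above and compares it with a variant that also excludes letters and digits via \p{L}\p{N}; that variant and all names here are assumptions added for illustration, not part of the original answer.

```javascript
// Code points from the whitespace list above (31 entries).
const whitespaceCodePoints = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x0085, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008, 0x2009,
  0x200a, 0x200b, 0x200c, 0x200d, 0x2028, 0x2029, 0x202f, 0x205f, 0x2060, 0x3000,
  0xfeff,
];

// Turn each code point into a \uXXXX escape for use inside a character class.
const whitespaceClass = whitespaceCodePoints
  .map((cp) => "\\u" + cp.toString(16).padStart(4, "0"))
  .join("");

const asWritten = new RegExp("[^:" + whitespaceClass + "]", "gu");
const lettersExcluded = new RegExp("[^:" + whitespaceClass + "\\p{L}\\p{N}]", "gu");

const probe = "a1 b:c\u00a0$\u3000#";
console.log(probe.match(asWritten));        // still includes letters and digits
console.log(probe.match(lettersExcluded));  // only the remaining symbols
```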
Bonus: solution for R
trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
trimws(..., whitespace = whitespace)
}
This is even faster than trimws() itself which sets " \t\n\r".
microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642 100 a
#> trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571 100 b
Try to add this:
^[^a-zA-Z\d\s:]*$
This has worked for me... :)
[^\w\s-]
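A character set matching any character which is not:
Alphanumeric (or underscore)
Whitespace
Hyphen (write : in place of - to exclude the colon instead, as the question asks)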