How to escape special regex symbols from string - regex

I am using the following regex to match a value in a CSV string. For example:
var csvString = "abc-d, xy%z, efgh, ijklm, nopq(1)rst, uvw#xy";
var valueString = "xy%z";
var regExp = new RegExp('(^|, )' + valueString + '(,|$)');
csvString.replace(regExp, "")
The above regExp works well for any value in csvString except 'nopq(1)rst'. It fails when valueString contains '(' or ')' brackets, for example valueString = "nopq(1)rst";. I want the regular expression to match whatever valueString contains.
How can I escape special regex symbols such as '(', ')', '[', ']' and '\' in a string?

Use the escapeRegex method. It escapes any special characters.

You need to quote your input string to escape all the regex special characters:
var regexpSpecialChars = /([\[\]\^\$\|\(\)\\\+\*\?\{\}\=\!\.])/g; // includes "." so that a dot in the value is matched literally
var regExp = new RegExp('(^|, )' + valueString.replace(regexpSpecialChars,'\\$1') + '(,|$)');
nopq(1)rst fails to match because ( and ) are special regex symbols used for grouping, so the unescaped pattern behaves like the following one, which matches the text nopq1rst rather than nopq(1)rst:
(^|, )nopq1rst(,|$)
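A minimal sketch of the escaping approach described above, assuming the goal is simply to treat valueString as literal text. The helper names escapeRegExp and removeCsvValue are hypothetical, and the character class used here is one common choice rather than the only possible one.

```javascript
// Hypothetical helper: backslash-escape every character that is special in a JavaScript regex.
function escapeRegExp(text) {
  // "$&" in the replacement string stands for the matched character itself.
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: remove one value using the same pattern shape as the question.
function removeCsvValue(csvString, valueString) {
  var regExp = new RegExp('(^|, )' + escapeRegExp(valueString) + '(,|$)');
  return csvString.replace(regExp, '');
}

var csvString = "abc-d, xy%z, efgh, ijklm, nopq(1)rst, uvw#xy";
console.log(escapeRegExp("nopq(1)rst"));               // nopq\(1\)rst
console.log(removeCsvValue(csvString, "nopq(1)rst"));  // abc-d, xy%z, efgh, ijklm uvw#xy
```

Because the pattern consumes both the leading ", " and the trailing ",", the separator between the neighbouring values is removed as well. This mirrors the pattern from the question and is left unchanged here.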

Related

Capturing a delimiter that isn't in between single quotes

Like the question says, is it possible to use a single Regex string to get a delimiter that isn't in between some quotes?
For example, I want to split this string with the delimiter &:
"example=3&testing='f&tmp'"
should produce
["example=3", "testing='f&tmp'"]
Essentially, things inside single quotes (' ') should remain untouched.
I found out how to get things within quotes with expression: (?:'.*?')
The closest I could get to a tangible solution was: (.[^']&[^'])
It is not an easy task for a String#split, but is quite a feasible task for Matcher#find if you use
[^&\s=]+=(?:'[^']*'|[^\s&]*)
together with this Java code:
String text = "example=3&testing='f&tmp'";
Pattern p = Pattern.compile("[^&\\s=]+=(?:'[^']*'|[^\\s&]*)");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group());
}
System.out.println(res);
// => [example=3, testing='f&tmp']
Details
[^&\s=]+ - one or more chars other than &, = and whitespace
= - a = char
(?:'[^']*'|[^\s&]*) - a non-capturing group matching either ', zero or more chars other than ' and then a ', or zero or more chars other than whitespace and &.
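The same match-instead-of-split idea can be sketched in JavaScript, assuming the input only contains key=value pairs with optional single-quoted values. The variable names below are illustrative.

```javascript
// Same pattern as above, written as a JavaScript regex literal with the global flag.
var pairPattern = /[^&\s=]+=(?:'[^']*'|[^\s&]*)/g;

var text = "example=3&testing='f&tmp'";
// With the "g" flag, match() returns every match, or null if there are none.
var pairs = text.match(pairPattern) || [];
console.log(pairs); // [ "example=3", "testing='f&tmp'" ]
```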

Character class [:punct:] doesn't seem to work correctly

This is in the nodejs repl loop.
> let re = new RegExp('[:punct:]*lipsticks[:punct:]*', 'i');
/[:punct:]*lipsticks[:punct:]*/i
> 'LipsticksGuava'.replace(re, '')
'Guava'
> 'LipsticksNaked'.replace(re, '')
'aked'
What happened to the N?
Revised my experiment based on feedback.
> re = new RegExp('[:punct:]*lipsticks[:punct:]*', 'i');
/[:punct:]*lipsticks[:punct:]*/i
> 'LipsticksNaked'.replace(re, '')
'aked'
> re = new RegExp('[[:punct:]]*lipsticks[[:punct:]]*', 'i');
/[[:punct:]]*lipsticks[[:punct:]]*/i
> 'LipsticksNaked'.replace(re, '')
'LipsticksNaked'
>
JS flavor
The JavaScript flavor does not support POSIX character classes such as [:punct:].
What is going on?
The /[:punct:]*lipsticks[:punct:]*/i regex matches
[:punct:]* - (a plain character class, not a POSIX class) zero or more (due to *) characters from the set :, p, u, n, c or t, case insensitively (in your case it matches the empty string before LipsticksNaked)
lipsticks - the literal string lipsticks, case insensitively
[:punct:]* - (see the explanation above) this part matches N, since the letter is in the character class and matching is case insensitive.
What happens if we try to use a POSIX character class inside a bracket expression in JS, as in [[:punct:]]?
The [[:punct:]]* pattern is actually a sequence of 2 subpatterns:
[[:punct:] - a character class matching one of the [, :, p, u, n, c, t characters
]* - zero or more closing square brackets
Thus, the whole [[:punct:]]*lipsticks[[:punct:]]* pattern successfully matches :LipsticksN in :LipsticksNaked, but finds no match in LipsticksNaked, because one character from the class is required before lipsticks.
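The parsing described above can be checked with a short sketch. The variable names are illustrative, and the regexes are used without the u flag, as in the REPL session.

```javascript
// In JavaScript, [:punct:] is an ordinary character class containing : p u n c t.
var plain = /[:punct:]*lipsticks[:punct:]*/i;
console.log('LipsticksNaked'.replace(plain, ''));   // aked

// [[:punct:]]* is parsed as the class [[:punct:] followed by ]*.
var nested = /[[:punct:]]*lipsticks[[:punct:]]*/i;
console.log(nested.test('LipsticksNaked'));          // false
console.log(':LipsticksNaked'.match(nested)[0]);     // :LipsticksN
```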
Any solution?
To match punctuation you may use XRegExp \p{P}:
var str = ".;-LipstickNaked";
regex = XRegExp('\\p{P}*lipstick\\p{P}*', 'ig');
var replaced = XRegExp.replace(str, regex, "");
console.log(replaced);
// or if you cannot use XRegExp
var pP_block = "(?:[\\x21-\\x23\\x25-\\x2A\\x2C-\\x2F\\x3A\\x3B\\x3F\\x40\\x5B-\\x5D\\x5F\\x7B\\x7D\\xA1\\xA7\\xAB\\xB6\\xB7\\xBB\\xBF\\u037E\\u0387\\u055A-\\u055F\\u0589\\u058A\\u05BE\\u05C0\\u05C3\\u05C6\\u05F3\\u05F4\\u0609\\u060A\\u060C\\u060D\\u061B\\u061E\\u061F\\u066A-\\u066D\\u06D4\\u0700-\\u070D\\u07F7-\\u07F9\\u0830-\\u083E\\u085E\\u0964\\u0965\\u0970\\u0AF0\\u0DF4\\u0E4F\\u0E5A\\u0E5B\\u0F04-\\u0F12\\u0F14\\u0F3A-\\u0F3D\\u0F85\\u0FD0-\\u0FD4\\u0FD9\\u0FDA\\u104A-\\u104F\\u10FB\\u1360-\\u1368\\u1400\\u166D\\u166E\\u169B\\u169C\\u16EB-\\u16ED\\u1735\\u1736\\u17D4-\\u17D6\\u17D8-\\u17DA\\u1800-\\u180A\\u1944\\u1945\\u1A1E\\u1A1F\\u1AA0-\\u1AA6\\u1AA8-\\u1AAD\\u1B5A-\\u1B60\\u1BFC-\\u1BFF\\u1C3B-\\u1C3F\\u1C7E\\u1C7F\\u1CC0-\\u1CC7\\u1CD3\\u2010-\\u2027\\u2030-\\u2043\\u2045-\\u2051\\u2053-\\u205E\\u207D\\u207E\\u208D\\u208E\\u2308-\\u230B\\u2329\\u232A\\u2768-\\u2775\\u27C5\\u27C6\\u27E6-\\u27EF\\u2983-\\u2998\\u29D8-\\u29DB\\u29FC\\u29FD\\u2CF9-\\u2CFC\\u2CFE\\u2CFF\\u2D70\\u2E00-\\u2E2E\\u2E30-\\u2E42\\u3001-\\u3003\\u3008-\\u3011\\u3014-\\u301F\\u3030\\u303D\\u30A0\\u30FB\\uA4FE\\uA4FF\\uA60D-\\uA60F\\uA673\\uA67E\\uA6F2-\\uA6F7\\uA874-\\uA877\\uA8CE\\uA8CF\\uA8F8-\\uA8FA\\uA8FC\\uA92E\\uA92F\\uA95F\\uA9C1-\\uA9CD\\uA9DE\\uA9DF\\uAA5C-\\uAA5F\\uAADE\\uAADF\\uAAF0\\uAAF1\\uABEB\\uFD3E\\uFD3F\\uFE10-\\uFE19\\uFE30-\\uFE52\\uFE54-\\uFE61\\uFE63\\uFE68\\uFE6A\\uFE6B\\uFF01-\\uFF03\\uFF05-\\uFF0A\\uFF0C-\\uFF0F\\uFF1A\\uFF1B\\uFF1F\\uFF20\\uFF3B-\\uFF3D\\uFF3F\\uFF5B\\uFF5D\\uFF5F-\\uFF65]|\\uD802[\\uDC57\\uDD1F\\uDD3F\\uDE50-\\uDE58\\uDE7F\\uDEF0-\\uDEF6\\uDF39-\\uDF3F\\uDF99-\\uDF9C]|\\uD809[\\uDC70-\\uDC74]|\\uD805[\\uDCC6\\uDDC1-\\uDDD7\\uDE41-\\uDE43\\uDF3C-\\uDF3E]|\\uD836[\\uDE87-\\uDE8B]|\\uD801\\uDD6F|\\uD82F\\uDC9F|\\uD804[\\uDC47-\\uDC4D\\uDCBB\\uDCBC\\uDCBE-\\uDCC1\\uDD40-\\uDD43\\uDD74\\uDD75\\uDDC5-\\uDDC9\\uDDCD\\uDDDB\\uDDDD-\\uDDDF\\uDE38-\\uDE3D\\uDEA9]|\\uD800[\\uDD00-\\uDD02\\uDF9F\\uDFD0]|\\uD81A[\\uDE6E\\uDE6F\\uDEF5\\uDF37-\\uDF3B\\uDF44])*";
var re2 = RegExp(pP_block + "lipstick" + pP_block, "gi");
console.log(str.replace(re2, ""));
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/2.0.0/xregexp-all-min.js"></script>
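As a further option, a minimal sketch that assumes an engine with ES2018 Unicode property escapes (the u flag is required for \p{P}). This is an alternative to the XRegExp call above, not part of the original answer.

```javascript
// Assumes an ES2018+ engine: \p{P} (any Unicode punctuation) only works with the "u" flag.
var str = ".;-LipstickNaked";
var re = /\p{P}*lipstick\p{P}*/giu;

var replaced = str.replace(re, "");
console.log(replaced); // Naked
```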
POSIX flavor
You need to use a POSIX character class [:punct:] in a bracket expression as [[:punct:]], otherwise [:punct:] works as a bracket expression and matches either a colon, or p, or u, or n (that is why it is removed since the case insensitive matching is enabled with the i modifier), or c or t.

Groovy complaining about illegal character range in regex

Groovy 2.4 here. I am trying to build a regex that will filter out all the following characters:
`,./;[]-&<>?:"()|
Here's my best attempt:
static void main(String[] args) {
// `,./;[]-&<>?:"()|
String regex = "`,./;[]-&<>?:\"()|"
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
}
However this produces a compile error on the regex string definition, complaining:
illegal character range (to < from)
Not sure if this is a Java or Groovy thing, but I can't figure out how to define the regex properly so that it quiets the error and correctly strips these "illegal characters" out of my string. Any ideas?
It seems to me you want to remove all the characters listed in your regex variable. The problem is that you declared a sequence while you need a character class (enclose the characters with []).
See Groovy demo:
String regex = "[`,./;\\[\\]&<>?:\"()|-]+"
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
Output: ooekrofkrofor oxkeoe wdkeodeko kodek woekoedk swjiej ' wsjwdjeiji
The pattern now contains a character class matching any of the characters defined inside it - [`,./;\[\]&<>?:\"()|-] - one or more times due to the + quantifier. Note that inside the character class, ] and [ must always be escaped, and the - can be left unescaped when placed at the start/end of the character class.
You need to escape a few special characters in your pattern, including both [ and ] inside the character class:
String regex = "[`,./;\\[\\]\\-&<>?:\"\\(\\)|]+"
Note that each double \\ in the string literal becomes a single \ in the pattern, so when the pattern is parsed, the character that follows is escaped.
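For comparison, the character-class approach from the answers above can be sketched in JavaScript. The variable names are illustrative, and the regex-literal syntax differs from a Groovy string (for example, / must be escaped and no double backslashes are needed).

```javascript
// Character class listing the characters from the question; "-" is placed last so it is literal.
var unwanted = /[`,.\/;\[\]&<>?:"()|-]+/g;

var test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :";
var cleaned = test.replace(unwanted, "");
console.log(cleaned); // the listed characters are gone; the single quote is kept
```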

How to match a string with an opening brace { in C++ regex

I have a question about writing regexes in C++. I have two regexes that work fine in Java, but in C++ they throw the following error:
one of * + was not preceded by a valid regular expression C++
These regexes are as follows:
regex r1("^[\s]*{[\s]*\n"); //Space followed by '{' then followed by spaces and '\n'
regex r2("^[\s]*{[\s]*\/\/.*\n") // Space followed by '{' then by '//' and '\n'
Can someone help me fix this error or rewrite these regexes for C++?
See basic_regex reference:
By default, regex patterns follow the ECMAScript syntax.
ECMAScript syntax reference states:
characters:
\character
description: character
matches: the character character as it is, without interpreting its special meaning within a regex expression.
Any character can be escaped except those which form any of the special character sequences above.
Needed for: ^ $ \ . * + ? ( ) [ ] { } |
So, you need to escape { to get the code working:
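#include <iostream>
#include <regex>
#include <string>
int main() {
    std::string s("\r\n { \r\nSome text here");
    std::regex r1(R"(^\s*\{\s*\n)");      // whitespace, '{', whitespace, newline
    std::regex r2(R"(^\s*\{\s*//.*\n)");  // whitespace, '{', whitespace, a // comment, newline
    std::string newtext = std::regex_replace( s, r1, "" );
    std::cout << newtext << std::endl;
}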
Also, note how the R"(pattern_here_with_single_escaping_backslashes)" raw string literal syntax simplifies a regex declaration.
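Since std::regex follows the ECMAScript grammar by default, the escaped pattern can also be tried in JavaScript. This is only a rough cross-check, because the two implementations are not identical: a JavaScript regex without the u flag tolerates a lone {, so the sketch uses the u flag to show the stricter behaviour.

```javascript
var s = "\r\n { \r\nSome text here";

// Escaped brace: matches leading whitespace, "{", trailing whitespace and a newline.
var r1 = /^\s*\{\s*\n/;
console.log(JSON.stringify(s.replace(r1, ""))); // "Some text here"

// In Unicode mode ("u" flag) JavaScript rejects an unescaped "{" as well.
var rejected = false;
try {
  new RegExp("^\\s*{\\s*\\n", "u");
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected); // true
```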

regular expressions, delimiting plus sign

Private Const SEPARATOR_REG_EXP1 As String = "SCD\+4\+[A-Z]\+"
Public Function TestReg() As Boolean
Dim s1 As String = "SCD+4+ADJUSTMENT+"
Dim match As Match = Regex.Match(s1, SEPARATOR_REG_EXP1)
If match.Success Then
Return True
Else : Return False
End If
End Function
Not sure why this does not match - haven't really used regular expressions much.
The regex pattern should be:
"SCD\+4\+[A-Z]+\+"
You have to add a + quantifier after [A-Z], because you want to match one or more of these [A-Z] characters.
This does not match because [A-Z] matches only a single character of the given character class. You can use the + quantifier to match multiple characters. The resulting regex would be
SCD\+4\+[A-Z]+\+
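The difference between the two patterns can be checked with a short JavaScript sketch (the variable names are illustrative; the pattern syntax used here is the same in .NET):

```javascript
var s1 = "SCD+4+ADJUSTMENT+";

// Original pattern: [A-Z] allows exactly one letter between the plus signs.
var original = /SCD\+4\+[A-Z]\+/;
// Corrected pattern: [A-Z]+ allows one or more letters.
var corrected = /SCD\+4\+[A-Z]+\+/;

console.log(original.test(s1));  // false
console.log(corrected.test(s1)); // true
```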