Is the syntax for writing regular expressions standardized?

Is the syntax for writing regular expressions standardized? That is, if I write a regular expression in C++, will it work in Python or JavaScript without any modifications?

No, there are several dialects of Regular Expressions.
They generally have many elements in common.
Some popular ones are listed and compared here.

Simple regular expressions, mostly yes. However, across the spectrum of programming languages, there are differences.

No, here are some differences that come to mind:
JavaScript lets you write inline regex literals (where the \ in \s need not be escaped as \\s) that are delimited by the / character. You can specify flags after the closing /. JS also has a RegExp constructor that takes the escaped string as the first argument and an optional flag string as the second argument.
/^\w+$/i and new RegExp("^\\w+$", "i") are valid and the same.
In PHP, you can enclose the regex string inside a delimiter of your choice (not sure of the full set of characters that can be used as delimiters, though). Again, you should escape backslashes here.
"|[0-9]+|" is the same as "#[0-9]+#"
Python and C# support raw (verbatim) strings (not limited to regex, but really helpful for writing regex) that let you write unescaped backslashes in your regex.
"\\d+\\s+\\w+" can be written as r'\d+\s+\w+' in Python and @"\d+\s+\w+" in C#
Anchors like \<, \A etc. are not universally supported.
JavaScript engines older than ES2018 don't support lookbehind or the DOTALL (s) flag.
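To illustrate the point about raw strings, here is a minimal Python sketch. The sample text "42 apples" is made up for the example; the two patterns are the ones quoted above.

```python
import re

# The same pattern written twice: once with doubled backslashes in an
# ordinary string literal, once as a raw string.
escaped = "\\d+\\s+\\w+"
raw = r"\d+\s+\w+"

# Both literals produce exactly the same pattern text.
same_text = escaped == raw

# Either form can be handed to the regex engine.
match = re.fullmatch(raw, "42 apples")
print(same_text, bool(match))
```

The regex engine never sees the difference; only the way the string literal is written in the source code changes.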

Related

Regex character interval with exception

Say I have an interval with characters ['A'-'Z'], I want to match every of these characters except the letter 'F' and I need to do it through the ^ operator. Thus, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression of a string literal in ocamllex that starts/ends with a doublequote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the doublequotes because it obviously ends the string. (I am not considering escaped characters for the moment)
Since it is very rare to find two regular expressions libraries / processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
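As a rough check of the two lookahead forms, here is a small Python sketch (Python's re module supports lookahead). The sample string is an assumption chosen for illustration.

```python
import re

text = "AFZf"

# Negative lookahead: an uppercase letter that is not F.
negative = re.findall(r"(?!F)[A-Z]", text)

# Positive lookahead plus negated class: not F, but must be in A-Z.
positive = re.findall(r"(?=[A-Z])[^F]", text)

print(negative, positive)
```

Both calls should pick out the same characters, since each form requires "in A-Z" and "not F" at the same position.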
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'
Use character class subtraction (Java and Ruby syntax):
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
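Python's re module has no && operator as far as I know, so the sketch below swaps in a negative lookahead to express the same "A-Z minus vowels" idea and compares it with the spelled-out ranges. It is an illustration in a different engine, not the Java syntax shown above.

```python
import re
import string

# Lookahead stand-in for the intersection form: any A-Z that is not a vowel.
by_exclusion = re.compile(r"(?![AEIOU])[A-Z]")

# The spelled-out ranges from the text.
by_ranges = re.compile(r"[B-DF-HJ-NP-TV-Z]")

letters = string.ascii_uppercase
consonants_a = [ch for ch in letters if by_exclusion.fullmatch(ch)]
consonants_b = [ch for ch in letters if by_ranges.fullmatch(ch)]

print(consonants_a == consonants_b, len(consonants_a))
```

Checking the two against every uppercase letter is a cheap way to confirm that a hand-written range list has not dropped or added a letter.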
The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resembles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.

The Different Delimiters of Regex

When I look up regular expressions for various purposes, I see people using delimiters like /, #, !, and ~. Do these do anything different, or do they have the same effect?
They don't do anything different, they delimit the regular expression (in languages where it is needed).
The difference is: the behaviour of that character inside the regex does change. The regex delimiter becomes an additional special character and needs to be escaped (==> choose a delimiter that you don't need within the regex!).
Side note: in PHP you can even use a regex special character like + or | as the regex delimiter, but this works only when you don't need that character inside the regex (NOT recommended).
In some languages you can choose the delimiters, in others you can't.
You must escape that delimiter every time it appears in the regular expression. Choosing a delimiter that does not occur in the expression reduces the need for escaping, making the expression easier to read.
The following two regular expressions are identical, except that the first uses / as a delimiter, whereas the second uses #:
/http:\/\/example\.com\/.*\/foo\//
#http://example\.com/.*/foo/#
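For comparison, in a language without regex delimiters the question does not arise at all. A minimal Python sketch of the same pattern follows; the sample URL is made up for the example.

```python
import re

# No delimiters in Python, so the slashes need no escaping.
pattern = re.compile(r"http://example\.com/.*/foo/")

url = "http://example.com/a/b/foo/"
found = pattern.fullmatch(url)
print(bool(found))
```

Only the dot still needs a backslash, because it is special to the regex engine itself rather than to any delimiter.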

Seeking quoted string generator

Does anyone know of a free website or program which will take a string and render it quoted with the appropriate escape characters?
Let's say, for instance, that I want to quote
It's often said that "devs don't know how to quote nested "quoted strings"".
And I would like to specify whether that gets enclosed in single or double quotes. I don't personally care for any escape character other than backslash, but others might.
If none of the double quotes of the string is already escaped, you can simply do:
str = str.replace(/"/g, "\\\"");
Otherwise, you should check whether each quote is already escaped and replace it only if it isn't. You can use lookbehind for that. The following is what came to my mind first, but it fails when an escaped backslash is followed by a quote, as in \\" :(
str = str.replace(/(?<!\\)"/g, "\\\"");
The following skips a quote only when it is preceded by a single backslash that is not itself preceded by another backslash.
str = str.replace(/(?<!(^|[^\\])\\)"/g, "\\\"");
Update: Just remembered that JavaScript didn't support lookbehind until ES2018; on older engines, you can use the same regex on an engine that supports lookbehind, like Perl, PHP, .NET etc.
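One way to sidestep lookbehind entirely is to consume backslash pairs as a unit and escape only the quotes left over. Below is a minimal Python sketch of that idea; the function name escape_unescaped_quotes is hypothetical, and this is a swapped-in technique rather than the lookbehind approach above.

```python
import re

def escape_unescaped_quotes(text):
    """Add a backslash before each double quote that is not already escaped."""
    def fix(match):
        token = match.group(0)
        # A bare quote gets escaped; a backslash pair is passed through as-is.
        return '\\"' if token == '"' else token

    # Alternation order matters: a backslash plus the character it escapes
    # is consumed first, so an escaped quote is never seen as a bare quote.
    return re.sub(r'\\.|"', fix, text, flags=re.DOTALL)

plain = escape_unescaped_quotes('say "hi"')
already = escape_unescaped_quotes('say \\"hi\\"')
tricky = escape_unescaped_quotes('end\\\\"')
print(plain, already, tricky)
```

Because an escaped backslash is swallowed as one token, the quote that follows it is correctly treated as unescaped, which is the case the first lookbehind attempt gets wrong.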
Any decent regex library in any decent programming language will have a function to do this - not that it's hard to write one yourself (as the other answers have indicated). So having a separate website or program to do it would be mostly useless.
Perl has the quotemeta function
PCRE's C++ wrapper has a function RE::QuoteMeta (warning: giant file at that link) which does the same thing
PHP has preg_quote if you're using Perl-compatible regexes
Python's re module has an escape function
In Java, the java.util.regex.Pattern class has a quote method
Perl and most of the other regular expression engines based on Perl have metacharacters \Q...\E, meaning that whatever comes between \Q and \E is interpreted literally
Most tools that use POSIX regular expressions (e.g. grep) have an option that makes them interpret their input as a literal string (e.g. grep -F)
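As a quick illustration of one of these, here is a Python sketch using re.escape. The sample string is an assumption chosen because it contains several metacharacters.

```python
import re

literal = "1+1=2 (really?)"

# Used directly as a pattern, the metacharacters change the meaning,
# so the string does not even match itself.
unescaped = re.fullmatch(literal, literal)

# re.escape backslash-escapes the metacharacters so the text matches literally.
pattern = re.escape(literal)
escaped = re.search(pattern, "so 1+1=2 (really?) then")

print(unescaped is None, escaped is not None)
```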
In Python, for enclosing in single quotes:
import re
mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print("""'%s'""" % re.sub("'", r"\'", mystr))
Output:
'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
You could easily adapt this into a more general form, and/or wrap it in a script for command-line invocation.
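A more general form might look like the sketch below. The function name quote and its quote_char parameter are hypothetical, and it assumes backslash is the only escape character, as the question asks.

```python
def quote(text, quote_char="'"):
    """Wrap text in the given quote character, backslash-escaping as needed."""
    if quote_char not in ("'", '"'):
        raise ValueError("quote_char must be a single or double quote")
    # Escape existing backslashes first so they are not confused with
    # the backslashes added in front of the quote characters.
    escaped = text.replace("\\", "\\\\").replace(quote_char, "\\" + quote_char)
    return quote_char + escaped + quote_char

single = quote("It's here", "'")
double = quote('say "hi"', '"')
print(single)
print(double)
```

Plain str.replace is enough here; no regex is needed because every occurrence is escaped unconditionally.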
So, I guess the answer is "no". Sorry, guys, but I didn't learn anything that I didn't already know. Probably my fault for not phrasing the question correctly.
+1 for everyone who posted

How can I convert a Perl regex to work with Boost::Regex?

What is the Boost::Regex equivalent of this Perl regex for words that end with ing or ed or en?
/ing$|ed$|en$/
...
The most important difference is that regexps in C++ are strings, so all regexp-specific backslash sequences (such as \w and \d) should be double-escaped ("\\w" and "\\d").
/^[\.:\,()\'\`-]/
should become
"^[.:,()'`-]"
The special Perl regex delimiter / doesn't exist in C++, so regexes are just strings. In those strings, you need to take care to escape backslashes correctly (\\ for every \ in your original regex). In your example, though, all those backslashes were unnecessary, so I dropped them completely.
There are other caveats; some Perl features (like variable-length lookbehind) don't exist in the Boost library, as far as I know. So it might not be possible to simply translate any regex. Your examples should be fine, though. Although some of them are weird. .*[0-9].* will match any string that contains a number somewhere, not all numbers.
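The same "drop the delimiters and pass a plain string" step can be sketched in Python, which also has no regex literals. This is not Boost code, just an illustration of the translation; the word list is made up for the example.

```python
import re

# The Perl regex /ing$|ed$|en$/ with the / delimiters dropped.
ending = re.compile(r"ing$|ed$|en$")

words = ["walking", "walked", "taken", "walks", "singer"]
matched = [word for word in words if ending.search(word)]
print(matched)
```

This pattern contains no backslashes, so nothing needs doubling; a pattern with \w or \d would need a raw string here, or "\\w" and "\\d" in an ordinary C++ string literal.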

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as @, #, and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
@"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
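In a language with no regex literals at all, the equivalent is a function call that takes the pattern as a string. A minimal Python sketch follows; the sample text is made up for the example.

```python
import re

s = "Foo, foo and FOO"

# re.sub replaces every match by default; flags are passed as an argument
# instead of being written after a closing delimiter.
all_replaced = re.sub(r"foo", "bar", s, flags=re.IGNORECASE)

# count=1 limits the substitution to the first match.
first_only = re.sub(r"foo", "bar", s, count=1, flags=re.IGNORECASE)

print(all_replaced)
print(first_only)
```

Note the difference in defaults: the JavaScript call above replaces only the first match unless the g flag is given, while Python replaces all matches unless count is set.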
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)