Why can't Regular Expressions use keywords instead of characters? - regex

Okay, I barely understand RegEx basics, but why couldn't they design it to use keywords (like SQL) instead of some cryptic wildcard characters and symbols?
Is it for performance since the RegEx is interpreted/parsed at runtime? (not compiled)
Or maybe for speed of writing? Considering that when you learn some "simple" character combinations it becomes easier to type 1 character instead of a keyword?

You really want this?
Pattern findGamesPattern = Pattern.With.Literal(@"<div")
.WhiteSpace.Repeat.ZeroOrMore
.Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
.NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
.Literal(@"-game""")
.NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
.Literal(@"<!--gameStatus")
.WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
.NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
.Literal("-->");
Ok, but it's your funeral, man.
Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx

Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example
every character is a regular expression, representing itself
if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
...
Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as search pattern:
grep -R 'main' *.c
Or maybe very simple patterns:
grep -c ':-[)(]' seidl.txt
Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.

Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form:
100 E Main St Springfield MA 01234
Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):
m/
([1-9]\d*)\s+
((?:N|S|E|W)\s+)?
(\w+(?:\s+\w+)*)\s+
(ave|ln|st|rd)\s+
([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
([A-Z]{2})\s+
(\d{5}(?:-\d{4})?)
/ix;
This Perl 6 regex has the same behavior:
grammar USMailAddress {
rule TOP { <addr> <city> <state> <zip> }
rule addr { <[1..9]>\d* <direction>?
<streetname> <streettype> }
token direction { N | S | E | W }
token streetname { \w+ [ \s+ \w+ ]* }
token streettype {:i ave | ln | rd | st }
token city { <alpha> [ \s+ <alpha> ]* }
token state { <[A..Z]>**{2} }
token zip { \d**{5} [ - \d**{4} ]? }
}
A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:
if $addr ~~ m/^<USMailAddress::TOP>$/ {
say "$<city>, $<state>";
}
This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.

Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?
Source text
Company: A Dept.: B
Standard regex:
Company:\s+(.+)\s+Dept.:\s+(.+)
Or even:
Company: (.+) Dept.: (.+)
Keyword regex (trying really hard not to get a strawman...)
"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)
Or simplified:
"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)
No, it's probably not better.

Because it corresponds to formal language theory and its mathematical notation.

It's Perl's fault...!
Actually, more specifically, Regular Expressions come from early Unix development, and concise syntax was a lot more highly valued then. Storage, processing time, physical terminals, etc were all very limited, rather unlike today.
The history of Regular Expressions on Wikipedia explains more.
There are alternatives to Regex, but I'm not sure any have really caught on.
EDIT: Corrected by John Saunders: Regular Expressions were popularised by Unix, but first implemented by the QED editor. The same design constraints applied, even more so, to earlier systems.

Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll see that
In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions

This is much earlier than Perl. The Wikipedia entry on Regular Expressions attributes the first implementations of regular expressions to Ken Thompson of Unix fame, who implemented them in QED and then in the ed editor. I guess that the commands had short names for performance reasons, though that was long before regexes ran client-side. Mastering Regular Expressions is a great book on the subject; it describes the option of annotating a regular expression (with the /x flag) to make it easier to read and understand.
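To illustrate the annotated style, here is a minimal sketch in Python, whose re.VERBOSE flag plays roughly the role of Perl's /x. The ZIP code pattern is borrowed from the address example earlier in this thread; the name ZIP_CODE is made up for the example.

```python
import re

# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
ZIP_CODE = re.compile(
    r"""
    \d{5}            # five-digit ZIP code
    (?: - \d{4} )?   # optional four-digit extension
    """,
    re.VERBOSE,
)

for candidate in ("01234", "01234-5678", "0123"):
    print(candidate, bool(ZIP_CODE.fullmatch(candidate)))
```

The pattern is the same as the one-liner \d{5}(?:-\d{4})?, just spread out so each part can carry its own comment.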

Because the idea of regular expressions--like many things that originate from UNIX--is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax it wouldn't be a regex, it'd be a program.

It's actually pretty easy to implement a "wordier" form of regex -- please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).
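A rough sketch of that idea in Python, assuming nothing more than the standard re module. The helper names (literal, one_or_more, named_group) are invented for this example and are not taken from the linked answer or from any library.

```python
import re

# Hypothetical helpers: each one just returns a regex fragment as a string.
def literal(text):
    return re.escape(text)

def one_or_more(fragment):
    return f"(?:{fragment})+"

def named_group(name, fragment):
    return f"(?P<{name}>{fragment})"

DIGIT = r"\d"

# Wordy equivalent of: id="(?P<gameId>\d+)-game"
pattern = (
    literal('id="')
    + named_group("gameId", one_or_more(DIGIT))
    + literal('-game"')
)

match = re.search(pattern, '<div class="game" id="42-game">')
print(match.group("gameId"))
```

Since the helpers only build strings, the result is still an ordinary regex and can be passed to any function that accepts one.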

I don't think keywords would give any benefit. Regular expressions as such are complex but also very powerful.
What I think is more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements and many more examples).

I know it's answering your question the wrong way around, but RegexBuddy has a feature that explains your regular expression in plain English. This might make it a bit easier to learn.

If the language you are using supports POSIX regexes, you can use their named character classes.
An example:
\d
would be the same as
[[:digit:]]
The bracket notation is much clearer about what it is matching. I would still learn the "cryptic wildcard characters and symbols", since you will still see them in other people's code and need to understand them.
There are more examples in the table on regular-expressions.info's page.

For some reason, my previous answer got deleted. Anyway, I think the ruby regexp machine would fit the bill, at http://www.rubyregexp.sf.net. It is my own project, but I think it should work.

Related

Regex in C++ for matching some patterns

I want a regex for this.
add x2, x1, x0 is a valid instruction;
I want to implement this, but I am a bit confused about how to do it, as I am a newbie at using regex. Can anyone share a suitable regex?
If this is a longer project and will have more requirements later, then definitely a different approach would be better.
The standard approach to solving such a problem is to define a grammar and then create a lexer and a parser. The tools lex/yacc or flex/bison can be used for that. Or a simple shift/reduce parser can be hand-crafted.
The language that you sketched with the given grammar may indeed be specified with a Chomsky type-3 grammar, and can hence be produced by a regular grammar and, with that, parsed with regular expressions.
The specification is a little bit unclear as to what a register is and whether there are more keywords. Especially ecall is unclear.
But how to build such a regex?
You will define small tokens and concatenate them. And different paths can be implemented with the or operator |.
Let's give some examples.
A register may be matched with a\d+: an "a" followed by some digits. If it is not only "a" but other letters as well, you could use [a-z]\d+
Op codes with the same number of parameters can be listed with a simple or |, as in add|sub
For spaces there are many solutions. You may use \s+ or [ ]+ or whatever spaces you need.
To build one rule, you can concatenate what you learned so far
Having different parts needs an or | for the complete path
If you want to get back the matched groups, you must enclose the needed parts in parentheses
And with that, one of many many possible solutions can be:
^[ ]*((add|sub)[ ]+(a\d+)[ ]*,[ ]*(a\d+)[ ]*,[ ]*(a\d+)|(ecall))[ ]*$
See example in: regex101
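The same building-block approach can be sketched in Python (the question is about C++, but the pattern itself carries over). This is only an illustration under assumptions: registers are taken to be one lowercase letter plus digits, as in x2 from the question, and the names REGISTER, OPCODE and parse are made up here.

```python
import re

SPACES = r"[ ]*"
REGISTER = r"([a-z]\d+)"   # assumption: one letter followed by digits, e.g. x2
OPCODE = r"(add|sub)"      # op codes that take three registers
COMMA = SPACES + "," + SPACES

# Concatenate the small tokens into one rule, then "or" it with ecall.
THREE_REGISTERS = OPCODE + r"[ ]+" + REGISTER + COMMA + REGISTER + COMMA + REGISTER
INSTRUCTION = re.compile(
    "^" + SPACES + "(?:" + THREE_REGISTERS + "|(ecall))" + SPACES + "$"
)

def parse(line):
    """Return the matched parts as a tuple, or None for an invalid line."""
    m = INSTRUCTION.match(line)
    if m is None:
        return None
    if m.group(5):                 # group 5 is the ecall alternative
        return ("ecall",)
    return m.group(1, 2, 3, 4)     # op code and its three registers

print(parse("add x2, x1, x0"))
print(parse("ecall"))
print(parse("add x2, x1"))
```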

greedy operator in regular expression is not working in Tcl 8.5

See this simple regexp code:
puts [ regexp -inline {^\-\-\S+?=\S+} "--tox=9.0" ]
The output is:
>--tox=9
It would seem that the second \S+ is being non-greedy! Only 1 character is being matched
In Perl, one can see that the result is as I expected; see the one-line output:
perl -e '"--tox=9.0" =~/(^\-\-\S+?=\S+)/ ; print "${1}\n"'
--tox=9.0
How can I get the Perl behaviour in Tcl?
This is an inherent 'feature' of Tcl's regexp implementation. For instance, the below is from Henry Spencer (the one who did most if not all of Tcl's regexp work I believe)
It is very difficult to come up with an entirely satisfactory
definition of the behavior of mixed-greediness regular expressions.
Perl doesn't try: the Perl "specification" is a description of the
implementation, an inherently low-performance approach involving
trying one match at a time. This is unsatisfactory for a number of
reasons, not least being that it takes several pages of text merely to
describe it. (That implementation and its description are distant,
mutated descendants of one of my earlier regexp packages, so I share
some of the blame for this.)
When all quantifiers are greedy, the Tcl 8.2 regexp matches the
longest possible match (as specified in the POSIX standard's
regular-expression definition). When all are non-greedy, it matches
the shortest possible match. Neither of these desirable statements is
true of Perl.
The trouble is that it is very, very hard to write a generalization of
those statements which covers mixed-greediness regular expressions --
a proper, implementation-independent definition of what
mixed-greediness regular expressions should match -- and makes them
do "what people expect". I've tried. I'm still trying. No luck so
far.
The rules in the Tcl 8.2 regexp, which basically give the whole regexp
a long/short preference based on its subexpressions, are the best I've
come up with so far. The code implements them accurately. I agree
that they fall short of what's really wanted. It's trickier than it
looks.
Basically, expressions with mixed greedy and non-greedy quantifiers impact both the simplicity of the implementation and its performance. So, the implementation makes it so that the first 'type' of quantifier is passed on to all other quantifiers.
In other words, if the first quantifier is greedy, all the others will be greedy. If the first is non-greedy, all the others will be non-greedy. And therefore, you cannot force a Tcl regexp to work like a Perl regexp (or maybe you can through exec and using the bash command version of perl, but I'm not familiar with this).
I would advise using negated classes and/or anchors instead of non-greedy.
Since I don't know the exact context of your question, I won't provide an alternative regexp, because that will depend on whether this is really the whole string you are trying to make a match on.
The Tcl regular expression engine is an automata-theoretic one instead of a stack-based one, so it has a very different approach to matching mixed greediness REs. In particular, for the sort of RE you're talking about, that will be interpreted as entirely non-greedy.
The simplest method of fixing this is to use a different RE. Remembering that \S is just a shorthand for [^\s], we can do this (excluding = from the first part):
puts [ regexp -inline {^--[^\s=]+=\S+} "--tox=9.0" ]
(I also changed \- to - as it's not a special character in Tcl's REs.)
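Because the rewritten pattern no longer mixes greedy and non-greedy quantifiers, it should not depend on how an engine resolves that mix. As a quick cross-check, here are both patterns in Python, used only as a stand-in for a Perl-style backtracking engine; this says nothing about what Tcl itself prints.

```python
import re

arg = "--tox=9.0"

# Original pattern: lazy \S+? followed by greedy \S+
mixed = re.match(r"^--\S+?=\S+", arg)

# Rewritten pattern: the negated class stops at "=" without needing laziness
negated = re.match(r"^--[^\s=]+=\S+", arg)

print(mixed.group(0))
print(negated.group(0))
```

In a backtracking engine both calls return the full argument, which is the behaviour the question was asking for.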
The answer can be found here:
Unfortunately, the answer is that to get the same answer Perl gives,
you have to use Perl's exact regexp implementation.
In your case, I'd use both anchors, ^ and $:
puts [ regexp -inline {^\-\-\S+?=\S+$} "--tox=9.0" ]
The result is: --tox=9.0

Negation of a regular expression

I am not sure what it is called: negation, complement or inversion. The concept is this. For example, given the alphabet "ab":
R = 'a'
!R = the regexp that matches everything except what R matches
In this simple example it should be something like
!R = 'b*|[ab][ab]+'
What is such a regexp called? I remember from my studies that there is a way to calculate it, but it is complicated and generally too hard to do by hand. Is there a nice online tool (or regular software) to do that?
jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
To complement a regex:
Convert into the equivalent NFA. This is a well-known and defined process.
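Convert the NFA to a DFA via the powerset construction.
Complement the DFA by making accept states not accept and vice versa. The DFA must be complete for this to work, i.e. every state needs a transition for every symbol, including a dead state if necessary.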
Convert the DFA to a regular expression.
You now have the complement of the original regular expression!
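A small sketch of the complement step only, for the R = 'a' example from the question. The DFA is written out by hand rather than derived from an NFA, and the conversion back to a regular expression is not shown; the state numbering and function names are made up for this example.

```python
import re
from itertools import product

ALPHABET = "ab"

# Complete DFA for R = 'a': 0 = start, 1 = read exactly "a", 2 = dead state.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 2, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
STATES = {0, 1, 2}
ACCEPTING = {1}

# Complementing swaps accepting and non-accepting states.
COMPLEMENT_ACCEPTING = STATES - ACCEPTING

def accepts(word, accepting):
    """Run the DFA on word and report whether it ends in an accepting state."""
    state = 0
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in accepting

# Compare with the hand-written !R from the question on all words up to length 4.
hand_written = re.compile(r"b*|[ab][ab]+")
words = ["".join(w) for n in range(5) for w in product(ALPHABET, repeat=n)]
agree = all(
    accepts(w, COMPLEMENT_ACCEPTING) == bool(hand_written.fullmatch(w))
    for w in words
)
print(agree)
```

The dead state 2 is what makes the swap valid: without it, words that fall off the automaton would be rejected both before and after complementing.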
If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that many regex dialects support is to use "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\@!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. A lookahead consumes no characters, so if you have the text This doesn't match 'mystring'. and search for (?!mystring), it will produce an empty match at every position except the one just before the 'm' in mystring.
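The caveat is easy to see in Python. The second half of this sketch shows a common idiom that is not part of the answer above: put the lookahead at the start and let .* consume the text, so that the whole string is rejected when it contains the unwanted word.

```python
import re

text = "This doesn't match 'mystring'."

# A bare negative lookahead consumes nothing, so every hit is an empty match.
positions = [m.start() for m in re.finditer(r"(?!mystring)", text)]
blocked = text.index("mystring")
print(blocked in positions)   # the position before "mystring" is the only one skipped

# Idiom (an addition, not from the answer): reject strings containing the word.
no_mystring = re.compile(r"(?!.*mystring).*")
print(no_mystring.fullmatch(text) is None)
print(no_mystring.fullmatch("nothing to see here") is not None)
```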

Transform regex with character classes and repetitions to its most basic ASCII form

Is there a way, a regular expression maybe or even a library, to transform a regular expression with character classes and repetition to its most basic ASCII form?
For example I'd like to have the following conversions:
\d -> [0-9]
\w -> [A-Za-z0-9_]
\s -> [ \t\r\n\v\f]
\d{2} -> [0-9][0-9]
\d{3,} -> [0-9][0-9][0-9]+
\d{,3} -> I don't even know how to show this...
There is a commercial product called RegexBuddy that lets you enter a regex in their syntax and then generate the version for any of a number of popular systems. There may be something similar out there for free, or you could write your own.
At its most basic, a regular expression syntax only needs two things: alternation (OR) and closure (STAR). Well, and grouping. OK, three things. Other common operators are just shortcuts, really:
x+ = xx*
x? = (|x)
[xyz] = (x|y|z)
etc.
Things like \d just map to character classes and then to alternations. Negated character classes and . map to very big alternations. :)
There are some features that don't translate, however, such as lookaround. Mapping those to something that works without the feature is not readily automatable; it will depend upon the particular circumstances motivating their use.
First, you'd have to define which transformations you want to do. As written in the comments, not all advanced features can be written in terms of simpler operators. For example, the lookaround operators have no substitute. So you're limited by the target regexp parser anyway.
Then, with this list of transformations, you should simply apply them. They can probably be written as regexps themselves, but it might be easier to write a script in Python or so to actually parse (but not evaluate) the regexp. Then it can write it back with the requested transformations applied. And bark at you if you've used too complex features.
This wouldn't be too hard, but I'm not so sure if it would be very useful either. If you need powerful regexps, use a better regexp engine. It should be easy to write a simple Python or Perl script instead of a simple Awk script, for example.
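To give an idea of what such a script could look like, here is a minimal sketch in Python covering only the conversions listed in the question. It assumes a very small regex subset: quantified groups, shorthands inside bracket classes and lookaround are not handled, and the names SHORTHAND, TOKEN and expand are made up for this example. It reads \d{,3} as "zero to three", which is how Python's re treats it; other engines may differ.

```python
import re

# Shorthand classes and the explicit forms given in the question.
SHORTHAND = {r"\d": "[0-9]", r"\w": "[A-Za-z0-9_]", r"\s": r"[ \t\r\n\v\f]"}

# One atom (escape, bracket class or single character), optionally followed by {m,n}.
TOKEN = re.compile(r"(\\.|\[[^\]]*\]|[^{}])(?:\{(\d*)(,?)(\d*)\})?")

def expand(pattern):
    """Rewrite shorthands and counted repetitions using only basic operators."""
    out = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        atom, low, comma, high = m.groups()
        if low is not None and atom == ")":
            raise ValueError("quantified groups are not supported")
        atom = SHORTHAND.get(atom, atom)
        if low is None:                 # no {...} quantifier
            out.append(atom)
        elif not comma:                 # {n}: exactly n copies ({} raises ValueError)
            out.append(atom * int(low))
        elif high == "":                # {n,}: n or more
            n = int(low or 0)
            out.append(atom * n + "+" if n else atom + "*")
        else:                           # {m,n} and {,n}: m copies plus optional ones
            lo, hi = int(low or 0), int(high)
            if hi < lo:
                raise ValueError("bad repetition range")
            out.append(atom * lo + (atom + "?") * (hi - lo))
        pos = m.end()
    return "".join(out)

for source in (r"\d", r"\w", r"\d{2}", r"\d{3,}", r"\d{,3}"):
    print(source, "->", expand(source))
```

Anything outside that subset raises an error, which is the "bark at you" behaviour described above.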

regular expression to match nested braces

I need a regular expression that matches braces correctly, i.e. one closing brace for every opening one.
For abc{abc{bc}xyz} I need it to get all of {abc{bc}xyz}, not just {abc{bc}.
I tried using (\{.*?})
This is not possible with regular expressions. A context-free grammar would be necessary for this, and regular expressions only describe regular languages.
According to this link there is an extension available for the regular expressions in .NET that can do this, but this just means that .NET regular expressions are more than just regular expressions.
This is not a task for a regular expression. What you're looking for is a parser at that point. Which means a language grammar, LL(1), LALR, recursive-descent, the dragon book, and generally a splitting migraine.
The language of balanced parentheses of arbitrary nesting depth is not a regular language. It's a context-free language.
That said, many "regular expression" implementations actually recognize more than regular languages, so this is possible with some implementation but not others.
Wikipedia
Regular language
Pumping lemma for regular languages
Context-free language
Regular expression
Many features found in modern regular expression libraries provide an expressive power that far exceeds the regular languages.
As Bryan said, regular expressions might not be the right tool here, but if you're using PHP, the manual gives an example of how you might be able to use regular expressions in a recursive/nested fashion:
$input = "plain [indent] deep [indent] deeper [/indent] deep [/indent] plain";
function parseTagsRecursive($input)
{
$regex = '#\[indent]((?:[^[]|\[(?!/?indent])|(?R))+)\[/indent]#';
if (is_array($input)) {
$input = '<div style="margin-left: 10px">'.$input[1].'</div>';
}
return preg_replace_callback($regex, 'parseTagsRecursive', $input);
}
$output = parseTagsRecursive($input);
echo $output;
I'm not sure if that'll be helpful to you or not.
This is not possible in the "standard" regular expression language. However, a few different implementations have extensions that allow you to implement it. For example, here's a blog post that explains how to do it with .NET's regex library.
Generally speaking though, this is a task that regular expressions are not really suited to.
Assuming what you want to do is select a maximal substring between { and }:
.*? is a lazy quantifier. That is, it will match the least number of characters possible. If you change your expression to {.*}, you should find it will work.
If what you want to do is to verify that the braces are matched correctly, then as the other answers have stated, this is not possible with a (single) regular expression. You can do it by scanning the string with a stack though. Or with some voodoo of iterating your regular expression over the previous maximal match. Yikes.
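A minimal sketch of the scanning approach in Python. With only one kind of bracket, a depth counter does the job of the stack; the function name outermost_braces is made up for this example.

```python
def outermost_braces(text):
    """Return the first balanced {...} span in text, or None if there is none."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i          # remember where the outermost brace opened
            depth += 1
        elif ch == "}":
            if depth == 0:
                continue           # stray closing brace before any opener: skip it
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None                    # never got back to depth 0: unbalanced

print(outermost_braces("abc{abc{bc}xyz}"))
print(outermost_braces("abc{abc{bc}xyz"))
```

For several bracket types, such as (), [] and {}, the counter would have to become a real stack of expected closers.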