Regex in C++ for matching some patterns


I want a regex for this.
add x2, x1, x0 is a valid instruction;
I want to implement this, but I am a bit confused about how, as I am a newbie with regex. Can anyone share such a regex?

If this is a longer project and will have more requirements later, then a different approach would definitely be better.
The standard approach to such a problem is to define a grammar and then create a lexer and a parser. The tools lex/yacc or flex/bison can be used for that. Alternatively, a simple shift/reduce parser can be handcrafted.
The language that you sketched with the given grammar can indeed be specified with a Chomsky type-3 grammar, that is, a regular grammar. It can therefore be parsed with regular expressions.
The specification is a little unclear as to what a register is and whether there are more keywords. ecall in particular is unclear.
But how to build such a regex?
You define small tokens and concatenate them. Different paths can be implemented with the or operator |.
Let's look at some examples.
A register may be matched with a\d+, that is, an "a" followed by some digits. If other letters are allowed as well, you could use [a-z]\d+
Opcodes with the same number of parameters can be listed with a simple or |, as in add|sub
For spaces there are many solutions. You may use \s+ or [ ]+ or whatever spaces you need.
To build one rule, you concatenate what you have learned so far.
Different alternatives need an or | covering the complete path.
If you want to get back the matched groups, you must enclose the needed parts in parentheses.
And with that, one of many possible solutions is:
^[ ]*((add|sub)[ ]+(a\d+)[ ]*,[ ]*(a\d+)[ ]*,[ ]*(a\d+)|(ecall))[ ]*$
See example in: regex101
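The token-by-token construction described above can be sketched as follows. This is a minimal Python sketch rather than the answerer's code; the pattern syntax used here should also be accepted by C++ std::regex in its default ECMAScript mode, but that is not tested here. It assumes registers look like [a-z]\d+ (so that x2 from the question matches), and the names REGISTER, OPCODE and parse are made up for illustration.

```python
import re

# Small tokens, as described in the answer.
SPACE = r"[ ]*"                 # optional spaces
REQUIRED_SPACE = r"[ ]+"        # at least one space after the opcode
REGISTER = r"([a-z]\d+)"        # assumption: one letter followed by digits
OPCODE = r"(add|sub)"           # opcodes that take three registers
COMMA = SPACE + "," + SPACE

# Concatenate the tokens into one rule, then join the alternatives with |.
THREE_REG = OPCODE + REQUIRED_SPACE + REGISTER + COMMA + REGISTER + COMMA + REGISTER
INSTRUCTION = re.compile("^" + SPACE + "(?:" + THREE_REG + "|(ecall))" + SPACE + "$")


def parse(line):
    """Return the matched parts as a tuple, or None if the line is invalid."""
    m = INSTRUCTION.match(line)
    if m is None:
        return None
    if m.group(5):                  # group 5 is the ecall alternative
        return ("ecall",)
    return m.group(1, 2, 3, 4)      # opcode and the three registers
```

Here parse("add x2, x1, x0") yields the opcode and the three registers, while a line that does not fit either alternative yields None.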

Related

Regex Replacement Syntax for number of replacement group occurrences

Take the sample string:
__________Hello
I want to replace lines starting with 10 x _ with 20 x _
Desired output:
____________________Hello
I can do this a number of ways, i.e:
/^(_{10})/\1\1/
/^_{10}/____________________/
/^(__________)/\1\1/
etc...
Question:
Is there a way within the regex specification/expression itself - say PCRE (or any regex library/engine for that matter) - to specify the number of occurrences of a character in the replacement?
For example:
/_{10}/_{20}/
I don't know if I'm having a mind blank or if I've just never done this, but I cannot seem to find any such thing in the regex specification docs.

It can't be done within the Regex itself.
If I have the input "39572a4872" and I want to replace it with "39572aaaaa4872", there are many simple ways to achieve that, which can include Regular expressions, but as Wiktor explained in the comment thread, the actual quantifier of the replacement is not something itself that is achieved through regex.
It may seem unimportant, since in this example I could simply apply the replacement 5 times manually or programmatically, but one of the benefits of standardized technologies is applying the same concepts in different environments, languages, even within programs.
I, as well as many others, have had a lot of success with the portability of my regex because of this.
This question was to see if specifying quantifiers for replacement strings was possible within the syntax of a regex itself, which it surely is not.
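Since the repeat count cannot live in the replacement syntax, it has to come from the host language. A minimal Python sketch of that, where the function names widen_prefix and repeat_char are invented for illustration:

```python
import re


def widen_prefix(line, char="_", old=10, new=20):
    """Replace a leading run of `old` copies of char with `new` copies."""
    pattern = "^" + re.escape(char) + "{%d}" % old
    # The replacement is built in Python; a function is used so that the
    # replacement text is never interpreted as a regex template.
    return re.sub(pattern, lambda m: char * new, line)


def repeat_char(text, char, times):
    """Replace every occurrence of char with `times` copies of it."""
    return re.sub(re.escape(char), lambda m: char * times, text)
```

With these, widen_prefix applied to ten underscores followed by Hello gives twenty underscores followed by Hello, and repeat_char("39572a4872", "a", 5) gives "39572aaaaa4872".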

Why doesn't regex support inverse matching?

Several sources linked below seem to indicate regex wasn't designed for inverse matching - why not?
Recently, while trying to put together an answer for a question about a regex to match everything that was left after a specific pattern, I encountered several issues that left me curious about the limitations of regex.
Suppose we have some string: a simple line of text. I have a regex [a-zA-Z]e that will match one letter, followed by an e. This matches 3 times, on le, ne, and te. What if I want to match everything except patterns that match the regex? Suppose I want to capture a simp, li, of, and xt., including spaces (line breaks optional.) I later learned this behavior is called inverse matching, and shortly after, that it's not something regex easily supports.
I've examined some resources, but couldn't find any concrete answer on why inverse matching isn't "good".
Negative lookaheads appear useful for determining if a matched string does not contain some specific string, and are in fact used in several answers as methods to achieve this behavior (or something similar) - but they seem designed to act as a way to disqualify matches, as opposed to capturing non-matching input.
Negative lookaheads apparently shouldn't try to do this and aren't good at it either, choosing to leave inverse matching to the language they're being used with.
My own attempt at inverse matching was pointed out to be situational and very fragile, and looks convoluted even to me. In the comments, Wiktor Stribizew mentioned that "[...] in Java, you can't write a regex that matches any text other than some multicharacter string. With capturing, something can be done, but it is inefficient[.]"
Capture groups (the other method I was considering) appear to have the potential to dramatically slow the regex in more than one language.
All of these seem to indicate regex wasn't designed for inverse pattern matching, but none of them are immediately obvious as to the reasoning behind that. Why wasn't regex designed with built-in ability to perform inverse pattern matching?

While direct regex, as you pointed out, does not easily support the functionality you want, a regex split does. Consider the following two scripts, first in Java and then in Python:
import java.util.Arrays;
String input = "a simple line of text.";
String[] parts = input.split("[a-z]e");
System.out.println(Arrays.toString(parts));
This prints:
[a simp,  li,  of , xt.]
In Python, we can try something very similar:
import re
inp = "a simple line of text."
parts = re.split(r'[a-z]e', inp)
print(parts)
This prints:
['a simp', ' li', ' of ', 'xt.']
The secret sauce which is missing in pure regex is that of parsing or iteration. A good programming language, such as those above, will expose an API which can iterate over an input string using a supplied pattern and roll up the portions between the matches.
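The iteration that a split performs can also be written out by hand, which makes it clearer where the "inverse" pieces come from. This is only a sketch of the idea using Python's re.finditer; the function name non_matching_parts is made up here.

```python
import re


def non_matching_parts(pattern, text):
    """Collect the stretches of text that lie between matches of pattern."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        parts.append(text[pos:m.start()])   # text before this match
        pos = m.end()                       # continue after the match
    parts.append(text[pos:])                # whatever follows the last match
    return parts
```

For the pattern [a-zA-Z]e from the question, this returns the same four pieces as the split calls above.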

Conditional regular expression with one section dependent on the result of another section of the regex

Is it possible to design a regular expression in a way that a part of it is dependent on another section of the same regular expression?
Consider the following example:
(ABCHEHG)[HGE]{5,1230}(EEJOPK)[DM]{5}
I want to continue this regex, and at some point I will have a section where the result of that section should depend on the result of [DM]{5}.
For example, D will be complemented by C, and M will be complemented by N.
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5}[D'M']{5}
By D' I mean C, and by M' I mean N.
So a resulting string that matches the above regex, if it has DDDMM matching to the section [DM]{5}, it should necessarily have CCCNN matching to [D'M']{5}. Therefore, the result of [D'M']{5} always depends on [DM]{5}, or in other words, what matches to [DM]{5} always dictates what will match to [D'M']{5}.
Is it possible to do such a thing with regex?
Please note that, in this example I have extremely over-simplified the problem. The regex pattern I currently have is really much more complex and longer and my actual pattern includes about 5-6 of such dependent sections.

I cannot think of a way to do this in pure regex. I would run two regexes. The first one extracts the [DM]{5} string, such as
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}
Then take the last 5 characters of the match and replace them; in C#, for example, that would be result = result.Substring(result.Length - 5, 5).Replace('D', 'C').Replace('M', 'N'). Finally, concatenate to build the second regex, like
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5} + result
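The two-regex idea can be sketched in Python as follows. This is an illustration under assumptions, not the answerer's code: the first pattern captures whatever [DM]{5} matched, the complement is computed in application code, and a second pattern is built on the fly and anchored where the first match ended. The names HEAD, COMPLEMENT and find_dependent_matches are invented for the sketch.

```python
import re

# Stage 1: everything up to and including the [DM]{5} section (captured).
HEAD = re.compile(r"ABCHEHG[HGE]{5,1230}EEJOPK([DM]{5})")
# D is complemented by C, and M by N.
COMPLEMENT = str.maketrans("DM", "CN")


def find_dependent_matches(text):
    """Return substrings whose final section is the complement of [DM]{5}."""
    results = []
    for head in HEAD.finditer(text):
        expected = head.group(1).translate(COMPLEMENT)
        # Stage 2: the rest of the pattern, ending in the required complement.
        tail = re.compile(r"[ACF]{1,1000}BBBA[CU]{2,5}" + re.escape(expected))
        m = tail.match(text, head.end())    # must start right after stage 1
        if m:
            results.append(text[head.start():m.end()])
    return results
```

Compiling a second pattern per candidate has a cost, so for long inputs with many candidates it may be worth caching the compiled tails by their expected string.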

This is pretty easy to do in Perl:
m{
ABCHEHG
[HGHE]{5,1230}
EEJOPK
( [DM]{5} )
[ACF]{1,1000}
BBBA
[CU]{2,5}
(??{ $1 =~ tr/DM/CN/r })
}x
I've added the x modifier and whitespace for better readability. I've also removed the capturing groups around the fixed strings (they're fixed strings; you already know what they're going to capture).
The crucial part is that we capture the string that was actually matched by [DM]{5} (in $1), which we then use at the end to dynamically generate a subpattern by replacing all D by C and M by N in $1.

This sounds like bioinformatics in python. Do 2-stage filtering, at regex level and at app level.
Wildcard the DM portions, so the regex is permissive in what it accepts. Bury the regex in a token generator that yields several matching sections. Have your app iterate through the generator's results, discarding any result rejected by your business logic, such as finding that one token is not the complement of another token.
Alternatively, you might push some of that work down into a complex generated regex, which will likely perform worse and be harder to debug. Your DDDMM example might be summarized as D+M+, or [DM]+, not sure if sequence matters. The complement might be C+N+, or [CN]+. Apparently there are two cases here. So start assembling a regex: stuff1 [DM]+ stuff2 [CN]+ stuff3. Then tack on '|' for alternation, and tack on the other case: stuff1 [CN]+ stuff2 [DM]+ stuff3 (or factor out suffix and prefix so alternation starts after stuff1). I can't imagine you'll be happy with such an approach, as the combinatorics get ugly, and the regex engine is forced to do lots of scanning and backtracking. And recompiling additional regexes on the fly doesn't come for free. Instead, you should use the regex engine for the simple things that it's good at, and delegate complex business logic decisions to your app.

Transform regex with character classes and repetitions to its most basic ASCII form

Is there a way, a regular expression maybe or even a library, which can transform a regular expression with character classes and repetition to its most basic ASCII form?
For example I'd like to have the following conversions:
\d -> [0-9]
\w -> [A-Za-z0-9_]
\s -> [ \t\r\n\v\f]
\d{2} -> [0-9][0-9]
\d{3,} -> [0-9][0-9][0-9]+
\d{,3} -> I don't even know how to show this...

There is a commercial product called RegexBuddy that lets you enter a regex in their syntax and then generate the version for any of a number of popular systems. There may be something similar out there for free, or you could write your own.
At its most basic, a regular expression syntax only needs two things: alternation (OR) and closure (STAR). Well, and grouping. OK, three things. Other common operators are just shortcuts, really:
x+ = xx*
x? = (|x)
[xyz] = (x|y|z)
etc.
Things like \d just map to character classes and then to alternations. Negated character classes and . map to very big alternations. :)
There are some features that don't translate, however, such as lookaround. Mapping those to something that works without the feature is not readily automatable; it will depend upon the particular circumstances motivating their use.

First, you'd have to define which transformations you want to do. As written in the comments, not all advanced features can be written in terms of simpler operators. For example, the lookaround operators have no substitute. So you're limited by the target regexp parser anyway.
Then, with this list of transformations, you should simply apply them. They can probably be written as regexps themselves, but it might be easier to write a script in Python or so to actually parse (but not evaluate) the regexp. Then it can write it back with the requested transformations applied. And bark at you if you've used too complex features.
This wouldn't be too hard, but I'm not so sure if it would be very useful either. If you need powerful regexps, use a better regexp engine. It should be easy to write a simple Python or Perl script instead of a simple Awk script, for example.
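As a rough idea of what such a script could look like, here is a minimal Python sketch covering only the conversions listed in the question. It is deliberately naive: it does not parse the regex properly, so it will mishandle shorthands inside bracket expressions (for example [\d.]), escaped backslashes, and counted groups in parentheses. For {,3} it falls back on ?, which is itself shorthand for (|x). All names (SHORTHAND, ATOM, to_basic and so on) are made up for the sketch.

```python
import re

# Shorthand classes from the question, written out as bracket expressions.
SHORTHAND = {
    r"\d": "[0-9]",
    r"\w": "[A-Za-z0-9_]",
    r"\s": r"[ \t\r\n\v\f]",
}

# One "atom": a bracket expression, an escaped character, or a plain character.
ATOM = r"(\[[^\]]*\]|\\.|[^\\\[\]{}()|*+?])"
COUNTED = re.compile(ATOM + r"\{(\d*)(,?)(\d*)\}")


def expand_shorthand(pattern):
    """Rewrite the shorthands \\d, \\w and \\s as explicit bracket expressions."""
    return re.sub(r"\\[dws]", lambda m: SHORTHAND[m.group(0)], pattern)


def expand_counts(pattern):
    """Rewrite x{n}, x{n,}, x{,m} and x{n,m} using only repetition of x, +, * and ?."""
    def repl(m):
        atom, low, comma, high = m.groups()
        if not (low or high):
            return m.group(0)               # "{}" or "{,}": leave untouched
        n = int(low) if low else 0
        if not comma:                       # x{n}
            return atom * n
        if not high:                        # x{n,}
            return atom * (n - 1) + atom + "+" if n else atom + "*"
        return atom * n + (atom + "?") * (int(high) - n)    # x{n,m} and x{,m}
    return COUNTED.sub(repl, pattern)


def to_basic(pattern):
    return expand_counts(expand_shorthand(pattern))
```

Under these assumptions, to_basic(r"\d{3,}") gives [0-9][0-9][0-9]+ as in the question, and to_basic(r"\d{,3}") gives [0-9]?[0-9]?[0-9]?.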

Differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly, there are escapes for specific characters themselves, for instance non-printing characters such as \n and \t. Then there are the similar-looking generic character groups such as \d for digits, \s for whitespace, and \w, which I just learned covers underscores as well as digits. On several occasions I found myself trying to use \a for an "alphabetical" group, but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is also a poor substitute; it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides, it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there, there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
$(f$)|\|$(e$)
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
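For engines that can be driven from a script, the same "matrix" idea can be automated instead of run by hand. The sketch below probes Python's own re module only, as a stand-in for whichever engine is under test; editors such as vim or Notepad++ would need their own driver. The probe names follow the ones used above, the sample uses [ rather than ] so that the \[ probe has something to hit, and run_probes and the two helper functions are invented for illustration.

```python
import re

SAMPLE_LINES = ["$", "*", "[", "EOL"]

# Probe name -> pattern text, mirroring the query above one probe at a time.
PROBES = {
    "escDollar": r"\$",
    "escStar": r"\*",
    "escSQBrL": r"\[",
    "Dollar": r"$",
    "Star": r"*",       # expected to be rejected by this engine
}


def run_probes(lines=SAMPLE_LINES, probes=PROBES):
    """Return {probe name: list of matched text per line, or None if rejected}."""
    table = {}
    for name, pattern in probes.items():
        try:
            compiled = re.compile(pattern)
        except re.error:
            table[name] = None              # the engine refuses this syntax
            continue
        row = []
        for line in lines:
            m = compiled.search(line)
            row.append(m.group(0) if m else None)
        table[name] = row
    return table


def unescaped_parens_group():
    """True if ( ) group: the match on "(e)" is then just "e"."""
    return re.search(r"(e)", "(e)").group(0) == "e"


def unescaped_pipe_alternates():
    """True if | is alternation: "e|f" then matches the single letter "f"."""
    return re.fullmatch(r"e|f", "f") is not None
```

Each row of the resulting table shows what a probe matched on each sample line, so an escaped dollar that matches the literal $ and an unescaped one that matches the empty string at end of line are told apart in a single run.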

The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html
