Negation of a regular expression - regex

I am not sure how it is called: negation, complementary or inversion. The concept is this. For example having alphabet "ab"
R = 'a'
!R = the regexp that matche everyhting exept what R matches
In this simple example it should be soemthing like
!R = 'b*|[ab][ab]+'
How is such a regexp called? I remeber from my studies that there is a way to calculate that, but it is something complicated and generally too hard to make by hand. Is there a nice online tool (or regular software) to do that?

jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
To complement a regex:
Convert into the equivalent NFA. This is a well-known and defined process.
Convert the NFA to a DFA via the powerset construction
Complement the DFA by making accept states not accept and vice versa.
Convert the DFA to a regular expression.
You now have the complement of the original regular expression!

If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that some/most/all regex dialects support is to use "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\#!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. If you have the text This doesn't match 'mystring'. and search for (?!mystring), then it will match everything except the 'm' in mystring.

Related

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.
A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the union of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this do already exist. Basically you would need to translate the expression into a NFA, that one into a DFA which you then can union.
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions
Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after % should match at the end of the string
targetString.indexOf(splits[1]) + splits[1].length === targetString.length;
If you can use % multiple times within the string, then the first and last splits should follow the above rules. Anything else just needs to be in the string, and .indexOf is how you can check that.
I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match two regular expressions with each other. This can however not be done by traditional regular expressions, look at this SO-question and it's answers for details.
Or perhaps you should edit the underlying Regular Expression engine for supporting wildcard based strings. Anyone being able to answer this question by extending the default implementation will be accepted as answer to this question ;-)

POSIX Regular Expressions: Excluding a word in an expression?

I am trying to create a regular expression using POSIX (Extended) Regular Expressions that I can use in my C program code.
Specifically, I have come up with the following, however, I want to exclude the word "http" within the matched expressions. Upon some searching, it doesn't look like POSIX makes it obvious for catching specific strings. I am using something called a "negative look-a-head" in the below example (i.e. the (?!http:) ). However, I fear that this may only be something available to regular expressions defined in dialects other than POSIX.
Is negative lookahead allowed? Is the logical NOT operator allowed in POSIX (i.e. ! )?
Working regular expression example:
href|HREF|src[[:space:]]=[[:space:]]\"(?!http:)[^\"]+\"[/]
If I cannot use negative-lookahead like in other dialects, what can I do to the above regular expression to filter out the specific word "http:"? Ideally, is there any way without inverse logic and ultimately creating a ridiculously long regular expression in the process? (the one I have above is quite long already, I'd rather it not look more confusing if possible)
[NOTE: I have consulted other related threads in Stack Overflow, but the most relevant ones seem to only ask this question "generically", which means answers given didn't necessarily mean they were POSIX-flavored ==> in another thread or two, I've seen the above (?!insertWordToExcludeHere) negative lookahead, but I fear it's only for PHP.)
[NOTE 2: I will take any POSIX regular expression phrasings as well, any help would be appreciated. Does anyone have a suggestion on how whatever regular expression that would filter out "http:" would look like and how it could be fit into my current regular expression, replacing the (?!http:)?]
According to http://www.regular-expressions.info/refflavors.html lookaheads and lookbehinds are not in the POSIX flavour.
You may consider thinking in terms of lexing (tokenization) and parsing if your problem is too complex to be represented cleanly as a regex.

Construction of pattern that doesn't contain binary string

I was trying to write a pattern which doesn't contain binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is that can a regular expression be written for my problem and why? And if yes, then is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Look-ahead are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.

How to write the regex for this expression

I want to match strings like this: !! so I suppose the input have the right elements but whether the they are evaluable, that is left for the evaluator!
1+(2-3)*(4/5)
what is the regex for matching this, something like this: ([0-9\+-\*/\(\)]+)? but this seems not working.
If you are only asking for a character validation, you can use
^[0-9+*/()-]*$
You don't need to escape characters in a character class (inside square brackets). And if you must include an hyphen, you HAVE to put it at the end, otherwise it would be considered as the character range operator.
That said, keep in mind this will only guarantee you that you have no other characters. It will NOT validate the structure (regexes are not the right tool for that). However, since you stated an evaluator will then process the input, that might be right for you.
You can't, this is not a regular language. Though some regexp implementations may provide additional features to match balanced parenthesis.
Regular expressions can not match arbitrary arithmetic formulas. Regexps only describe regular languages, while arithmetic formulas use a recursive grammar. See http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
A regex may be possible if you limit nesting depth, but if you want it all the way, with matching bracket detection, it will probably be very, very complicated.
If you want to match "1+(2-3)*(4/5)", then you can use this regular expression.
/1+\(2-3\)\*\(4\/5)/
What's that? That doesn't tell you what you want to know? Well, then what do you want to know? What information are you trying to extract from the string?
You can't just say "strings like this". Your question is not nearly enough clear.
If your question is to evaluate if a equation is valid then you will need a parser to Tokenize the expression than a grammar to evaluate if the expression is right.
You cant check if the equation as balanced parenthesis using regex. This is because a regular expression is equivalent to a Deterministic Finite Automata. Since the automata is finite, you will never have a automata big enough to check parenthesis.

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.
One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted
A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/