POSIX Regular Expressions: Excluding a word in an expression?

POSIX Regular Expressions: Excluding a word in an expression? - regex

I am trying to create a regular expression using POSIX (Extended) Regular Expressions that I can use in my C program code.
Specifically, I have come up with the following, however, I want to exclude the word "http" within the matched expressions. Upon some searching, it doesn't look like POSIX makes it obvious for catching specific strings. I am using something called a "negative look-a-head" in the below example (i.e. the (?!http:) ). However, I fear that this may only be something available to regular expressions defined in dialects other than POSIX.
Is negative lookahead allowed? Is the logical NOT operator allowed in POSIX (i.e. ! )?
Working regular expression example:
href|HREF|src[[:space:]]=[[:space:]]\"(?!http:)[^\"]+\"[/]
If I cannot use negative-lookahead like in other dialects, what can I do to the above regular expression to filter out the specific word "http:"? Ideally, is there any way without inverse logic and ultimately creating a ridiculously long regular expression in the process? (the one I have above is quite long already, I'd rather it not look more confusing if possible)
[NOTE: I have consulted other related threads in Stack Overflow, but the most relevant ones seem to only ask this question "generically", which means answers given didn't necessarily mean they were POSIX-flavored ==> in another thread or two, I've seen the above (?!insertWordToExcludeHere) negative lookahead, but I fear it's only for PHP.)
[NOTE 2: I will take any POSIX regular expression phrasings as well, any help would be appreciated. Does anyone have a suggestion on how whatever regular expression that would filter out "http:" would look like and how it could be fit into my current regular expression, replacing the (?!http:)?]

According to http://www.regular-expressions.info/refflavors.html lookaheads and lookbehinds are not in the POSIX flavour.
You may consider thinking in terms of lexing (tokenization) and parsing if your problem is too complex to be represented cleanly as a regex.

Related

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.

A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the union of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this do already exist. Basically you would need to translate the expression into a NFA, that one into a DFA which you then can union.
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions

Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after % should match at the end of the string
targetString.indexOf(splits[1]) + splits[1].length === targetString.length;
If you can use % multiple times within the string, then the first and last splits should follow the above rules. Anything else just needs to be in the string, and .indexOf is how you can check that.

I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match two regular expressions with each other. This can however not be done by traditional regular expressions, look at this SO-question and it's answers for details.
Or perhaps you should edit the underlying Regular Expression engine for supporting wildcard based strings. Anyone being able to answer this question by extending the default implementation will be accepted as answer to this question ;-)

Construction of pattern that doesn't contain binary string

I was trying to write a pattern which doesn't contain binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is that can a regular expression be written for my problem and why? And if yes, then is my regular expression correct?

To match a whole string that doesn't contain 101:
^(?!.*101).*$
Look-ahead are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.

You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.

Negation of a regular expression

I am not sure how it is called: negation, complementary or inversion. The concept is this. For example having alphabet "ab"
R = 'a'
!R = the regexp that matche everyhting exept what R matches
In this simple example it should be soemthing like
!R = 'b*|[ab][ab]+'
How is such a regexp called? I remeber from my studies that there is a way to calculate that, but it is something complicated and generally too hard to make by hand. Is there a nice online tool (or regular software) to do that?

jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
To complement a regex:
Convert into the equivalent NFA. This is a well-known and defined process.
Convert the NFA to a DFA via the powerset construction
Complement the DFA by making accept states not accept and vice versa.
Convert the DFA to a regular expression.
You now have the complement of the original regular expression!

If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that some/most/all regex dialects support is to use "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\#!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. If you have the text This doesn't match 'mystring'. and search for (?!mystring), then it will match everything except the 'm' in mystring.

Aren't modern regular expression dialects regular?

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so?
What features of modern regular expressions are not regular? Examples would be helpful.

The first thing that comes to mind is backreferences:
(\w*)\s\1
(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello matches, hello world doesn't.
This construct is not regular (ie: can't be generated by a regular grammar).
Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:
\((a*|(?R))*\)
This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)

A couple of examples:
Regular expressions support grouping. E.g. in Ruby: /my (group)/.match("my group")[1] will output "group". storing something in a group requires an external storage, which a finite automaton does not have.
Many languages, e.g. C#, support captures, i.e. that each match will be captured on a stack - for example the pattern (?<MYGROUP>.)* could perform multiple captures of "." in the same group.
Grouping are used for backreferencing as pointed out by the user NullUserException above. Backreferencing requires one or more external stacks with the power of a push-down-automaton (you have to be able to push something on a stack and peek or pop it afterwards.
Some engines have the possibility of seperately pushing and popping external stacks and checking whether the stack is empty. In .NET, actually (?<MYGROUP>test) pushes a stack, while (?<-MYGROUP>) pops a stack.
Some engines like the .NET engine have a balanced grouping concept - where an external stack can be both pushed and popped at the same time. Balanced grouping syntax is (?<FIRSTGROUP-LASTGROUP>) which pops the LASTGROUP and pushes the capture since the LASTGROUP index on the FIRSTGROUP stack. This can actually be used to match infinitely nested constructions which is definitely beyond the power of a finite automaton.
Probably other good examples exist :-) If you are further interessted in some of the implementation details of external stacks in combination with Regex's and balanced grouping and thus higher order automata than finite automata, I once wrote two short articles on this (http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx and http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx).
Anyway - finitieness or not - I blieve that the power that this extra stuff brings to the regular languages is great :-)
Br. Morten

A deterministic or nondeterministic finite automaton recognizes just the regular languages, which are described by regular expressions. The definition of a regular expression is simple. Let S be an alphabet. Then the empty set, the empty string, and every element of S are regular expressions (over S). Let u and v be regular expressions. Then the union (u | v), concatenation (uv), and closure (u*) of u and v are regular expressions over S. This definition is easily extended to the regular languages. No other expression is a regular expression. As pointed out, some back-references are an example. The Wikipedia pages on regular languages and expressions are good references.
In essence, certain "regular expressions" are not regular because no automaton of a particular type can be constructed to recognize them. For example, the the language
{ a^ i b^ i : i <= 0 }
is not regular. This is because the accepting automaton would require infinitely many states, but an automaton accepting regular languages must have a finite number of states.

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.

Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234

The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.

For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.
One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"

It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.

Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted

A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.

Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?

It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.

I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js