How to match Regular Expression with String containing a wildcard character? - regex

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.

A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the union of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this do already exist. Basically you would need to translate the expression into a NFA, that one into a DFA which you then can union.
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions

Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after % should match at the end of the string
targetString.indexOf(splits[1]) + splits[1].length === targetString.length;
If you can use % multiple times within the string, then the first and last splits should follow the above rules. Anything else just needs to be in the string, and .indexOf is how you can check that.

I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match two regular expressions with each other. This can however not be done by traditional regular expressions, look at this SO-question and it's answers for details.
Or perhaps you should edit the underlying Regular Expression engine for supporting wildcard based strings. Anyone being able to answer this question by extending the default implementation will be accepted as answer to this question ;-)

Related

Construction of pattern that doesn't contain binary string

I was trying to write a pattern which doesn't contain binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is that can a regular expression be written for my problem and why? And if yes, then is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Look-ahead are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.

Negation of a regular expression

I am not sure how it is called: negation, complementary or inversion. The concept is this. For example having alphabet "ab"
R = 'a'
!R = the regexp that matche everyhting exept what R matches
In this simple example it should be soemthing like
!R = 'b*|[ab][ab]+'
How is such a regexp called? I remeber from my studies that there is a way to calculate that, but it is something complicated and generally too hard to make by hand. Is there a nice online tool (or regular software) to do that?
jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
To complement a regex:
Convert into the equivalent NFA. This is a well-known and defined process.
Convert the NFA to a DFA via the powerset construction
Complement the DFA by making accept states not accept and vice versa.
Convert the DFA to a regular expression.
You now have the complement of the original regular expression!
If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that some/most/all regex dialects support is to use "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\#!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. If you have the text This doesn't match 'mystring'. and search for (?!mystring), then it will match everything except the 'm' in mystring.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably because this is a problem of serious complexity given the amount of combinations certain expressions would allow. Some regular expressions could even allow infite matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches are not limited.
One method for locating all the posibilities, is to detect where in the regular expression there are choices. For each possible choice generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well you could convert the regular expression into an equivalent finite state machine (is relatively simple and can be done algorithmly) and then recursively folow every possible path through that fsm, outputting the followed paths through the machine. It's neither very hard nor computer intensive per output (you will normally get a HUGE amount of output however). You should however take care to disallow potentielly infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted
A regular expression is intended to do nothing more than match to a pattern, that being said, the regular expression will never 'list' anything, only match. If you want to get a list of all matches I believe you will need to do it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as you are doing. But most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/

Meaning of "match" as related to Regular Expressions

I'm writing a term paper on regular expressions and I'm a bit confused regarding the way one uses the word "match" when referring to regexes. Which of the following is the correct wording to use:
"The regular expression matches the string"
or
"The string matches the regular expression"
Or are they both correct? All opinions on this are welcome! I really want to get this right and I think it would help my understanding greatly to get this clarified.
I think both are correct. It depends on what you're focusing on. If your focus is in the regular expression itself to see if it serves to work on a given string or set of strings, then you use the first sentence. In the contrary, if you are more interested in looking at a set of strings that match certain criteria, the second one is applicable. You know, a match has the meaning of some equivalence under certain conditions, so both sentences sound equivalent to me.
The string is being matched to the regular expression pattern, therefore I would say the latter is more accurate
When two things match, it is (from a logical perspective at least) irrelevant in which order you mention them.
So it depends on what you want to put focus on.
The string matches the regular expression: Focus is on the string.
The regular expression matches the string: Focus is on the regex.
The latter sounds better to me. The regex specifies a pattern that the string may match. But there's nothing really wrong with either.
If you said either one to me, I would understand what you're saying. I'm sure people have said both to me, and I never thought either one needed to be corrected.
I agree that the string matches (or not) the regular expression. To make it clear why I'd say: the regular expression defines a grammar, and a given string is either well-formed according to that grammar or not.
"The regular expression matches the string"
True if the RE matches the whole string (eg. using ^ $ or just happening to match everything). Otherwise, I would write: the regular expression has match(es) in the string.
"The string matches the regular expression"
Again, true if the regex matches everything, otherwise it sounds a bit odd.
But indeed, in the case of a whole match, the two sentences are equivalent.
Since you're looking for a regular expression within a string, it's more correct to say that you've found the regular expression since that's a one-way relationship.
But as to which matches which, that's a two way relationship and it doesn't really matter (in English, anyway - I can't vouch for other languages ), so either would be correct.
My preference would be to say that the string matches the regular expression, since the RE is the invariant part and the string changes. But that's a personal preference and is unlikely to have any bearing on reality :-)
"The string matches the regular expression" seems to be shorthand for "the string is in the language defined by and isomorphic to the regular expression."
"The regular expression matches the string" seems to be shorthand for "a parser automaton compiled from the regular expression will parse the string and halt in a final state."
I'd say:
At design time a user/develper creates a regular expression that matches a string.
At run time a regular expression engine finds a string that matches the regular expression.
(Not intended to be a definition, just an example of common usage.)
Since a regular expression represents a possibly infinite set of finite strings, I would say that it is most correct to write that "string s matches regular expression r". You could also say that "string s is member of the set generated by regular expression r".
Also, you should consider using the words accept and reject, especially if you intend to discuss finite automata in your paper.