Using SIMILAR TO for a regex? - regex

Why is the following instruction returning FALSE?
SELECT '[1-3]{5}' SIMILAR TO '22222' ;
I can't find what is wrong with that, according to the Postgres doc ...

Your basic error has already been answered.
More importantly, don't use SIMILAR TO at all. It's completely pointless:
Query performance in PostgreSQL using 'similar to'
Difference between LIKE and ~ in Postgres
Use LIKE, or the regular expression match operator ~, or other pattern matching operators:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
For 5 digits between 1 and 3 use the expression @Thomas provided.
If you actually want 5 identical digits between 1 and 3, as your example suggests, I suggest a back reference:
SELECT '22222' ~ '([1-3])\1{4}';
Related answer with more explanation:
Deleting records with number repeating more than 5
sqlfiddle demonstrating both.
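For anyone who wants to try both patterns outside the database, here is a minimal sketch using Python's re module as a stand-in. This is an assumption for illustration only: Python's re is not the Postgres regex engine, although these two simple patterns should behave the same in both. re.search is used because ~ matches anywhere in the string.

```python
import re

# Five digits, each between 1 and 3 (digits may differ).
FIVE_DIGITS_1_TO_3 = re.compile(r"[1-3]{5}")
# One digit between 1 and 3, followed by four repeats of that same digit.
FIVE_IDENTICAL = re.compile(r"([1-3])\1{4}")


def matches(pattern, text):
    """Unanchored match, comparable to the ~ operator."""
    return pattern.search(text) is not None


for candidate in ("22222", "12312", "44444"):
    print(candidate, matches(FIVE_DIGITS_1_TO_3, candidate), matches(FIVE_IDENTICAL, candidate))
```

Only the back-reference pattern rejects "12312", since its digits are in range but not identical.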

The operator is defined as:
string SIMILAR TO pattern
so the first parameter is the string that you want to compare. The second parameter is the regex to compare against.
You need:
SELECT '22222' SIMILAR TO '[1-3]{5}';
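The effect of swapping the operands can be reproduced with a small Python sketch. The helper similar_to below is hypothetical and only a rough stand-in: it assumes a pattern whose syntax is shared between SQL regular expressions and Python (no % or _ wildcards), and it uses re.fullmatch because SIMILAR TO must match the entire string.

```python
import re


def similar_to(string, pattern):
    """Rough stand-in for `string SIMILAR TO pattern` (whole-string match)."""
    return re.fullmatch(pattern, string) is not None


print(similar_to("22222", "[1-3]{5}"))   # string first, pattern second: True
print(similar_to("[1-3]{5}", "22222"))   # operands swapped: False
```

With the operands swapped, the literal pattern 22222 is tested against the text [1-3]{5}, which cannot match.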

Try:
SELECT '22222' ~ '[1-3]{5}';
SIMILAR TO is not a POSIX regular expression operator; it uses the SQL standard's own pattern syntax. From the documentation:
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common regular expression notation.
...
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here.
http://www.postgresql.org/docs/9.2/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP

Related

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.
A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the intersection of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this already exist. Basically, you would translate each expression into an NFA, convert that into a DFA, and then intersect the two DFAs.
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions
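The emptiness test on two DFAs can be sketched with a product construction. This is a minimal sketch, not a full solution: it assumes the regex-to-NFA-to-DFA conversion has already been done, and the tuple layout (start, accepting, transitions) as well as the three tiny example automata are hypothetical.

```python
from collections import deque


def intersection_is_nonempty(dfa_a, dfa_b, alphabet):
    """Breadth-first search over pairs of states (the product automaton).

    Each DFA is (start, accepting_states, transitions), where
    transitions[(state, symbol)] gives the next state. A missing
    transition is treated as a dead end.
    """
    start = (dfa_a[0], dfa_b[0])
    seen = {start}
    queue = deque([start])
    while queue:
        state_a, state_b = queue.popleft()
        if state_a in dfa_a[1] and state_b in dfa_b[1]:
            return True  # some string is accepted by both automata
        for symbol in alphabet:
            next_a = dfa_a[2].get((state_a, symbol))
            next_b = dfa_b[2].get((state_b, symbol))
            if next_a is None or next_b is None:
                continue
            pair = (next_a, next_b)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False


# Hypothetical example automata over the alphabet {a, b}.
starts_with_a = ("s", {"ok"}, {("s", "a"): "ok", ("ok", "a"): "ok", ("ok", "b"): "ok"})
ends_with_b = ("n", {"y"}, {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"})
only_b = ("e", {"e"}, {("e", "b"): "e"})

print(intersection_is_nonempty(starts_with_a, ends_with_b, "ab"))  # True, e.g. "ab"
print(intersection_is_nonempty(starts_with_a, only_b, "ab"))       # False
```

The search visits at most one pair per combination of states, so its cost is bounded by the product of the two DFA sizes; the expensive part in practice is the NFA-to-DFA step, which this sketch leaves out.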
Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after % should match at the end of the string
// (endsWith avoids false results when the suffix also occurs earlier in the string)
targetString.endsWith(splits[1]);
If you can use % multiple times within the string, then the first and last splits should follow the above rules. Anything else just needs to be in the string, and .indexOf is how you can check that.
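The split-based check, extended to several % wildcards, might look like the following Python sketch. The function name wildcard_match is hypothetical, and the sketch makes one choice beyond the description above: the middle pieces must appear in order and between the prefix and the suffix, not merely somewhere in the string.

```python
def wildcard_match(pattern, target):
    """Match a pattern containing % wildcards against a plain string."""
    parts = pattern.split("%")
    if len(parts) == 1:
        return pattern == target  # no wildcard: exact comparison

    first, *middle, last = parts
    # The prefix and suffix must both fit without overlapping.
    if len(target) < len(first) + len(last):
        return False
    if not target.startswith(first) or not target.endswith(last):
        return False

    # Each middle piece must occur, in order, between prefix and suffix.
    position = len(first)
    limit = len(target) - len(last)
    for piece in middle:
        found = target.find(piece, position, limit)
        if found == -1:
            return False
        position = found + len(piece)
    return True


print(wildcard_match("Hello %, what's up?", "Hello world, what's up?"))  # True
print(wildcard_match("Hello %, what's up?", "Hey world, what's up?"))    # False
```

Note that this matches a wildcard string against a fixed string only; it does not address matching a wildcard string against another regular expression.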
I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match two regular expressions with each other. This cannot, however, be done with traditional regular expressions; look at this SO question and its answers for details.
Or perhaps you should edit the underlying regular expression engine to support wildcard-based strings. Any answer that solves this by extending the default implementation will be accepted ;-)

POSIX Regular Expressions: Excluding a word in an expression?

I am trying to create a regular expression using POSIX (Extended) Regular Expressions that I can use in my C program code.
Specifically, I have come up with the following, however, I want to exclude the word "http" within the matched expressions. Upon some searching, it doesn't look like POSIX makes it obvious for catching specific strings. I am using something called a "negative look-a-head" in the below example (i.e. the (?!http:) ). However, I fear that this may only be something available to regular expressions defined in dialects other than POSIX.
Is negative lookahead allowed? Is the logical NOT operator allowed in POSIX (i.e. ! )?
Working regular expression example:
href|HREF|src[[:space:]]=[[:space:]]\"(?!http:)[^\"]+\"[/]
If I cannot use negative-lookahead like in other dialects, what can I do to the above regular expression to filter out the specific word "http:"? Ideally, is there any way without inverse logic and ultimately creating a ridiculously long regular expression in the process? (the one I have above is quite long already, I'd rather it not look more confusing if possible)
[NOTE: I have consulted other related threads in Stack Overflow, but the most relevant ones seem to only ask this question "generically", which means answers given didn't necessarily mean they were POSIX-flavored ==> in another thread or two, I've seen the above (?!insertWordToExcludeHere) negative lookahead, but I fear it's only for PHP.]
[NOTE 2: I will take any POSIX regular expression phrasings as well, any help would be appreciated. Does anyone have a suggestion on how whatever regular expression that would filter out "http:" would look like and how it could be fit into my current regular expression, replacing the (?!http:)?]
According to http://www.regular-expressions.info/refflavors.html lookaheads and lookbehinds are not in the POSIX flavour.
You may consider thinking in terms of lexing (tokenization) and parsing if your problem is too complex to be represented cleanly as a regex.
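One common workaround when lookahead is unavailable is to match with a lookahead-free pattern and then reject unwanted captures in ordinary code. The sketch below shows that two-step idea in Python purely for illustration (Python's re does support lookahead, so this is about the technique, not the dialect). The names ATTR and non_http_links are hypothetical, and \s stands in for [[:space:]]. In C, the same shape would use regcomp/regexec with a capture group followed by a string comparison on the captured value.

```python
import re

# No lookahead: capture every quoted href/src value first.
ATTR = re.compile(r'(?:href|src)\s*=\s*"([^"]+)"', re.IGNORECASE)


def non_http_links(html):
    """Return captured attribute values that do not start with 'http:'."""
    return [value for value in ATTR.findall(html)
            if not value.lower().startswith("http:")]


sample = '<a href="http://example.org/a">x</a><img src="img/logo.png"><a HREF = "page.html">y</a>'
print(non_http_links(sample))  # ['img/logo.png', 'page.html']
```

Keeping the exclusion in code leaves the pattern short, which was one of the concerns raised in the question.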

Negation of a regular expression

I am not sure what it is called: negation, complement or inversion. The concept is this. For example, given the alphabet "ab":
R = 'a'
!R = the regexp that matches everything except what R matches
In this simple example it should be something like
!R = 'b*|[ab][ab]+'
What is such a regexp called? I remember from my studies that there is a way to calculate it, but it is complicated and generally too hard to do by hand. Is there a nice online tool (or regular software) to do that?
jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
To complement a regex:
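1. Convert the regex into the equivalent NFA. This is a well-known and well-defined process.
2. Convert the NFA to a DFA via the powerset construction.
3. Complement the DFA by making accept states non-accepting and vice versa. The DFA must be complete for this to work, i.e. every state needs a transition for every symbol of the alphabet.
4. Convert the DFA back to a regular expression.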
You now have the complement of the original regular expression!
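The complement step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the DFA is given by hand rather than derived from a regex, the tuple layout and the names DEAD, complement and accepts are hypothetical, and input text is assumed to use only symbols from the alphabet. The example automaton is R = 'a' over the alphabet "ab" from the question.

```python
import re

DEAD = "dead"  # hypothetical sink state used to make the DFA complete


def complement(states, start, accepting, delta, alphabet):
    """Complete the DFA with a sink state, then flip accepting states."""
    all_states = set(states) | {DEAD}
    total = {(q, c): delta.get((q, c), DEAD) for q in all_states for c in alphabet}
    return all_states, start, all_states - set(accepting), total


def accepts(dfa, text):
    """Run a complete DFA over text made of alphabet symbols."""
    _, state, accepting, delta = dfa
    for ch in text:
        state = delta[(state, ch)]
    return state in accepting


# DFA for R = 'a': q0 --a--> q1, with q1 accepting.
not_r = complement({"q0", "q1"}, "q0", {"q1"}, {("q0", "a"): "q1"}, "ab")

for word in ("", "a", "b", "ab", "aa"):
    # Compare against the hand-written complement from the question.
    by_regex = re.fullmatch(r"b*|[ab][ab]+", word) is not None
    print(repr(word), accepts(not_r, word), by_regex)
```

Without the sink state, strings such as "b" would simply fall off the original automaton and be rejected by both the DFA and its supposed complement.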
If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that many (but not all) regex dialects support is to use "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\@!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. If you have the text This doesn't match 'mystring'. and search for (?!mystring), then it will match at every position except the one right before the 'm' in mystring.
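The caution above is easy to see in Python, whose re module supports the (?!...) syntax. The sample text and the helper lacks_word are hypothetical; the helper assumes single-line input, since . does not match a newline by default.

```python
import re

text = "x mystring"

# A bare negative lookahead is zero-width: it succeeds at every position
# where "mystring" does not start, so it matches almost everywhere.
positions = [m.start() for m in re.finditer(r"(?!mystring)", text)]
print(positions)  # every index from 0 to len(text), except index 2


def lacks_word(line, word="mystring"):
    """True if the line does not contain word anywhere.

    The lookahead is anchored at the start and scans the whole line,
    so it is evaluated once for the entire string.
    """
    return re.fullmatch(r"(?!.*" + re.escape(word) + r").*", line) is not None


print(lacks_word(text))           # False
print(lacks_word("nothing here")) # True
```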

emacs: Is it possible to match strings with balanced parens with emacs regex?

Something like this:
http://perl.plover.com/yak/regex/samples/slide083.html
In other words I want to match successfully on { { foo } { bar} } but not on { { foo } .
I see it's possible in perl, and in .NET. Is it possible in emacs regex?
No, so far Perl/PCRE and .NET are the only regex flavors that support arbitrary nesting (recursive patterns).
No, but if you have a particular use case to discuss you'll often find that you don't need regexes. State machines to match parentheses are pretty simple to write in Lisp. Looking at the source of Paredit is a good place to start.
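As a rough illustration of how small such a state machine is, here is a depth-counter sketch. It is written in Python rather than Emacs Lisp, and the function name is_balanced is hypothetical; it checks a single bracket type and ignores quoting or comments.

```python
def is_balanced(text, open_ch="{", close_ch="}"):
    """Return True if every open_ch has a matching close_ch, properly nested."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                return False  # a closer appeared before its opener
    return depth == 0


print(is_balanced("{ { foo } { bar} }"))  # True
print(is_balanced("{ { foo } "))          # False
```

The only state is the current nesting depth, which is exactly what a classical regular expression cannot track for arbitrary depths.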
If you are still interested have a look at cexp.el.
It is just a hack but maybe serves your purpose.
You can search for combined regular and balanced expressions with cexp-search-forward.
The built-in re-search-forward is used for regular expressions and so its syntax rules apply. Balanced expressions can be matched with the additional syntax elements \!( and \!).
The most serious restriction is that balanced expressions may not occur in groups. So a construct like \!(^{ \(\!(^{.*}$\!)\)+ }$\!) does not work because of the group containing the inner balanced expression.
Nevertheless, one useful example is matching TeX-definitions like
\def\mdo#1{{\def\next{\relax}\def\tmp{#1}\ifx\next\tmp\else\def\next{#1\mdo}\expandafter}\next}
with combined expressions like
\\def\\[[:alpha:]]+\(#[0-9]\)*\!(^{.*}$\!)
The search via cexp-search-forward with the above cexp returns the limits for the following groups:
The beginning and the end of the full match
The limits of the match for the regular expression before the balanced expression, i.e. \def\mdo#1
The limits of the captured group in the first regular expression, i.e., #1
The limits of the balanced expression, i.e., {{\def\next{\relax}\def\tmp{#1}\ifx\next\tmp\else\def\next{#1\mdo}\expandafter}\next}

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably that this is a problem of serious complexity, given the number of combinations certain expressions would allow. Some regular expressions even allow infinitely many matches.
Consider the following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches is not limited.
One method for locating all the possibilities is to detect where in the regular expression there are choices. For each possible choice, generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
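The expansion traced above can be sketched in Python for a deliberately tiny pattern subset: literal characters and bracket sets such as [CD] or [1a-c]. Everything else (quantifiers, alternation, negated sets, escapes) is out of scope for this sketch, and the function names are hypothetical.

```python
from itertools import product


def expand_set(body):
    """Expand the inside of a bracket set, e.g. '1a-c' -> ['1', 'a', 'b', 'c']."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as a-c: include both end points.
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def get_all_matches(pattern):
    """List every string matched by a pattern of literals and bracket sets."""
    choices = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            choices.append(expand_set(pattern[i + 1:end]))
            i = end + 1
        else:
            choices.append([pattern[i]])  # a literal is a one-element choice
            i += 1
    return ["".join(combo) for combo in product(*choices)]


print(get_all_matches("A[BC][DE]F"))  # ['ABDF', 'ABEF', 'ACDF', 'ACEF']
print(get_all_matches("AB[CD]1234"))  # ['ABC1234', 'ABD1234']
```

Using itertools.product replaces the recursive substitution with a Cartesian product over the per-position choices, which yields the same set of strings.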
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well, you could convert the regular expression into an equivalent finite state machine (this is relatively simple and can be done algorithmically) and then recursively follow every possible path through that FSM, outputting the followed paths through the machine. It's neither very hard nor computationally intensive per output (you will normally get a HUGE amount of output, however). You should, however, take care to disallow potentially infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted.
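The path-following idea with a length cap can be sketched as follows. This is a sketch under assumptions: the regex-to-FSM conversion is skipped, the automaton for A[BC]*D is written by hand, and the function name enumerate_matches and the transition-table layout are hypothetical.

```python
def enumerate_matches(start, accepting, delta, max_len):
    """Collect every accepted string of length <= max_len by depth-first search.

    delta maps (state, symbol) to the next state.
    """
    results = []

    def walk(state, prefix):
        if state in accepting:
            results.append(prefix)
        if len(prefix) == max_len:
            return  # length cap: stops loops such as [BC]* from running forever
        for (source, symbol), target in sorted(delta.items()):
            if source == state:
                walk(target, prefix + symbol)

    walk(start, "")
    return results


# Hand-built automaton for A[BC]*D.
delta = {
    ("q0", "A"): "q1",
    ("q1", "B"): "q1",
    ("q1", "C"): "q1",
    ("q1", "D"): "q2",
}
print(sorted(enumerate_matches("q0", {"q2"}, delta, max_len=3)))  # ['ABD', 'ACD', 'AD']
```

Raising max_len by one roughly doubles the output for this automaton, which illustrates why the amount of output grows so quickly.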
A regular expression is intended to do nothing more than match a pattern. That being said, a regular expression will never 'list' anything; it only matches. If you want a list of all matches, I believe you will need to generate it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code to list all possible matches for something as simple as what you are doing. But for most regular expressions you would not even want to attempt to list all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/