Tools that turn regular phrases into regular expressions? - regex

I know there are a lot of tools that let you create regular expressions and test phrases against them, but is there a tool that lets you type just a regular phrase or word and then generates the regular expression for you? For example, typing:
xyz555.. would generate the correct regular expression. It may not be the ideal expression, but it would be a useful learning tool.

Such analysis can't be done deterministically. It's impossible to take a single sample (or any particular number of samples) and derive one unambiguous pattern from it.
For example, your example data could mean three alphabetic characters followed by three numeric characters...
...or it could be any number of alphabetic characters followed by three numerics
...or three alphabetic followed by three '5' characters.
It's impossible to determine exactly what the pattern is when more than one pattern fits the data.
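As a small illustration of that ambiguity, here is a sketch in Python (chosen only for demonstration). The three patterns are my own renderings of the three readings listed above, assuming lowercase letters and the sample xyz555; they all accept the sample but disagree on other inputs.

```python
import re

sample = "xyz555"

# Three generalizations that are all consistent with the single sample.
candidates = {
    "three letters, three digits": r"[a-z]{3}\d{3}",
    "any letters, three digits": r"[a-z]+\d{3}",
    "three letters, three fives": r"[a-z]{3}5{3}",
}

# Every candidate accepts the sample...
all_match_sample = all(re.fullmatch(p, sample) for p in candidates.values())

# ...but they give different verdicts on other inputs,
# so the sample alone cannot tell us which one was intended.
def verdicts(text):
    return {name: bool(re.fullmatch(p, text)) for name, p in candidates.items()}

print(all_match_sample)
print(verdicts("abcd123"))
print(verdicts("abc123"))
```

A generator would have to pick one of these arbitrarily, which is why more samples (including negative ones) narrow the choice but never settle it.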

Related

What's the regular expression for an alphabet without the first occurrence of a letter?

I am trying to use FLEX to recognize some regular expressions that I need.
What I am looking for is given a set of characters, say [A-Z], I want a regular expression that can match the first letter no matter what it is, followed by a second letter that can be anything in [A-Z] besides the first letter.
For example, if I give you AB, you match it but if I give you AA you don't. So I am kind of looking for a regex that's something like
[A-Z][A-Z^Besides what was picked in the first set].
How could this be implemented for more occurrences of letters? Say if I want to match 3 letters without each new letter being anything from the previous ones. For instance ABC but not AAB.
Thank you!
(Mathematical) regular expressions have no context. In (f)lex -- where regular expressions are actually regular, unlike most regex libraries -- there is no such thing as a back-reference, positive or negative.
So the only way to accomplish your goal with flex patterns is to enumerate the possibilities, which is tedious for two letters and impractical for more. The two-letter case would be something like (abbreviated):
A[B-Z]|B[AC-Z]|C[ABD-Z]|D[A-CE-Z]|…|Z[A-Y]
The inverse expression also has 26 cases but is easier to type (and read). You could use (f)lex's first-longest-match rule to make use of it:
AA|BB|CC|DD|…|ZZ { /* Two identical letters */ }
[[:upper:]]{2} { /* This is the match */ }
Probably, neither of those is the best solution. However, I don't think I can give better advice without knowing more specifics. The key is knowing what action you want to take if the letters do match, which you don't specify. And what the other patterns are. (Recall that a lexical scanner is intended to divide the input into tokens, although you are free to ignore a token once it is identified.)
Flex does come with a number of useful features which can be used for more flexible token handling, including yyless (to rescan part or all of the token), yymore (to combine the match with the next token), and unput (to insert a character into the input stream). There is also REJECT, but you should try other solutions first. See the flex manual chapter on actions for more details.
So the simplest solution might be to just match any two capital letters, and then in the action check whether or not they are the same.
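Both ideas can be sketched outside flex. The following Python is only a stand-in for a flex action (flex actions are written in C), and the helper names are hypothetical: one function builds the 26-way enumeration mechanically, the other matches any two capitals and then compares them in ordinary code.

```python
import re
import string

LETTERS = string.ascii_uppercase

def enumerated_pattern():
    # One alternative per first letter: that letter followed by
    # a character class listing every other capital letter.
    alternatives = []
    for first in LETTERS:
        others = LETTERS.replace(first, "")
        alternatives.append(f"{first}[{others}]")
    return "|".join(alternatives)

def two_distinct_letters(token):
    # "Match broadly, then check in the action": accept any two capitals,
    # then compare the two characters in plain code.
    return bool(re.fullmatch(r"[A-Z]{2}", token)) and token[0] != token[1]

ENUMERATED = re.compile(enumerated_pattern())

print(two_distinct_letters("AB"), two_distinct_letters("AA"))
```

The check-in-code version extends to three or more letters by comparing `len(set(token))` with `len(token)`, whereas the enumeration grows far too quickly to write out.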

Can I write a regular expression that checks two lengths are equal?

I want to match strings with two numbers of equal length, like: 42-42, 0-2, 12345-54321.
I don't want to match strings where the two numbers have different lengths, like: 42-1, 000-0000.
The two parts (separated by the hyphen) must have the same length.
I wonder if it is possible to do a regexp like [0-9]{n}-[0-9]{n} with n variable but equal?
If there is no clean way to do that in one pattern (I must put it in the pattern attribute of an HTML form input), I will do something like /\d-\d|\d{2}-\d{2}|\d{3}-\d{3}|<etc>/ up to the maximum length (16 in my case).
This is not possible with a regular expression in the strict sense, because the language is not regular (a type-3 grammar): a finite automaton cannot count an unbounded run of digits and then compare a second run against that count. It is context-free (a type-2 grammar), so it can only be matched by a regex flavor that supports recursion, and the JavaScript regexes used by the HTML pattern attribute do not.
Languages higher in the hierarchy (type-1 and type-0 grammars) need a more powerful machine still, up to a Turing machine (or something equivalent, like your programming language).
More about this can be found here:
https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy
In a flavor with recursion, such as PCRE, something like this should work: ^(\d(?:(?1)|-)\d)$
Without recursion, you need to write your own, non-regular-expression check code: count the first sequence of digits, check for the minus, and then check that the same number of digits follows.
The simplest way is probably to split the string around the separator and compare the lengths of both parts.
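A minimal sketch of that split-and-compare check, written in Python for illustration (the function name and the restriction to ASCII digits are my assumptions; in a browser the same logic would be a few lines of JavaScript):

```python
def same_length_numbers(text, sep="-"):
    # Split around the first separator; `found` is empty if there is none.
    left, found, right = text.partition(sep)
    if not found:
        return False
    # Both sides must be non-empty runs of ASCII digits.
    for part in (left, right):
        if not (part.isascii() and part.isdigit()):
            return False
    # The actual constraint: equal lengths.
    return len(left) == len(right)

for s in ["42-42", "0-2", "12345-54321", "42-1", "000-0000"]:
    print(s, same_length_numbers(s))
```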
The regex tool for saying "the same thing as before" is the back-reference; however, a back-reference refers to the matched text rather than to the pattern that matched it, so there is no way to use a back-reference to .{3} to match any other 3 characters.
However, if you only need to validate a finite number of lengths, it can be (painfully) done with alternation:
\d-\d will match up to 1 character on both sides of the separator
\d-\d|\d{2}-\d{2} will match up to 2 characters on both sides of the separator
...
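Rather than typing the alternation by hand, it can be generated. This is a sketch in Python, assuming the maximum length of 16 mentioned in the question; `equal_length_pattern` is a hypothetical helper, and the resulting string is what would be pasted into the pattern attribute.

```python
import re

def equal_length_pattern(max_len):
    # One alternative per length: \d{n}-\d{n} for n = 1 .. max_len.
    return "|".join(rf"\d{{{n}}}-\d{{{n}}}" for n in range(1, max_len + 1))

pattern = equal_length_pattern(16)

# The HTML pattern attribute must match the whole value,
# so fullmatch is the closest Python equivalent for testing.
for s in ["42-42", "0-2", "12345-54321", "42-1", "000-0000"]:
    print(s, bool(re.fullmatch(pattern, s)))
```

Anything longer than the chosen maximum is rejected, which is the price of the finite enumeration.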

Regular Expression to match a number followed by that many characters?

I would like to match strings that follow this pattern: "N: N-character-string"
Valid examples:
5. Fives
12. AbcdAbcdAbcd
1. O
0.
3. Tre
Is there a way to accomplish this with a single regex? I'm happy to accept any flavor of regular expression.
No, you can't do this with a regex.
A finite automaton (the model underlying regular expressions) has no memory beyond its current state. It cannot read a number at the start of the input and then use that value to count the characters that follow.
Read up on automata theory for more theoretical background on this.
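The usual workaround is to let a regex pull the pieces apart and do the counting in code. Here is a sketch in Python; the "N. text" layout is inferred from the examples in the question, and `count_matches_length` is a hypothetical name.

```python
import re

# Digits, a dot, an optional single space, then the rest of the string.
LINE = re.compile(r"(\d+)\. ?(.*)")

def count_matches_length(line):
    m = LINE.fullmatch(line)
    if m is None:
        return False
    declared = int(m.group(1))
    # The comparison the regex itself cannot express.
    return len(m.group(2)) == declared

for s in ["5. Fives", "12. AbcdAbcdAbcd", "1. O", "0.", "3. Tre", "3. Four"]:
    print(repr(s), count_matches_length(s))
```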

Regex character interval with exception

Say I have an interval with characters ['A'-'Z']. I want to match every one of these characters except the letter 'F', and I need to do it through the ^ operator. Thus, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression for a string literal in ocamllex that starts and ends with a double quote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the double quote, because it obviously ends the string. (I am not considering escaped characters for the moment.)
Since it is very rare to find two regular expression libraries or processors with exactly the same syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
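The lookahead variants can be checked against the split-interval form in Python's re module, which supports lookahead but not the ocamllex, flex, or Java operators shown above. This sketch simply tries each pattern on every printable ASCII character; the variable and function names are my own.

```python
import re
import string

patterns = [
    r"(?!F)[A-Z]",     # negative lookahead, then the full range
    r"(?=[A-Z])[^F]",  # positive lookahead, then "anything but F"
    r"[A-EG-Z]",       # the range split into two intervals
]

def accepted(pattern):
    # The set of single characters the pattern matches in full.
    return {c for c in string.printable if re.fullmatch(pattern, c)}

expected = set(string.ascii_uppercase) - {"F"}

for p in patterns:
    print(p, accepted(p) == expected)
```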
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'
Use character class subtraction:
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resembles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.

Java tool for matching multiple regular expressions with priorities to multiple strings

I have an unlimited sequence of strings and numerous regular expressions ordered by priority. For each string in the sequence I have to find the first matching regular expression and the matched substring. Strings are not very long (<1Kb), while the number of regular expressions may vary from hundreds to thousands.
I'm looking for a Java tool that would do this job efficiently. I guess the technique should be building a DFA ahead of time.
My current option is JFLEX. The problem I can't work around in JFLEX is that its rules have no priorities and JFLEX looks for the rule matching the longest part of the text.
My question is whether my problem could be solved with JFLEX? If not, can you suggest another Java tool/technique that would do?
You could use Java regexps. Build up the alternatives into one RE string, with each alternative surrounded by '(' and ')+?' and separated by '|', with the highest-priority REs first. The '+?' is a reluctant quantifier, so each group stops after a single occurrence, and '|' alternatives are tried left to right, so at any given starting position the highest-priority RE is tried first.
For example, given a string of "zeroonetwothreefour"
'(one)+?|(onetwo)+?' will match 'one'
'(onetwo)+?|(one)+?' will match 'onetwo'
'(twothree)+?|(onetwothree)+?' will match 'onetwothree'
Note especially that in the last example the lower-priority 'onetwothree' wins because it starts earlier in the target string. The engine reports the leftmost match, and the order of the alternatives only decides between matches that start at the same position.
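If priority has to win regardless of where in the string each match starts, one option is to try the patterns one at a time in priority order and stop at the first hit. The sketch below is in Python purely for illustration, and `first_by_priority` is a hypothetical helper; the same loop can be written in Java with java.util.regex Pattern and Matcher.find(). It is a plain linear scan, not the precompiled DFA the question hopes for, so with thousands of rules it may well be too slow.

```python
import re

def first_by_priority(rules, text):
    # `rules` is a list of compiled patterns, highest priority first.
    for index, rule in enumerate(rules):
        m = rule.search(text)
        if m:
            # Return the winning rule's index and the matched substring.
            return index, m.group()
    return None

rules = [re.compile(p) for p in ("twothree", "onetwothree")]
print(first_by_priority(rules, "zeroonetwothreefour"))  # (0, 'twothree')
```

Returning the rule index alongside the substring tells the caller which rule fired, something a single large alternation only reveals through its capture groups.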