Random string that matches a regexp [duplicate] - regex

This question already has answers here:
Using Regex to generate Strings rather than match them
(12 answers)
Closed 1 year ago.
How would you go about creating a random alpha-numeric string that matches a certain regular expression?
This is specifically for creating initial passwords that fulfill regular password requirements.

Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.
Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.
One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.
There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.
Maybe that's a starting point for thinking about the problem, though!
Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.
To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.

String::Random in Perl will generate a random string from a subset of regular expressions:
#!/usr/bin/perl
use strict;
use warnings;
use String::Random qw/random_regex/;
print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!##$%^&*]'), "\n";

If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.
I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.
(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)

I have written Parsley, which consist of a Lexer and a Generator.
Lexer is for converting a regular expression-like string into a sequence of tokens.
Generator is using these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();
/**
* Generate a set of random codes based on Parsley pattern.
* Codes are guaranteed to be unique within the set.
*
* #param string $pattern Parsley pattern.
* #param int $amount Number of codes to generate.
* #param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
* #return array
*/
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);
The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.

This PHP library looks promising: ReverseRegex
Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:
([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)
Outputs
D43WF
B6 6SB
MP445FR
P9 7EX
N9 2DH
GQ28 4UL
NH1 2SL
KY2 9LS
TE4Y 0AP

You'd need to write a string generator that can parse regular expressions and generate random members of character ranges for random lengths, etc.
Much easier would be to write a random password generator with certain rules (starts with a lower case letter, has at least one punctuation, capital letter and number, at least 6 characters, etc) and then write your regex so that any passwords created with said rules are valid.

Presuming you have both a minimum length and 3-of-4* (or similar) requirement, I'd just be inclined to use a decent password generator.
I've built a couple in the past (both web-based and command-line), and have never had to skip more than one generated string to pass the 3-of-4 rule.
3-of-4: must have at least three of the following characteristics: lowercase, uppercase, number, symbol

It is possible (for example, Haskell regexp module has a test suite which automatically generates strings that ought to match certain regexes).
However, for a simple task at hand you might be better off taking a simple password generator and filtering its output with your regexp.

Related

What's the regular expression for an alphabet without the first occurrence of a letter?

I am trying to use FLEX to recognize some regular expressions that I need.
What I am looking for is given a set of characters, say [A-Z], I want a regular expression that can match the first letter no matter what it is, followed by a second letter that can be anything in [A-Z] besides the first letter.
For example, if I give you AB, you match it but if I give you AA you don't. So I am kind of looking for a regex that's something like
[A-Z][A-Z^Besides what was picked in the first set].
How could this be implemented for more occurrences of letters? Say if I want to match 3 letters without each new letter being anything from the previous ones. For instance ABC but not AAB.
Thank you!
(Mathematical) regular expressions have no context. In (f)lex -- where regular expressions are actually regular, unlike most regex libraries -- there is no such thing as a back-reference, positive or negative.
So the only way to accomplish your goal with flex patterns is to enumerate the possibilities, which is tedious for two letters and impractical for more. The two letter case would be something like (abbreviated);
A[B-Z]|B[AC-Z]|C[ABD-Z]|D[A-CE-Z]|…|Z[A-Y]
The inverse expression also has 26 cases but is easier to type (and read). You could use (f)lex's first-longest-match rule to make use of it:
AA|BB|CC|DD|…|ZZ { /* Two identical letters */ }
[[:upper:]]{2} { /* This is the match */ }
Probably, neither of those is the best solution. However, I don't think I can give better advice without knowing more specifics. The key is knowing what action you want to take if the letters do match, which you don't specify. And what the other patterns are. (Recall that a lexical scanner is intended to divide the input into tokens, although you are free to ignore a token once it is identified.)
Flex does come with a number of useful features which can be used for more flexible token handling, including yyless (to rescan part or all of the token), yymore (to combine the match with the next token), and unput (to insert a character into the input stream). There is also REJECT, but you should try other solutions first. See the flex manual chapter on actions for more details.
So the simplest solution might be to just match any two capital letters, and then in the action check whether or not they are the same.

Extracting static strings from a regular expression

I'm trying to efficiently extract static strings (strings that MUST be matched for a given regular expression to match). I've been able to do it in the simplest cases but I'm trying to discover a more robust solution.
Given a regex such as the one below
"fox jump(ed|ing|s)"
would give us
"fox,jumped,jumping,jumps"
Another example is
"fox jump(ed|ing|s)?"
which would give us
"fox,jump"
because of the optional operator
The algorithm I have is overly simple for now. It will start from the end of the regex and removes groups or a single character followed by these operators "* ?" as well as "explode" grouped OR operators "(|)". This has worked quite well but doesn't take into consideration the full syntax of a regex. You can think of it as kind of a minimal set generating process for a regex (the minimal set of strings that the regex can "generate/must match").
WHY?
I'm trying to match a bunch of text against a large set of regexes. If I can get a list of "keywords" for these regexes that is "required" I can do a quick text search for that keyword to filter the regexes I care about (ignore the ones I am guaranteed to not match or even skip that text entirely effectively not running any regexes on the text because we are guarenteed to not have a match within our set of regexes). I can organize this set of keywords in an efficient data structure (Binary Search/Trie/Aho-Corasick) to filter the set of regexes before I even try to run the text through the Finite Automata. There are extremely fast string matching algorithms that I can run as a filtering stage before I attempt to run a regular expression. I've been able to increase throughput many folds doing this simple process.
See the library Xeger which given a regular expression will give you all the possible strings that match.
You seem to only want to keep the common prefix of these strings (the part where you said to ignore optional operators) but if you do that you might capture stings that have that common prefix yet do not have the ending you want (such as "jumpy" in your example). If this is not a problem then just find the shortest string given by Xeger, assuming that optional operators occur only at the end of the regex.
If I understand your problem correctly, you are looking for a set of words such that all these words are (disjoint) substrings of any word accepted by the (given) regular expression.
My guess is that such a set will very often be empty, but nevertheless it can be found.
To find such a set, I propose the following algorithm:
Find the FA corresponding to your input regex.
Identify bridges ( https://en.wikipedia.org/wiki/Bridge_(graph_theory) ) between the starting state S and the accepting state F. This can for example be done by removing an edge E and asking whether a path exists from S to E in the FA with E removed - repeat this for all edges.
All edges that are bridges must be hit during an accepting run, and each edge corresponds to a letter of input.
You may now generate the required words by connecting subsequent bridge edges end-to-end.
I think it follows from the algorithm construction that an FA (and not a DFA) suffices for this to work. Again, a proof would be nice but I think it should work:)

Can one find out which input characters matched which part of a regex?

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.
Say I'm interested in all as that comes before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).
In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.
Or, more generally, can one find out which characters in the input string matches which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?
EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?
Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.
It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.
Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.
After you constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character, when you visit a state which corresponds to a closing parenthesis, also save the index.
When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.

Create regex from samples algorithm

AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Distance between regular expression

Can we compute a sort of distance between regular expressions ?
The idea is to mesure in which way two regular expression are similar.
You can build deterministic finite-state machines for both regular expressions and compare the transitions. The difference of both transitions can then be used to measure the distance of these regular expressions.
There are a few of metrics you could use:
The length of a valid match. Some regexs have a fixed size, some an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.
The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.
Use a large document and see how many matches each regex makes and how many of those are identical.
Are you looking for strict equivalence?
I suppose you could compute a Levenshtein Distance between the actual Regular Experssion strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.
Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to, may yield a better result.
If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:
If they both match or both don't match, score 0.
If one matches and the other doesn't, score 1.
Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. This will give you an idea of how often two regular expressions will differ for typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match for almost all random strings and your expected input is entirely random. For example the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
You might be able to analyze the structure of the regex and use biased random sampling to deliberately hit strings that match more frequently than in completely random input. For example, if both regex require that the string starts with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.
So in conclusion: unless you have a very specific situation with a restricted input set and/or restricted regular expression language, I'd say its not possible. If you do have some restrictions on your input and on the regular expression, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.
There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.
This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.
In general case, it would be quite different to make. Depending on what you need to do, you may see allowing one different character in some place as a big difference. In the other case, allowing any number of consequent but same characters may not yield much difference.
I'd like to emphasize as well that normally when they talk about distance functions, they apply them to..., well, let's call them, tokens. In our case, character sequences. What you are willing to do, is to apply this method not to those tokens, but to the rules a multitude of tokens will match. I'm not quite sure it even makes sense.
Still, I believe we could think of something, but not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?