Grammar - RegEx - containing five vowels (aeiou) - regex

I am trying to learn regular expressions. I have
L = {a, b, x, y, z, i, o, u, e, c}
I want to construct a regular expression that describes strings that contain the five vowels in alphabetical order (aeiou). Every string will contain each of the five vowels at least once.
Do I have to lay them out in order as they are in the set? like
a(b*x*y*z*i*o*u*ec*)iou
or can I mix them up like:
aeiou(b*x*y*z*c*)
Since, they are not in order in the set, does that mean the first solution is what I am looking for?

In most regex languages, you'll need something like:
[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*
That much is essentially uniform. You then have to deal with 'start of word' and 'end of word' issues, which depend on the context and the regex language. With one word per line, you can simply use '^' to anchor the start and '$' to anchor the end.
Using your preferred notation and knowing that the complete alphabet used consists of the 10 letters, and assuming you can do grouping, then you can write:
(b*c*x*y*z*)*a(b*c*x*y*z*)*e(b*c*x*y*z*)*i(b*c*x*y*z*)*o(b*c*x*y*z*)*u(b*c*x*y*z*)*
The (b*c*x*y*z*)* part says zero or more repeats of "zero or more b's followed by zero or more c's, ..., followed by zero or more z's". This does what you require; but it also demonstrates why character class notation is such a good idea.
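A quick way to compare the two forms is to run both against a few sample words. The sketch below uses Python's re.fullmatch in place of explicit ^ and $ anchors; the sample words and the helper name in_order are illustrative assumptions, not part of the question. Note that both patterns accept each vowel exactly once, so a word such as aaeiou is rejected.

```python
import re

# Character-class form: any run of non-vowels around each vowel.
CLASS_FORM = re.compile(
    r"[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*"
)

# Grouped form for the 10-letter alphabet {a, b, c, e, i, o, u, x, y, z}.
GAP = r"(?:b*c*x*y*z*)*"
GROUPED_FORM = re.compile(GAP + GAP.join("aeiou") + GAP)


def in_order(word):
    """Return (class_form_result, grouped_form_result) for one word."""
    return (
        CLASS_FORM.fullmatch(word) is not None,
        GROUPED_FORM.fullmatch(word) is not None,
    )


for word in ["aeiou", "xaybecizoub", "aeio", "aieou", "aaeiou"]:
    print(word, in_order(word))
```

fullmatch requires the whole word to match, which plays the role of ^ and $ when there is one word per string.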

Related

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of a name suffix, for example JR/SR or any other suffix made up of I, V and X, using a regular expression. To accomplish this I have implemented the following regex:
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combinations, e.g.
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still a couple of combinations that this regex lets through, e.g. 'Last Name IXV' or 'Last Name VXI'.
I am not able to debug these two. Please suggest which part of this regex I should change to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
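To try the pattern above from Python, here is a minimal sketch. One assumption to flag: Python's re module only accepts atomic groups (?>...) from version 3.11 on, so the sketch swaps the atomic group for a plain alternative, which accepts the same strings here. The helper name has_suffix is made up for illustration.

```python
import re

# Same alternatives as the pattern above, minus the atomic group (?>...),
# which Python's re module only supports from version 3.11 onward.
SUFFIX = re.compile(r".+\b(?:[JS]R|X|I|J|V)+$")


def has_suffix(name):
    """True if the last word of name is built only from JR, SR, X, I, J, V."""
    return SUFFIX.match(name) is not None


for name in ["Last Name I", "Last Name IXV", "Last Name VXI",
             "Last Name SRX", "Last Name"]:
    print(repr(name), has_suffix(name))
```

A name would then be rejected whenever has_suffix returns True.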
To solve this I worked on the above regex a little more. Here is the final result, which can successfully match Roman numerals up to thirty built from I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have taken into consideration inputs that are standalone, that is, SR or XXV. I observed the incorrect patterns and excluded them so that they do not produce a positive match.
Separate input has been ensured using \b, the word boundary.
Word boundary: a zero-width position between a word character and a non-word character (or the start or end of the string), so wrapping a pattern in \b...\b makes it match whole words only.
The incorrect patterns are excluded in the following way:
using the negative lookahead (?!(IIX|IIV|IVV|IXX|IXI))
How I have arrived on this solution is given as follows:
I have observed closely all the pattern first, that from I to X - that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
We have an I, V, or X in the first position. Then there is another I, X, or V in the second position, and after that again an I or V. I have implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start with I, then look for any of I, V, or X, between zero and three characters, and reject the invalid numerals listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I have handled the other combinations, given below, in a similar way.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a name suffix, i.e. JR or SR, one can use the following regex: \b[S|J]R\b (inside a character class | is a literal character, so [SJ] is the stricter way to write it)
The last part (^$) matches a blank string, in other words the case where no input has been provided in the input box or textbox.
Feel free to post any questions or suggestions.
Thanks!
PS: This regex is simply a solution to validate Roman numerals from 1 to 30 using I, V, and X. I hope it helps every regex newbie learn a bit.
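For comparison, the numerals I to XXX can also be described by composing a tens part and a units part. The following is a minimal sketch of that alternative construction, not the pattern from the answer above; the names ROMAN_1_TO_30, to_roman and is_roman_1_to_30 are made up for illustration. It checks the pattern exhaustively against every string of I, V and X up to six characters.

```python
import re
from itertools import product

# Up to two X's followed by a units part, or exactly XXX for 30.
# (?!$) rejects the empty string, which the units part alone would allow.
ROMAN_1_TO_30 = re.compile(r"(?!$)(?:X{0,2}(?:IX|IV|V?I{0,3})|XXX)")


def to_roman(n):
    """Roman numeral for 1 <= n <= 30 (reference used only for the check)."""
    units = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
    return "X" * (n // 10) + units[n % 10]


def is_roman_1_to_30(text):
    return ROMAN_1_TO_30.fullmatch(text) is not None


# Every string over {I, V, X} of length 0..6 that the pattern accepts.
candidates = ("".join(p) for k in range(0, 7) for p in product("IVX", repeat=k))
accepted = {s for s in candidates if is_roman_1_to_30(s)}
expected = {to_roman(n) for n in range(1, 31)}
print(accepted == expected)
```

For the name-suffix use case, the same pattern could be wrapped in \b...\b and combined with [JS]R.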
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically match, but don't capture, endings like JR etc.
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.

How to find consonant clusters with regex?

I want to find consonant clusters with regex. An example of a cluster is mpl in examples.
To start, I filtered out all the vowels and replaced them with spaces. With vowels filtered out, examples is x mpl s.
How can I filter out the x and the s too?
Seems like you want something like this,
(?:(?![aeiou])[a-z]){2,}
(?![aeiou])[a-z] matches a lowercase consonant: the negative lookahead rejects a, e, i, o and u, and [a-z] then consumes any other lowercase letter.
(?:(?![aeiou])[a-z]){2,} matches two or more of them in succession.
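As a minimal sketch of how this pattern could be used from Python (the helper name clusters is an assumption for illustration):

```python
import re

# Two or more consecutive lowercase letters that are not vowels.
CLUSTER = re.compile(r"(?:(?![aeiou])[a-z]){2,}")


def clusters(word):
    """Return all consonant clusters in word, lowercased first."""
    return CLUSTER.findall(word.lower())


print(clusters("examples"))
print(clusters("strength"))
```

Because the group is non-capturing, findall returns the whole matched cluster rather than only its last letter.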
Since your working definition of "consonant cluster" is two or more consonants in succession, you can simply use the following pattern (case-insensitively if you want to handle capital consonants):
[bcdfghjklmnpqrstvwxyz]{2,}
[bcdfghjklmnpqrstvwxyz] – a simple whitelist character class for consonants (i.e. that will only match a consonant)
{2,} – two or more in succession
You can test the pattern against a couple of input strings in a related regex fiddle.
Note that since vowels are "a, e, i, o, u, and sometimes y", I have included y in the whitelist character class for consonants above.
You could drop y and use...
[bcdfghjklmnpqrstvwxz]{2,}
...if you want to unconditionally treat y as a vowel rather than a consonant; but the rules for when y is a consonant are a bit more complicated than a simple regex will handle (basically requiring that you identify syllables first, then y's location within them).
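To see the effect of treating y as a consonant or as a vowel, here is a small sketch that runs both whitelist patterns case-insensitively; the sample words and the names WITH_Y and WITHOUT_Y are chosen purely for illustration.

```python
import re

# y counted as a consonant
WITH_Y = re.compile(r"[bcdfghjklmnpqrstvwxyz]{2,}", re.IGNORECASE)
# y counted as a vowel
WITHOUT_Y = re.compile(r"[bcdfghjklmnpqrstvwxz]{2,}", re.IGNORECASE)

for word in ["examples", "rhythm", "Mystery"]:
    print(word, WITH_Y.findall(word), WITHOUT_Y.findall(word))
```

Words without a y give the same result under both patterns; the two only diverge where a y sits next to other consonants.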
Turning a comment into an answer…
As you changed vowels into white space: Search for \b.\b (or \b\w\b to target a bit better) and replace with a blank - to get rid of all isolated letters, leaving you with sequences of at least two.
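The two-step idea can be sketched in Python as follows; the function name clusters_by_blanking is hypothetical, and the vowel-blanking step is reproduced from the question so the example is self-contained.

```python
import re


def clusters_by_blanking(word):
    """Blank out vowels, then blank out isolated letters, and split."""
    no_vowels = re.sub(r"[aeiou]", " ", word)        # "examples" -> " x mpl s"
    no_singles = re.sub(r"\b\w\b", " ", no_vowels)   # drop one-letter words
    return no_singles.split()


print(clusters_by_blanking("examples"))
```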

ReGex - matching a logical implication

Given the alphabet {a, b, c}, how can I create a simple regular expression which matches exactly those words which meet the following criteria:
If the string "aa" occurs, then consequently "cc" must also occur (Note the logical implication).
The order of occurrence doesn't matter ("cc" as well as "aa" can occur first).
Due to the former logical implication (if-then relationship), the string "cc" can occur even without "aa", but not vice versa.
I am looking for a way to implement this by using these syntax elements (., *, +, ?, |, ) as well as brackets.
Example what should be matched:
cc
abba
bccb
bccaa
ε (epsilon - empty string)
What shouldn't be matched:
aa
aacbcb
abaaba
baaa
caac
I have tried the following: a?b?(ba)*(ccaa)*(aacc)*c*b?
Not sure if I should answer this, since it's homework, but here goes...
First off, you identify the clauses P:"string 'aa' occurs" and Q:"string 'cc' occurs". Then you transform P -> Q into !P v Q. Then you translate this into the regular expression formed of the two parts:
^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$
The first part rules out any 'aa' groups, and goes like this: allow any number of b's and c's at the beginning (including none), then if you find an a, force at least one different character after it. a's followed by non-a's may happen any number of times. At the end we also have the a? to allow for the string to end in an a, and we are sure it has no a before it because neither of the two preceding groups ends in a.
The second part is trivial.
Test it here: http://rubular.com/r/mAPHO8bulo
See it here: http://jex.im/regulex/...
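The expression can also be checked mechanically. The sketch below, with illustrative helper names accepts and expected, compares the regex against the direct statement of the implication for every word over {a, b, c} up to length six.

```python
import re
from itertools import product

IMPLICATION = re.compile(r"^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$")


def accepts(word):
    return IMPLICATION.match(word) is not None


def expected(word):
    # "aa" occurs -> "cc" occurs, i.e. (no "aa") or ("cc" occurs)
    return "aa" not in word or "cc" in word


words = ("".join(p) for n in range(0, 7) for p in product("abc", repeat=n))
mismatches = [w for w in words if accepts(w) != expected(w)]
print(mismatches)
```

An empty list means the regex and the !P v Q formulation agree on every word tested, including the examples from the question.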

How to create a regular expression to match non-consecutive characters?

How to create a regular expression for strings of a,b and c such that aa and bb will be rejected?
For example, abcabccababcccccab will be accepted and aaabc or aaabbcccc or abcccababaa will be rejected.
If this is not a purely academical question you can simply search for aa and bb and negate your logic, for example:
import re
s = 'abcccabaa'
# continue if string does not match.
if re.search('(?:aa|bb)', s) is None:
    ...
or simply scan the string for the two patterns, avoiding expensive regular expressions:
if 'aa' not in s and 'bb' not in s:
    ...
For such an easy task RE is probably total overkill.
P.S.: The examples are in Python but the principle applies to other languages of course.
^(?!.*(?:aa|bb))[abc]+$
This regex would do two things
verify that your string consist only of a,b and c
fail on aa and bb
^ matches the start of the string
(?!.*(?:aa|bb)) negative lookahead assertion, will fail if there is aa or bb in the string
[abc]+ character class, allows only a,b,c at least one (+)
$ matches the end of the string
Using the & operator (intersection) and ~ (complement):
(a|b|c)*&~(.*(aa|bb).*)
Rewriting this without these operators is tricky. The usual approach is to break it into cases.
In this case it is not all that difficult.
Suppose that the letter c is taken out of the picture. The only sequences then which don't have aa and bb are:
ε (empty string)
a
b
b?(ab)*a?
Next what we can do is insert some optional 'c' runs into all possible interior places:
ε (empty string)
a
b
(bc*)?(ac*bc*)*a?
Next, we have to acknowledge that illegal sequences like aabb become accepted if non-optional 'c' runs are put in the middle, as in for example acacbcbc. We allow a final a or b. This pattern can take care of our lone a and b cases as well as matching the empty string:
(ac+|bc+)*(a|b)?
Then combine them together:
((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)
We are almost there: we also need to recognize that this pattern can occur an arbitrary number of times, as long as there are dividing 'c'-s between the occurrences, and with arbitrary leading or trailing runs of c-s around the whole thing:
c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*
Mr. Regex Philbin, I'm not coming up with any cases that this doesn't handle, so I'm leaving it as my final answer.
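One way to gain confidence in the long expression is a brute-force comparison. The sketch below rebuilds it from its repeated piece and checks it, together with the lookahead pattern from the earlier answer, against a plain substring test for every word over {a, b, c} up to length six. The names PIECE, LONG_FORM and allowed are illustrative assumptions.

```python
import re
from itertools import product

# The repeated piece of the final expression, then the full expression.
PIECE = r"((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)"
LONG_FORM = re.compile(r"c*" + PIECE + r"(c+" + PIECE + r")*c*")

# Lookahead version from the earlier answer (it requires a non-empty string).
LOOKAHEAD = re.compile(r"^(?!.*(?:aa|bb))[abc]+$")


def allowed(word):
    """Direct statement of the requirement: no 'aa' and no 'bb'."""
    return "aa" not in word and "bb" not in word


words = ["".join(p) for n in range(0, 7) for p in product("abc", repeat=n)]
long_mismatches = [w for w in words
                   if (LONG_FORM.fullmatch(w) is not None) != allowed(w)]
lookahead_mismatches = [w for w in words
                        if (LOOKAHEAD.match(w) is not None) != (w != "" and allowed(w))]
print(long_mismatches, lookahead_mismatches)
```

The only difference between the two patterns on this alphabet is the empty string, which the long expression accepts and [abc]+ rejects.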

find a regular expression for strings containing the substring aba over the alphabet {a, b}? (formal language theory)

The question asks to find a regular expression for strings containing the substring aba over the alphabet {a, b}.
Does this mean anything can precede or follow aba, so that the regular expression would be:
(aUb)*(aba)*(aUb)*
or is the question simply looking for:
(aba)*
Note: U means union and * means 0 or more times.
Since * means 0 or more, ε is in the first language, while you do not want it (it doesn't contain aba). You are looking for (aUb)*aba(aUb)*.
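This can be checked by brute force. In the sketch below the union U is written as | in Python's regex syntax, and every string over {a, b} up to length eight is compared against a plain substring test; the variable names are illustrative.

```python
import re
from itertools import product

CONTAINS_ABA = re.compile(r"(a|b)*aba(a|b)*")   # (aUb)*aba(aUb)*
ONLY_ABA = re.compile(r"(aba)*")                # (aba)*

words = ["".join(p) for n in range(0, 9) for p in product("ab", repeat=n)]
mismatches = [w for w in words
              if (CONTAINS_ABA.fullmatch(w) is not None) != ("aba" in w)]
print(mismatches)

# (aba)* accepts the empty string and misses strings such as "babab".
print(ONLY_ABA.fullmatch("") is not None, ONLY_ABA.fullmatch("babab") is not None)
```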
A substring is defined as
noun
a string that is part of a longer string
Also note that the language of the second expression is a subset of the language of the first.
The former: any string that contains aba at least once.