How to find consonant clusters with regex? - regex

I want to find consonant clusters with regex. An example of a cluster is mpl in examples.
To start, I filtered out all the vowels and replaced them with spaces. With vowels filtered out, examples is x mpl s.
How can I filter out the x and the s too?

Seems like you want something like this,
(?:(?![aeiou])[a-z]){2,}
(?![aeiou])[a-z] matches any lowercase letter that is not a, e, i, o, or u, i.e. a lowercase consonant.
(?:(?![aeiou])[a-z]){2,} repeats that two or more times.

Since your working definition of "consonant cluster" is two or more consonants in succession, you can simply use the following pattern (case-insensitively if you want to handle capital consonants):
[bcdfghjklmnpqrstvwxyz]{2,}
[bcdfghjklmnpqrstvwxyz] – a simple whitelist character class for consonants (i.e. that will only match a consonant)
{2,} – two or more in succession
You can test the pattern against a couple input strings in a related regex fiddle.
Note that since vowels are "a, e, i, o, u, and sometimes y", I have included y in the whitelist character class for consonants above.
You could drop y and use...
[bcdfghjklmnpqrstvwxz]{2,}
...if you want to unconditionally treat y as a vowel rather than a consonant; but the rules for when y is a consonant are a bit more complicated than a simple regex will handle (basically requiring that you identify syllables first, then y's location within them).
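A minimal Python sketch of both patterns above, for comparison. The function name find_clusters and the constant names are made up for illustration; only the two regexes come from the answers.

```python
import re

# Whitelist class from the answer above; "y" is treated as a consonant here.
WHITELIST = re.compile(r"[bcdfghjklmnpqrstvwxyz]{2,}", re.IGNORECASE)
# Lookahead variant from the first answer: any letter that is not a vowel.
LOOKAHEAD = re.compile(r"(?:(?![aeiou])[a-z]){2,}", re.IGNORECASE)


def find_clusters(word, pattern=WHITELIST):
    """Return every run of two or more consecutive consonants in `word`."""
    # Neither pattern has a capturing group, so findall returns whole matches.
    return pattern.findall(word)


print(find_clusters("examples"))             # ['mpl']
print(find_clusters("examples", LOOKAHEAD))  # ['mpl']
```

The isolated x and s are skipped because {2,} needs at least two consonants in a row.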

Turning a comment into an answer…
As you changed vowels into white space: Search for \b.\b (or \b\w\b to target a bit better) and replace with a blank - to get rid of all isolated letters, leaving you with sequences of at least two.
Like RegEx101.
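The two-step replacement can be sketched in Python as follows. This assumes lowercase input, and the helper name clusters_by_blanking is hypothetical.

```python
import re


def clusters_by_blanking(word):
    """Blank out vowels, then isolated letters; return what is left."""
    # Step 1 (from the question): replace every vowel with a space.
    blanked = re.sub(r"[aeiou]", " ", word)
    # Step 2 (from the answer): blank out single letters standing alone.
    blanked = re.sub(r"\b\w\b", " ", blanked)
    # The remaining whitespace-separated runs have at least two letters.
    return blanked.split()


print(clusters_by_blanking("examples"))  # ['mpl']
```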


Perl Regex for Substituting Any Character

Essentially, I want to replace the u between the random character and the k to be an o. The output I should get from the substitution is dudok and rujok.
How can I do this in Perl? I'm very new to Perl so go easy on me.
This is what I have right now:
$text = "duduk, rujuk";
$_ = $text;
s/.uk/ok/g;
print $_; #Output: duok, ruok Expected: dudok, rujok
EDIT: Forgot to mention that the last syllable is the only one that should be changed. Also, the random character is specifically supposed to be a random consonant, not just any random character.
I should mention that this is all based on Malay language rules for grapheme to phoneme conversion.
According to this page, the Malay language uses an unaccented Latin alphabet, and it has the same consonants as English. However, its digraphs are different from English's.
ai vowel
au vowel
oi vowel
gh consonant
kh consonant
ng consonant
ny consonant
sy consonant
So, if one wanted to find a syllable ending with uk, one would look for
<syllable_boundary>(?:[bcdfhjlmpqrtvwxyz]|gh?|kh?|n[gy]?|sy?)uk
or
<syllable_boundary>uk
The OP is specifically not interested in the latter, so we simply need to look for
<syllable_boundary>(?:[bcdfhjlmpqrtvwxyz]|gh?|kh?|n[gy]?|sy?)uk
So now, we have to determine how to find a syllable boundary. ...or do we? All the consonant digraphs end with a consonant, and none of the vowel digraphs end in a consonant so we simply need to look for
[bcdfghjklmnpqrstvwxyz]uk
Finally, we can use \b to check for the end of the word, so we're interested in matching
[bcdfghjklmnpqrstvwxyz]uk\b
Now, let's use this in a substitution.
s/([bcdfghjklmnpqrstvwxyz])uk\b/$1ok/g
or
s/(?<=[bcdfghjklmnpqrstvwxyz])uk\b/ok/g
or
s/[bcdfghjklmnpqrstvwxyz]\Kuk\b/ok/g
The last one is the most efficient, but it requires Perl 5.10+. (That shouldn't be a problem given how ancient it is.)
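For readers who want to try the same substitution outside Perl, here is a rough Python equivalent of the lookbehind variant. Python's standard re module has no \K, so that form is not shown. The names CONSONANTS, FINAL_UK and fix_final_uk are invented for this sketch.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyz"
# "uk" at the end of a word, immediately preceded by a consonant.
FINAL_UK = re.compile(rf"(?<=[{CONSONANTS}])uk\b")


def fix_final_uk(text):
    """Replace a word-final consonant + 'uk' with consonant + 'ok'."""
    return FINAL_UK.sub("ok", text)


print(fix_final_uk("duduk, rujuk"))  # dudok, rujok
```

Because of the \b, only the last syllable of bukuk is changed, and auk is left alone because a is not a consonant.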
Change your regex to:
s/(.)uk/$1ok/g;
As ikegami raised, the word "bukuk" would have two substitutions. This is not the desired outcome as only the last syllable should be changed. Also, I forgot to mention that the change should only be done for a random consonant, u, and followed by k (e.g. ruk, not auk).
As such, taking everything into account that has been answered, the correct regex should be:
s/(\w*[bcdfghjklmnpqrstvwxyz])uk\b/$1ok/g;
EDIT: As ikegami has raised again, the complement of vowels - [^aeiou] will match for other characters like "-" and " " which is undesired. Updated the solution.

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of suffix name, for example, JR/SR, or other suffix made up of using I,V,X using regular expression way. To accomplish this I have implemented the following regex
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combination eg.,
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still couple of combinations remaining, which this regex can match. eg., 'Last Name IXV' or 'Last Name VXI'
These two I am not able to debug. Please suggest me in which part of this regex I can make changes to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
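A quick way to try the pattern is the Python sketch below. It drops the atomic group, since Python's re module only supports (?>...) from version 3.11; that should not change the result here because [JS]R has nothing to backtrack into. The name has_suffix is hypothetical.

```python
import re

# Same pattern as above, minus the atomic group (an assumption of this sketch).
SUFFIX = re.compile(r".+\b(?:[JS]R|X|I|J|V)+$")


def has_suffix(name):
    """True if `name` ends in a word built only from JR/SR/X/I/J/V."""
    return SUFFIX.search(name) is not None


for name in ["Last Name IXV", "Last Name VXI", "Last Name SRX", "Last Name"]:
    print(name, has_suffix(name))
```

A name that matches is one that should be rejected, which is the reverse of the negative-lookahead approach in the question.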
To solve this, I worked on the regex above a little more. Here is the final result, which is meant to match Roman numerals up to thirty built from I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here:
I have considered inputs that stand alone, for example SR or XXV. I looked at the incorrect patterns and excluded them so that they do not produce a positive match.
Each input is isolated using \b, the word boundary.
Word boundary: \b matches the position between a word character and a non-word character (or the start or end of the string), so the numeral has to be a whole word on its own.
The invalid patterns are excluded with a negative lookahead: (?!(IIX|IIV|IVV|IXX|IXI))
How I arrived at this solution:
I first looked closely at all the numerals from I to X, that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
We have an I, V, or X in the first position, then another I, X, or V in the second position, and after that again I or V. I implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start with I, then look for any of I, V, or X, zero to three times, and reject the invalid numerals listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I handled the other combinations similarly, as given below.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
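To validate a suffix such as JR or SR, one can use the following regex: \b[SJ]R\b (inside a character class, | is a literal character, so the [S|J] in the full pattern above also accepts |R; [SJ] is the intended class)
The last alternative, ^$, matches an empty string, in other words, the case where no input has been provided in the input box or text box.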
You may post any question or suggestion, if you have.
Thanks!
Ps: This regex is simply a solution to validate "roman numbers" from 1 to 30 using I, V, and X. I hope it helps to learn a bit to each and every newbie of regex.
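As an alternative sketch (not the pattern from this answer), Roman numerals from 1 to 30 can also be described as an optional tens part followed by a units part. The names ROMAN_1_TO_30, is_roman_1_to_30 and to_roman are invented for this example, and to_roman exists only to check the pattern.

```python
import re

# Zero to two X's followed by a units part, or exactly XXX for thirty.
# The lookahead (?=[IVX]) rejects the empty string.
ROMAN_1_TO_30 = re.compile(r"(?=[IVX])(?:X{0,2}(?:IX|IV|V?I{0,3})|XXX)")

UNITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]


def is_roman_1_to_30(text):
    """True if the whole of `text` is a Roman numeral between 1 and 30."""
    return ROMAN_1_TO_30.fullmatch(text) is not None


def to_roman(n):
    """Reference conversion for 1..30, used only to check the pattern."""
    return "X" * (n // 10) + UNITS[n % 10]


print(all(is_roman_1_to_30(to_roman(n)) for n in range(1, 31)))  # True
print(is_roman_1_to_30("IXV"), is_roman_1_to_30("VXI"))          # False False
```

The suffixes JR and SR would still need their own alternative, as in the pattern above.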
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically match, without capturing, endings like JR etc
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.

Using Gsub to get matched strings in R - regular expression

I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows that have two words. However, row [9], "Eulamprus tympanum marnieae", has three words, and my code returns only the last word in the string, "marnieae". How can I extract all the words after the first space, so that I get "tympanum marnieae" instead of "marnieae", with the results stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match any number of words (other than 0) after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or just spaces after the first word even if a second word doesn't exist. "Eulamprus" followed by six spaces, for example, will successfully match the pattern and return five spaces (the first space is consumed by the literal space in the pattern). You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
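The capture-group idea can be checked with a short Python sketch (Python rather than R, and without the surrounding double quotes that the regex101 patterns include). The names SPECIES and species_of are made up for illustration.

```python
import re

# First word, one space, then one or more further words captured as a group.
SPECIES = re.compile(r"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)")


def species_of(name):
    """Return everything after the first word, or None if there is no match."""
    match = SPECIES.fullmatch(name)
    return match.group(1) if match else None


print(species_of("Eulamprus tympanum marnieae"))  # tympanum marnieae
```

Using fullmatch means a name with trailing spaces or only one word returns None instead of a partial result.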

Need some help constructing a regex in Java

I'm making a tool to find open reading frames for amino acids as a personal project. I have many strings that have characters consisting of the 26 uppercase English alphabet letters (A through Z). They look like this:
GMGMGRZMQGGRZR
I want to find all possible matches that are between the letters M and Z, with some additional rules.
There should not be any Z's in between an M and a Z
Example: If EMAZAZ is the input string then MAZ should match, MAZAZ should not
There can be multiple M's between an M and a Z
Example: If the input string is GMGMGRZMQGGRZR then MGMGRZ should match, but MGRZ shouldn't since there are more M's before the first M in MGRZ that could be used to match.
For Example
With the above string (GMGMGRZMQGGRZR), only MGMGRZ and MQGGRZ should match. MGMGRZMQGGRZ, MGRZ, and MGRZAMQGGRZ should NOT match.
Does anyone know how to construct a regex like this? I consulted a few Java regex tutorials (I am using Java to write this program) but was unable to come up with a regex that followed all of the above rules.
The closest I have gotten is this regex:
M((?!(Z)).)*Z
It shows that the substrings MGMGRZ, MQGGRZ, and MGRZ match. However, I do not want MGRZ to match.
What you want is:
(M[^Z]+Z)
The regex works as follows: it matches an M, followed by one or more characters that are not a Z, up to the next Z.
The thing is that every character is consumed only once, from left to right, so in
GMGMGRZMQGGRZR
 ^----^        1st match MGMGRZ
       ^----^  2nd match MQGGRZ
Consequently, it will still match MGRZ if you feed that string to the regex on its own!
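The left-to-right, non-overlapping behaviour can be seen in a small sketch. It is written in Python rather than Java, and the names FRAME and find_frames are hypothetical.

```python
import re

# An M, then one or more non-Z characters, then the closing Z.
FRAME = re.compile(r"M[^Z]+Z")


def find_frames(sequence):
    """Return the non-overlapping M...Z matches, scanning left to right."""
    return FRAME.findall(sequence)


print(find_frames("GMGMGRZMQGGRZR"))  # ['MGMGRZ', 'MQGGRZ']
print(find_frames("EMAZAZ"))          # ['MAZ']
```

MGRZ is not reported for the first string because its characters were already consumed by the earlier match that starts at the first M.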

Grammar - RegEx - containing five vowels (aeiou)

I am trying to learn regular expression. I have
L = {a, b, x, y, z, i, o, u, e, c}
I want to construct a regular expression that describes strings that contain the five vowels in alphabetical order (aeiou). All strings will have at least one of all five vowels.
Do I have to lay them out in order as they are in the set? like
a(b*x*y*z*i*o*u*ec*)iou
or can I mix them up like:
aeiou(b*x*y*z*c*)
Since, they are not in order in the set, does that mean the first solution is what I am looking for?
In most regex languages, you'll need something like:
[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*
That much is essentially uniform. You then have to deal with 'start of word' and 'end of word' issues, which depend on the context and the regex language. With one word per line, you can simply use '^' at the start and '$' at the end.
Using your preferred notation and knowing that the complete alphabet used consists of the 10 letters, and assuming you can do grouping, then you can write:
(b*c*x*y*z*)*a(b*c*x*y*z*)*e(b*c*x*y*z*)*i(b*c*x*y*z*)*o(b*c*x*y*z*)*u(b*c*x*y*z*)*
The (b*c*x*y*z*)* part says zero or more repeats of "zero or more b's followed by zero or more c's, ..., followed by zero or more z's". This does what you require; but it also demonstrates why character class notation is such a good idea.
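Both forms can be tried with the Python sketch below. It writes the grouped part as the character class [bcxyz]*, which matches the same strings as (b*c*x*y*z*)*, and uses fullmatch in place of '^' and '$'. The names GENERAL, RESTRICTED and vowels_in_order are invented for this example.

```python
import re

# Character-class form: works for any alphabet.
GENERAL = re.compile(
    r"[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*"
)
# Form for the 10-letter alphabet; [bcxyz]* stands in for (b*c*x*y*z*)*.
RESTRICTED = re.compile(r"[bcxyz]*a[bcxyz]*e[bcxyz]*i[bcxyz]*o[bcxyz]*u[bcxyz]*")


def vowels_in_order(word, pattern=GENERAL):
    """True if the whole word has a, e, i, o, u once each, in that order."""
    return pattern.fullmatch(word) is not None


print(vowels_in_order("facetious"))              # True
print(vowels_in_order("education"))              # False
print(vowels_in_order("abeciobux", RESTRICTED))  # True
```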