Perl Regex for Substituting Any Character - regex

Essentially, I want to replace the u between the random character and the k to be an o. The output I should get from the substitution is dudok and rujok.
How can I do this in Perl? I'm very new to Perl so go easy on me.
This is what I have right now:
$text = "duduk, rujuk";
$_ = $text;
s/.uk/ok/g
print $_; #Output: duok, ruok Expected: dudok, rujok
EDIT: Forgot to mention that the last syllable is the only one that should be changed. Also, the random character is specifically supposed to be a random consonant, not just any random character.
I should mention that this is all based on Malay language rules for grapheme to phoneme conversion.

According to the this page, the Malayan language uses an unaccented latin alphabet, and it has the same consonants as the English language. However, its digraphs are different than English's.
ai vowel
au vowel
oi vowel
gh consonant
kh consonant
ng consonant
ny consonant
sy consonant
So, if one wanted to find a syllable ending with uk, one would look for
<syllable_boundary>(?:[bcdfhjlmpqrtvwxyz]|gh?|kh?|n[gv]?|sv?)uk
or
<syllable_boundary>uk
The OP is specifically disinterested in the latter, so we simply need to look for
<syllable_boundary>(?:[bcdfhjlmpqrtvwxyz]|gh?|kh?|n[gv]?|sv?)uk
So now, we have to determine how to find a syllable boundary. ...or do we? All the consonant digraphs end with a consonant, and none of the vowel digraphs end in a consonant so we simply need to look for
[bcdfghjklmnpqrstvwxyz]uk
Finally, we can use \b to check for the end of the word, so we're interested in matching
[bcdfghjklmnpqrstvwxyz]uk\b
Now, let's use this in a substitution.
s/([bcdfghjklmnpqrstvwxyz])uk\b/$1ok/g
or
s/(?<=[bcdfghjklmnpqrstvwxyz])uk\b/ok/g
or
s/[bcdfghjklmnpqrstvwxyz]\Kuk\b/ok/g
The last one is the most efficient, but it requires Perl 5.10+. (That shouldn't be a problem given how ancient it is.)

Change your regex to:
s/(.)uk/$1ok/g;

As ikegami raised, the word "bukuk" would have two substitutions. This is not the desired outcome as only the last syllable should be changed. Also, I forgot to mention that the change should only be done for a random consonant, u, and followed by k (e.g. ruk, not auk).
As such, taking everything into account that has been answered, the correct regex should be:
s/(\w*[bcdfghjklmnpqrstvwxyz])uk\b/$1ok/g;
EDIT: As ikegami has raised again, the complement of vowels - [^aeiou] will match for other characters like "-" and " " which is undesired. Updated the solution.

Related

Is there a more efficient RegEx for solving Wordle?

So I have a list of all 5 letter words in the English language that I can interrogate when I'm really stuck at Wordle. I found this an excellent exercise for brushing up on my Regular Expressions in BBEDIT, which is what I tell myself I'm doing.
The way wordle works, I can have three conditions.
A letter that is somewhere in the word (and must be present)
A letter that is not present in the word
A letter that is correct in presence and position
Condition 3 is easy. If my start word "crone" has the n in the right place, my pattern is
...n.
And I can add condition 2 fairly easily with
^(?!.*[croe])...n.
If my next guess is "burns" I'll know there's an "s"
^(?!.*[croebur])^(?=.*s)...n.
And that it's not in the last position:
^(?!.*[croebur])^(?=.*s)...n[^s]
If my next (very poor) guess is 'stone' I'll know there's a 't'.
^(?!.*[croebur])^(?=.*s)^(?=.*t)sa.n.
So that's a workable formula.
But if my next guess were "wimpy" I'd know there was an 'i' in the answer, but I have to add an additional ^(?=.*i) which just feels inefficient. I tried grouping the letters that must be in the word by using a bracket set, ^(?=.*[ist]) but of course that will match targets that contain any one of those characters rather than all.
Is there a more efficient way to express the phrase "the word must contain all of the following letters to match" than a series of "start at the beginning, scan for occurence of this single character until the end" phrases?
If you enter a word into Wordle, it displays all the matched characters in your word. It also shows the characters which exist in the word but not in the correct order.
Considering these requirements, I think you should create different rules for each letter's place. This way, your regex pattern keeps simple, and you get the search results quickly. Let me give an example:
Input word: crone
Matched Characters: ...n.
Characters in the wrong place: -
Next regex search pattern: ^[^crone][^crone][^crone]n[^crone]$
Input word: burns
Matched Characters: ...n.
Characters in the wrong place: s
Next regex search pattern: ^(?=\S*[s]\S*)[^bucrone][^bucrone][^bucrone]n[^bucrones]$ (Be careful, there is an "s" character in the last parenthesis because we know its place isn't there.)
Input word: stone
Matched Characters: s..n.
Characters in the wrong place: t
Next regex search pattern: ^(?=\S*[t]\S*)s[^tsbucrone][^sbucrone]n[^sbucrones]$ (Be careful, there is a "t" character in the first parenthesis because we know its place isn't there.)
^ => Start of the line
[^abc] => Any character except "a" and "b" and "c"
(?=\S*[t]\S*)=> There must be a "t" character in the given string
(?=\S*[t]\S*)(?=\S*[u]\S*)=> There must be "t" and "u" characters in the given string
$ => End of the line
When we look at performance tests of the regex patterns with a seven-word sample, my regex pattern found the result in 130 steps, whereas your pattern in 175 steps. The performance difference will increase as the word-list increase. You can review it from the following links:
Suggested pattern: https://regex101.com/r/mvHL3J/1
Your pattern: https://regex101.com/r/Nn8EwL/1
Note: You need to click the "Regex Debugger" link in the left sidebar to see the steps.
Note 2: I updated my response to fix the bug in the following comment.

What pattern in a grep command will match each character at most once?

I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once.
Suppose document.txt consists of "abcd abbcd"
What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
You could check if a character appears more than once and then negate the result (in your source code):
split your document into words
check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
negate the result
Explanation:
([a-z]) matches any single character
[a-z]* allows none or more characters after the one matched above
\1 is a back reference to the character found at ([a-z])
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle
parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()
for line in sys.stdin:
for candidate in line.split():
present = dict(zip(args.p, cycle((0,)))) # initialize dict of letter:count
for ch in candidate:
if ch in present:
present[ch] += 1
if all(x <= 1 for x in present.values()):
print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.
The easiest way to define such regex will be list all possible permutations of available letters (and use ? to eliminate necessity to list shorter sequences). For three letters (a,b,c) regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
The solution without regexps depends on your toolset. I would write a simple program that processes input text word by word. At the start of the word BitSet is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word if bit that corresponds to the current letter is zero it becomes one. If already marked bit occurs or letter is not in alphabet, word is skipped. If word is completely evaluated, then it's "valid".
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
first grep: a word composed by the pattern letters...
second grep: ... with no repeated letters ...
awk: ...whose length is the number of letters

How to find consonant clusters with regex?

I want to find consonant clusters with regex. An example of a cluster is mpl in examples.
To start, I filtered out all the vowels and replaced them with spaces. With vowels filtered out, examples is x mpl s.
How can I filter out the x and the s too?
Seems like you want something like this,
(?:(?![aeiou])[a-z]){2,}
(?![aeiou])[a-z] means choose any character from the lowercase alphabets but not of a or e or i or o or u
DEMO
(?![aeiou])[a-z] Matches a lowercase consonent
(?:(?![aeiou])[a-z]){2,} two or more times.
Since your working definition of "consonant cluster" is two or more consonants in succession, you can simply use the following pattern (case-insensitively if you want to handle capital consonants):
[bcdfghjklmnpqrstvwxyz]{2,}
[bcdfghjklmnpqrstvwxyz] – a simple whitelist character class for consonants (i.e. that will only match a consonant)
{2,} – two or more in succession
You can test the pattern against a couple input strings in a related regex fiddle.
Note that since vowels are "a, e, i, o, u, and sometimes y", I have included y in the whitelist character class for consonants above.
You could drop y and use...
[bcdfghjklmnpqrstvwxz]{2,}
...if you want to unconditionally treat y as a vowel rather than a consonant; but the rules for when y is a consonant are a bit more complicated than a simple regex will handle (basically requiring that you identify syllables first, then y's location within them).
Turning a comment into an answer…
As you changed vowels into white space: Search for \b.\b (or \b\w\b to target a bit better) and replace with a blank - to get rid of all isolated letters, leaving you with sequences of at least two.
Like RegEx101.

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will usr 1 2 ... m n etc for your as)
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a strict subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of charactersm drawn fromthe set, except a particular one
perhaps a particular one
any number of charactersm drawn fromthe set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
hit[a1] = false
foreach char c in string
if c not in {a1, ..., an} => fail
if hit[c] => fail
hit[c] = true
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for sillyness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}

regex to match a maximum of 4 spaces

I have a regular expression to match a persons name.
So far I have ^([a-zA-Z\'\s]+)$ but id like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: what i meant was 4 spaces anywhere in the string
Don't attempt to regex validate a name. People are allowed to call themselves what ever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or less, the original regex will be matched.
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!.
1: Get Input
2: Trim White Space
3: If this makes sence, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
what do you mean screw the regex? your obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. Php, toyed a bit with ruby, now going manically into perl.
There are some thing ( like this case ) where the regex based alternative is computationally and logically exponentially overly complex for the task.
I've parse entire php source files with regex, I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames{
my #args = #_;
my $text = shift #args;
my $num = shift #args;
# Trim Whitespace from Head/End
$text =~ s/^\s*//;
$text =~ s/\s*$//;
# Trim Bad Characters (??)
$text =~ s/[^a-zA-Z\'\s]//g;
# Tokenise By Space
my #words = split( /\s+/, $text );
#return 0..n
return #words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If there is anything ambiguous to anybody how that works, I'll be glad to explain it to them. Noted that I'm still doing it with regexps. Other languages I would have used their native "trim" functions provided where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+|)
|)
|)
|)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succintness, but the point is here the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is()))))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
#Sir Psycho : Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?
^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
regexp_replace('This is a long string with 2 chars at the end AB',
'[[[:space:]][a-zA-Z][a-zA-Z]]*$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
regular expression specifies - match and replace zero or more occurences (*) of a space ([:space:]) followed by combination of two characters ([a-zA-Z][a-zA-Z]) at the end of the line.
Hope this is useful.