Regex string with 3 or more vowels - regex

I'm trying to make a regular expression that matches a String with 3 or more vowels.
I've tried this one:
[aeiou]{3,}
But it only works when the vowels are in a sequence. Any tips ?
For example:
Samuel -> valid
Joan -> invalid
Sol Manuel -> valid
Sol -> Invalid

There are several ways to do it, and in this case keeping it simple will probably be the most helpful to future devs maintaining the code. That's the fun part about regexes: you can make them very efficient and clever, and then very hard to update for somebody who doesn't write them often.
import re
regex = "[aeiou].*[aeiou].*[aeiou]"
mylist = [
    "Samuel",  # yes!
    "JOAN",  # no!
    "Sol Manuel",  # yes!
    "",  # no!
]
for text in mylist:
    if re.search(regex, text, re.IGNORECASE):
        print("Winner!")
    else:
        print("Nein!")
You could also adjust each part to be [aeiouAEIOU] if you don't have an ignore case flag in your language of choice. Good luck! :)

just
(\w*[aeuio]\w*){3,}
or if you want line match
^(.*[aeuio].*){3,}$

This can be achieved using lookaheads like this.
Regex: ^(?=.*[aeiou].*[aeiou].*[aeiou])(?:[a-z] *)+$
Explanation:
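(?=.*[aeiou].*[aeiou].*[aeiou]) is a positive lookahead that checks that the string contains at least three vowels, with any characters before and between them.
(?:[a-z] *)+ matches one or more English letters, each optionally followed by spaces. With case-insensitive mode ON, [a-z] also covers uppercase letters.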
If Case insensitive Mode is OFF use following regex
Regex: ^(?=.*[aeiouAEIOU].*[aeiouAEIOU].*[aeiouAEIOU])(?:[a-zA-Z] *)+$

Try this pattern:
^.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou].*$
We could also use a positive lookahead:
^(?=.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou]).*$
Note that due to the possibility of backtracking I would probably prefer using the first (non lookahead) pattern because it should be more efficient.
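As a rough check, the two patterns above can be run against the examples from the question. This is only a sketch; the helper names has_three_vowels_plain and has_three_vowels_lookahead are made up for illustration.

```python
import re

# Patterns from the answer above; the vowel classes already cover both cases,
# so no ignore-case flag is needed.
PLAIN = re.compile(r"^.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou].*$")
LOOKAHEAD = re.compile(r"^(?=.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou]).*$")

def has_three_vowels_plain(text):
    # Hypothetical helper: True if the plain pattern matches.
    return PLAIN.search(text) is not None

def has_three_vowels_lookahead(text):
    # Hypothetical helper: True if the lookahead pattern matches.
    return LOOKAHEAD.search(text) is not None

samples = ["Samuel", "Joan", "Sol Manuel", "Sol"]
results = {
    s: (has_three_vowels_plain(s), has_three_vowels_lookahead(s))
    for s in samples
}
for word, (plain, look) in results.items():
    print(word, plain, look)
```

Both patterns should agree on every input, so the choice between them is mostly about readability and performance.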

I tried this using help from sniperd's answer:
import re

def multi_vowel_words(text):
    # \w* (not \w+) at both ends, so words may start or end with a vowel
    pattern = r"\w*[aeiou]\w*[aeiou]\w*[aeiou]\w*"
    return re.findall(pattern, text, re.IGNORECASE)
With the re.IGNORECASE flag this also works with uppercase vowels.
If you have numbers and underscores in your text, then use [a-zA-Z] instead of \w.

Related

Regex for Wordle

I'm using the online word game Wordle (https://www.powerlanguage.co.uk/wordle/) to sharpen my regex skills.
I could use a little help with something that I imagine regex should solve easily.
Given a 5-letter English word,
given that I know the word begins with pr,
given that I know that the letters outyase are not found in the word,
given that I know that the letter i IS found in the word,
what is the correct, most simplified regex?
My limited regex skills give me this: ^pr.[^outyase][^outyase]$ which is
a. redundant and
b. does not include the requirement to match i
If any of you regex ninjas want to lend a hand, I would be much obliged.
By the way, the correct regex should return two nouns in the English language, prick and primi. You can validate here: https://www.visca.com/regexdict/
You may use this regex with a positive and negative lookahead conditions:
^pr(?=[a-z]*i)(?![a-z]*[outyase])[a-z]{3}$
Regex Explanation:
^: Start
pr: Match pr
(?=[a-z]*i): Positive lookahead to make sure we have an i ahead after 0 or more letters
(?![a-z]*[outyase]): Negative lookahead to disallow any of the [outyase] characters
[a-z]{3}: Match 3 letters
Trivially, you can use:
^pr([^outyase][^outyase]i|[^outyase]i[^outyase]|i[^outyase][^outyase])$
Also, according to your site, there are actually four words matching, not just two:
prick
primi
primp
prink
Try
^pr(?!.*[outyase])(?=.*i)[a-z]{3}$
(?!.*[outyase]) means don't match if any of outyase is found ahead in the string.
(?=.*i) means only match if there is an i ahead in the string.
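A quick way to try this pattern out in Python is sketched below. The candidate word list is made up for illustration; a real check would run against a full dictionary.

```python
import re

# Pattern from the answer above: starts with "pr", none of "outyase" in the
# remaining letters, at least one "i", five lowercase letters in total.
PATTERN = re.compile(r"^pr(?!.*[outyase])(?=.*i)[a-z]{3}$")

# Hypothetical candidate list, only for illustration.
candidates = ["prick", "primi", "primp", "prink", "print",
              "pride", "prism", "prowl", "brick"]

matches = [word for word in candidates if PATTERN.match(word)]
print(matches)
```

Because both lookaheads sit right after pr, they only inspect the last three letters, which is all that is needed here.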
Adding a note for general usage.
For any char position:
(?!.*[<BadChars>])(?=.*<firstGoodChar>)(?=.*<SecondGoodChar>)(?=.*<ThirdGoodChar>).*
If you know mghtlc are bad and i, o, and s are good:
^(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{5}$
It's trivial to add a pinned char at the front/back:
^b(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{4}$
^(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{4}n$
A pinned char in the middle works the same way. The lookaheads at the start scan the whole string, so the found chars (i, o) can be on either side of the pinned s and the lookaheads do not need to be repeated after it:
^(?!.*[mghtlc])(?=.*i)(?=.*o).{2}s.{2}$
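For what it's worth, here is a minimal sketch of how patterns like these could be generated from the clues. The function name wordle_pattern and its parameters are made up for illustration. It keeps every lookahead at the start of the pattern, so they scan the whole word, and puts pinned letters only in the consuming part, which also covers a letter pinned in the middle.

```python
import re

def wordle_pattern(bad, required, pinned, length=5):
    """Build a regex from Wordle clues (hypothetical helper).

    bad      -- string of letters known to be absent
    required -- string of letters known to be present somewhere
    pinned   -- dict mapping 0-based position to a known letter
    """
    parts = ["^"]
    if bad:
        # Negative lookahead: none of the bad letters anywhere in the word.
        parts.append("(?!.*[" + re.escape(bad) + "])")
    for ch in required:
        # One positive lookahead per letter that must appear somewhere.
        parts.append("(?=.*" + re.escape(ch) + ")")
    # Consuming part: the pinned letter where known, otherwise any character.
    for i in range(length):
        parts.append(re.escape(pinned[i]) if i in pinned else ".")
    parts.append("$")
    return re.compile("".join(parts))

# Clues: mghtlc are absent, i and o are present, s is pinned in the middle.
pattern = wordle_pattern("mghtlc", "io", {2: "s"})

# Made-up candidate list, only for illustration.
candidates = ["bison", "visor", "oasis", "basin", "poise", "moist"]
matches = [word for word in candidates if pattern.match(word)]
print(pattern.pattern)
print(matches)
```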

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it's also selecting the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to match something without capturing it, use zero-length lookaround (lookahead or lookbehind). In your case, you can use lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without matching it".
More information on lookarounds can be found in this tutorial.
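In Python the lookahead idea could look roughly like this. Note that the pattern is simplified to \d+ followed by the lookahead, which is an assumption on my part that the zipcode is a plain run of digits.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# The lookahead (?=,y,,) has to succeed right after the digits,
# but the text it inspects is not part of the returned match.
match = re.search(r"\d+(?=,y,,)", line)
zipcode = match.group(0) if match else None
print(zipcode)
```

The date and the street number are skipped because neither is directly followed by ,y,, in the line.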
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
Here's code in python:
import re
string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
search = re.search(r"\D*(\d{5})\D*", string)
print(search.group(1))
Output:
13732

Regex match, return remaining rest of string

Simple regex function that matches the start of a string "Bananas: " and returns the second part. I've done the regex, but it's not the way I expected it to work:
import re

def return_name(s):
    m = re.match(r"^Bananas:\s?(.*)", s)
    if m:
        # print(m.group(0))
        # print(m.group(1))
        return m.group(1)

somestring = "Bananas: Gwen Stefani"  # Bananas: + name
print(return_name(somestring))  # Gwen Stefani - correct!
However, I'm convinced that you don't have to identify the group with (.*) in order to get the same result, i.e. match the first part of the string and return the remaining part. But I'm not sure how to do that.
Also, I read somewhere that you should be cautious about using .* in a regex.
You could use a lookbehind ((?<=)):
(?<=^Bananas:\s).*
Remember to use re.search instead of re.match as the latter will try to match at the start of the string (aka implicit ^).
As for the .* concerns - it can cause a lot of backtracking if you don't have a clear understanding of how regexes work, but in this case it is guaranteed to be a linear search.
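A minimal sketch of the lookbehind version, reusing the function name from the question. Unlike the original pattern with \s?, this one requires exactly one whitespace character after the colon, since Python's re module only accepts fixed-width lookbehinds.

```python
import re

def return_name(s):
    # re.search rather than re.match: the match itself starts after the
    # prefix, so it cannot begin at position 0.
    m = re.search(r"(?<=^Bananas:\s).*", s)
    return m.group(0) if m else None

name = return_name("Bananas: Gwen Stefani")
print(name)
```

Here the whole match is the name, so group(0) is used and no capture group is needed.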
Using the alternate regular expression module "regex" you could use perl's \K meta-character, which makes it able to discard previously matched content and only Keep the following.
I'm not really recommending this, I think your solution is good enough, and the lookbehind answer is also probably better than using another module just for that.

Regexp: replacing all [[??]] with {{param|??}}

I'm hoping some regexp guru to help me out with this:
I have strings such as [[AB]], [[ABC]] and [[BEC]], and I want to replace them with string {{param|AB}}, {{param|ABC}} and {{param|BEC}} respectively.
All source strings are inside [[]] and have 2 or 3 upper case letters. The idea is to transfer the letters inside brackets to the new format. It's fine if I need two different regexps for 2 and 3 letter long cases.
(if curious, this is for replacing large number of links with templates in a Mediawiki based page).
Thanks in advance!
You can replace the result of following regex :
/\[\[([A-Z]{2,3})\]\]/
with :
{{param|\1}}
Note that some regex engines use $ for capture groups, so you may need to use {{param|$1}}
If you want to exclude some words you can use a negative look ahead :
/^\[\[((?!AAA|BBB|CCC)[A-Z]{2,3})]]$/gm
Note that since the preceding regex uses anchors, you need the m (multiline) flag if you are dealing with a multiline string.
See demo https://regex101.com/r/cR8zG6/1
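If the replacement is scripted in Python, it might look something like the sketch below. The sample text is made up, and AAA, BBB and CCC are just the placeholder exclusions from the answer above.

```python
import re

text = "See [[AB]], [[ABC]] and [[BEC]], but not [[abcd]] or [[A]]."

# \1 in the replacement refers to the captured 2 or 3 uppercase letters.
converted = re.sub(r"\[\[([A-Z]{2,3})\]\]", r"{{param|\1}}", text)
print(converted)

# Variant with a negative lookahead that leaves selected codes untouched.
skipping = re.sub(
    r"\[\[((?!AAA|BBB|CCC)[A-Z]{2,3})\]\]",
    r"{{param|\1}}",
    "[[AAA]] [[AB]]",
)
print(skipping)
```

Python's re.sub uses \1 for the capture group; the $1 form applies to engines such as JavaScript's.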
You can search using this regex:
\[\[(\w+)\]\]
and replace using:
{{param|$1}}

Regex string transformation/extraction

Code:
https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg
How can I get 589944494365122 out of that string using regex?
The best I can do so far is _(.*) resulting 589944494365122_1446403980_n.jpg
First, you should generalize your problem description, like this: How can I get the longest non-empty substring of digits after the first _ in the string? The regexp you literally asked for is (589944494365122), but that's not what you expect.
According to my guess about what you want, the answer could be _(\d+).
The rule of extraction I can see in your input is:
211099_589944494365122_1446403980
[0-9]+_ part we want _[0-9]+
so a regex with look-behind and look-ahead will help:
'(?<=\d_)\d+(?=_\d)'
test with grep:
kent$ echo " https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"|grep -Po '(?<=\d_)\d+(?=_\d)'
589944494365122
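The same look-behind and look-ahead pattern should also work with Python's re module, since the look-behind has a fixed width. A small sketch (the variable name middle_id is just for illustration):

```python
import re

url = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"

# Digits that are preceded by "<digit>_" and followed by "_<digit>".
match = re.search(r"(?<=\d_)\d+(?=_\d)", url)
middle_id = match.group(0) if match else None
print(middle_id)
```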
This works:
var s = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg";
var m = /_([^_]*)/.exec(s);
console.log( m[1] ); // 589944494365122
I would go with \d+_(\d+)_\d+_n\.jpg, but depending on the exact specification of the URL this may need a little bit of tweaking.
Also depending on the language, this may need to be altered a little bit. The solution I suggest will work for instance in Ruby (as well as many other regex implementations). Here \d matches any digit and \d+ means one or more digits. I assume the letter before .jpg is always n but you may change this by either replacing n with .(any character) or with \w (any word character).
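For comparison, a rough Python version of this more explicit pattern, using \w in place of the fixed n as suggested above. The function name extract_middle_id is hypothetical.

```python
import re

# digits, "_", captured digits, "_", digits, "_", one word character, ".jpg"
PATTERN = re.compile(r"\d+_(\d+)_\d+_\w\.jpg")

def extract_middle_id(url):
    # Hypothetical helper: return the middle number, or None if the
    # file name does not follow the expected layout.
    m = PATTERN.search(url)
    return m.group(1) if m else None

url = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"
middle = extract_middle_id(url)
print(middle)
```

Spelling out the whole file name layout makes the pattern stricter, so unexpected URLs return None instead of a wrong number.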