RegEx: Any string must contain at least N chars from a specific list of chars - regex

I'm very new to learning RegEx, and need a little help updating what I have. I will use it to evaluate student spreadsheet functions. I know it isn't perfect, but I'm trying to use this as a stepping stone to a better understanding of RegEx. I currently have [DE45\\+\\s]*$ but it does not validate for criteria #4 below. Any help is greatly appreciated.
I need to validate an input so that it matches these four criteria:
Letters D and E (uppercase, in any order, in any length string)
Numbers 4 and 5 (in any order, in any length string)
Special characters: comma (,) and plus (+) (in any order, in any length string)
All six characters DE45+, must be present in the string at least once.
Results
pass: =if(D5>0,E4+D5,0)
pass: =if(D5>0,D5+E4,0)
fail: Dad Eats # 05:40
pass: Dad, Eats+Drinks # 05:40
fail: =if(E4+D5)
pass: DE45+,

The attempt you made -- with a character class -- will not work since [DE45] matches any single character in the class -- not all of them.
This type of problem is solved with a series of anchored lookaheads where all of these need to be true for a match at the anchor:
^(?=.*D)(?=.*E)(?=.*\+)(?=.*4)(?=.*5)(?=.*,)
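As a quick check, here is a minimal Python sketch that runs this lookahead pattern against the sample strings from the question. The names REQUIRED and has_all_required are only illustrative, not part of any library.

```python
import re

# Each lookahead asserts, from the start of the string, that one of the
# required characters occurs somewhere; all six must succeed for a match.
REQUIRED = re.compile(r'^(?=.*D)(?=.*E)(?=.*\+)(?=.*4)(?=.*5)(?=.*,)')

def has_all_required(s):
    """Return True if s contains D, E, 4, 5, + and , at least once each."""
    return REQUIRED.search(s) is not None

samples = [
    '=if(D5>0,E4+D5,0)',
    '=if(D5>0,D5+E4,0)',
    'Dad Eats # 05:40',
    'Dad, Eats+Drinks # 05:40',
    '=if(E4+D5)',
    'DE45+,',
]
for s in samples:
    print('pass' if has_all_required(s) else 'fail', ':', s)
```

Note that . does not match a newline by default, so for multi-line input you would probably want the re.DOTALL flag.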
Also, depending on the language, you can chain logic with regex matches. In Perl for example you would do:
/D/ && /E/ && /\+/ && /4/ && /5/ && /,/
In Python:
all(re.search(p, a_str) for p in [re.escape(c) for c in 'DE45+,'])
Of course easier still is to use a language's set functions to test that all required characters are present.
Here in Python:
set(a_str) >= set('DE45+,')
This returns True only if all the characters in 'DE45+,' are in a_str.

A Regular Expression character class (in the square brackets) is an OR search. It will match if any one of the characters in it is present, which does not allow you to verify #4.
For that you could build on top of a regex, as follows:
Find all instances of any of the characters you're looking for individually with a simple character class search. (findall using [DE45+,]+)
Merge all the found characters into one string (join)
Do a set comparison with {DE45+,}. This will only be True if all the characters are present, in any amount and in any order (set)
set(''.join(re.findall(r'[DE45+,]+','if(D5>0,4+D5,0)E'))) == set('DE45+,')
You can generalize this for any set of characters:
import re
lookfor = 'DE45+,'
lookfor_re = re.compile(f'[{re.escape(lookfor)}]+')
strings = ['=if(D5>0,E4+D5,0)', '=if(D5>0,D5+E4,0)', 'Dad Eats # 05:40', 'Dad, Eats+Drinks # 05:40', '=if(E4+D5)', 'DE45+,']
for s in strings:
    found = set(''.join(lookfor_re.findall(s))) == set(lookfor)
    print(f'{s} : {found}')
Just set lookfor as a string containing each of the characters you're looking for and strings as a list of the strings to search for. You don't need to worry about escaping any special characters with \. re.escape does this for you here.
=if(D5>0,E4+D5,0) : True
=if(D5>0,D5+E4,0) : True
Dad Eats # 05:40 : False
Dad, Eats+Drinks # 05:40 : True
=if(E4+D5) : False
DE45+, : True

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df")))  # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df")))  # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ anchors the match from the start to the end of the string
[...] matches one of the listed characters
...{3} repeats the preceding element exactly three times
(...)* repeats the group in the parentheses zero or more times
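A minimal Python sketch of how this pattern behaves on the examples from the question. The names TRIPLES and is_valid are made up for illustration.

```python
import re

# One triple of a/b/c, then zero or more ",triple" groups, and nothing else.
TRIPLES = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(s):
    """Return True if s is a comma-separated list of a/b/c triples."""
    return TRIPLES.match(s) is not None

for s in ('abc,bca,cbb', 'ccc,abc,aab,baa', 'bcb', 'abc,defx,df', 'abc,'):
    print(s, is_valid(s))
```

The trailing-comma case 'abc,' is rejected because every comma must be followed by another full triple.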
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)  # data is the string to check
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
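To illustrate the point about ^, here is a small sketch comparing re.match and re.search with the same unanchored pattern. The helper names are hypothetical.

```python
import re

# Same idea as the pattern above, deliberately without a leading ^.
PATTERN = r'([abc]{3},)*[abc]{3}$'

def matches_from_start(s):
    # re.match only attempts the match at position 0.
    return re.match(PATTERN, s) is not None

def matches_anywhere(s):
    # re.search scans forward, so it can find a match that starts later.
    return re.search(PATTERN, s) is not None

print(matches_from_start('xx,abc'))  # False
print(matches_anywhere('xx,abc'))    # True
```

So the ^ only becomes necessary if you switch from re.match to re.search.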
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
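So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# The pattern must be anchored at the end as well: without the final $,
# '^([abc]{3}(,|$))+' already succeeds on the prefix "abc,bca,".
match = re.match(r'^(?:[abc]{3},)*[abc]{3}$', data_string)
if match:
    print("data string is correct")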
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, starting at the beginning of the string (re.match only anchors the start). Nothing requires the match to reach the end of the string, so the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     # every possible triple of the letters a, b, c
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
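A small sketch of where \Z and $ differ in Python: $ also matches just before a trailing newline, while \Z only matches at the very end. The function names are illustrative only.

```python
import re

def ends_with_dollar(s):
    # $ matches at the end of the string or just before a final newline.
    return re.match(r'[abc]{3}$', s) is not None

def ends_with_Z(s):
    # \Z matches only at the very end of the string.
    return re.match(r'[abc]{3}\Z', s) is not None

print(ends_with_dollar('abc\n'))  # True
print(ends_with_Z('abc\n'))       # False
```

If the input may carry a trailing newline (for example lines read from a file), \Z is the stricter choice.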
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
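A minimal sketch of the re.findall effect mentioned above, using only the standard re module (the sample text is made up):

```python
import re

text = 'abab ab'

# With a capturing group, findall returns what the group captured
# (the last repetition), not the whole match.
with_capture = re.findall(r'(ab)+', text)

# With a non-capturing group, findall returns the whole matches.
without_capture = re.findall(r'(?:ab)+', text)

print(with_capture)     # ['ab', 'ab']
print(without_capture)  # ['abab', 'ab']
```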

convert string to regex pattern

I want to find the pattern of a regular expression from a character string. My goal is to be able to reuse this pattern to find a string in another context but checking the pattern.
From the string "1example4whatitry2do",
I want to find a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
So I can reuse this pattern to find this other example string: 2eytmpxe8wsdtmdry1uo
I can loop over each character, but I hope there is a faster way.
Thanks for your help !
You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase
lower_case = set(ascii_lowercase)  # set for faster lookup
def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)
    grp = groupby(cum)
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many > 1 else f'\\{what}'
                   for what, how_many in ((g[0], len(list(g[1]))) for g in grp))
pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The ternary in the formatting removes not needed {1} from the pattern.
See:
str.isdigit()
If you now replace '\t' with '[a-z]', your regex should fit. You could also replace the isdigit check with a regex r'\d' or with c in set(string.digits) instead.
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
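To show the generated pattern being reused on the second string from the question, here is a self-contained sketch of the same groupby idea that emits [0-9] and [a-z] directly. The helper names char_class and pattern_from are hypothetical, and any other character is simply escaped and matched literally.

```python
import re
import string
from itertools import groupby

def char_class(c):
    """Map one character to the regex fragment for its kind."""
    if c in string.digits:
        return '[0-9]'
    if c in string.ascii_lowercase:
        return '[a-z]'
    return re.escape(c)  # anything else is matched literally

def pattern_from(example):
    """Build a regex from runs of same-kind characters in example."""
    parts = []
    for fragment, run in groupby(char_class(c) for c in example):
        count = len(list(run))
        parts.append(fragment if count == 1 else f'{fragment}{{{count}}}')
    return ''.join(parts)

pat = pattern_from('1example4whatitry2do')
print(pat)  # [0-9][a-z]{7}[0-9][a-z]{8}[0-9][a-z]{2}
print(bool(re.fullmatch(pat, '2eytmpxe8wsdtmdry1uo')))  # True
```

re.fullmatch is used so that the other string has to fit the whole pattern, not just start with it.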

What pattern in a grep command will match each character at most once?

I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once.
Suppose document.txt consists of "abcd abbcd"
What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
You could check if a character appears more than once and then negate the result (in your source code):
split your document into words
check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
negate the result
Explanation:
([a-z]) matches any single lowercase letter and captures it
[a-z]* allows zero or more letters after the one matched above
\1 is a back reference to the letter captured by ([a-z])
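The three steps above could be sketched in Python roughly like this. The function name and the extra "only allowed letters" check are my own additions for illustration, assuming the words are lowercase a-z.

```python
import re

# Matches any word in which some lowercase letter occurs twice.
REPEATED = re.compile(r'([a-z])[a-z]*\1')

def words_without_repeats(text, allowed):
    """Return words made only of letters in allowed, each used at most once."""
    only_allowed = re.compile(f'[{re.escape(allowed)}]+')
    return [
        word for word in text.split()
        if only_allowed.fullmatch(word) and not REPEATED.search(word)
    ]

print(words_without_repeats('abcd abbcd', 'abcd'))  # ['abcd']
```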
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle
parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()
for line in sys.stdin:
    for candidate in line.split():
        present = dict(zip(args.p, cycle((0,))))  # initialize dict of letter:count
        for ch in candidate:
            if ch in present:
                present[ch] += 1
        if all(x <= 1 for x in present.values()):
            print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Melpomene is right, regexps are not the best instrument to solve this task. A regexp is essentially a finite state machine. In your case the current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in the regex will be 2^N, where N is the number of allowed letters.
The easiest way to define such a regex is to list all possible permutations of the available letters (and use ? to eliminate the need to list shorter sequences). For three letters (a,b,c) the regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
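If you did want to build such a regex, generating it is less painful than typing it. A rough sketch, where at_most_once_regex is a made-up helper and the ^(?:...)$ wrapper is my addition so that the whole word has to match:

```python
import re
from itertools import permutations

def at_most_once_regex(letters):
    """Build the permutation regex: every ordering, each letter optional."""
    alternatives = (
        ''.join(re.escape(c) + '?' for c in perm)
        for perm in permutations(letters)
    )
    return '^(?:' + '|'.join(alternatives) + ')$'

pattern = at_most_once_regex('abc')
print(pattern)
print(bool(re.match(pattern, 'cab')))  # True
print(bool(re.match(pattern, 'abb')))  # False
```

The number of alternatives grows as N! (6 for three letters, 24 for four), and because every letter is optional the empty string matches too.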
The solution without regexps depends on your toolset. I would write a simple program that processes the input text word by word. At the start of each word a BitSet is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word, if the bit that corresponds to the current letter is zero, it becomes one. If an already marked bit occurs or the letter is not in the alphabet, the word is skipped. If the word is completely evaluated, then it's "valid".
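A minimal Python sketch of that idea, using an int as the bit set. The function name is_valid_word and the choice of Python are assumptions on my part.

```python
def is_valid_word(word, alphabet):
    """Return True if word uses only letters from alphabet, each at most once."""
    seen = 0  # bit i is set once alphabet[i] has been seen
    for ch in word:
        index = alphabet.find(ch)
        if index == -1:
            return False  # letter is not in the alphabet
        bit = 1 << index
        if seen & bit:
            return False  # letter already occurred
        seen |= bit
    return True

text = 'abcd abbcd abx'
print([w for w in text.split() if is_valid_word(w, 'abcd')])  # ['abcd']
```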
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
first grep: a word composed by the pattern letters...
second grep: ... with no repeated letters ...
awk: ...whose length is the number of letters

Regexp variable alphanumeric matching

I am trying to match the following sample:
ZU2A ZS6D-9 ZT0ER-7 ZR6PJH-12
It is a combination of letters and numbers (alphanumeric).
Here is an explanation:
It will always start with a capital (uppercase) Z
Followed always by only ONE(1) of R,S,T or U "[R|S|T|U]"
Followed always by only ONE(1) number "[0-9]"
Followed always by a minimum of ONE(1) and optionally a maximum of THREE(3) capital (uppercase) letters like this [A-Z]{1,3}
Optionally followed by "-" and a minimum of ONE(1) and a maximum of TWO(2) numbers
At the moment I have this:
Z[R|S|T|U][0-9][A-Z]{1,}(\-)?([0-9]{1,3})
But that does not seem to catch all the samples.
EDIT: Here is a sample of a complete string:
ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!
Any help would be appreciated.
Thank You
Danny
Your main problem is that the whole optional part should be surrounded by one set of parentheses marked with ? (=optional). All in all, you want
Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?
A couple of extra notes:
In a character class, you can simply list the characters. So for the second rule (one of R, S, T or U) you want either [RSTU] or (?:R|S|T|U).
A group in the form of (?:example) instead of (example) prevents the sub-expression from being returned as a match. It has no effect on which inputs are matched.
You don't need to escape - with a backslash outside of a character class.
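To see why the | inside the brackets matters, here is a small sketch: inside a character class, | is just a literal character, so [R|S|T|U] also accepts a pipe. The variable names are illustrative.

```python
import re

with_pipes = re.compile(r'Z[R|S|T|U][0-9][A-Z]{1,3}')
plain = re.compile(r'Z[RSTU][0-9][A-Z]{1,3}')

# The class with pipes wrongly accepts a literal | in second position.
print(bool(with_pipes.match('Z|2A')))  # True
print(bool(plain.match('Z|2A')))       # False
print(bool(plain.match('ZU2A')))       # True
```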
Here's an example test case script in Python:
import re
s = r'Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?'
rex = re.compile(s)
for test in ('ZU2A', 'ZS6D-9', 'ZT0ER-7', 'ZR6PJH-12'):
    assert rex.match(test), test
long_test = 'ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!'
found = rex.findall(long_test)
assert found == ['ZU0D', 'ZT1ER', 'ZS3PJ-2', 'ZR5STU-12'], found

Input validation with Python 3.4: can only contain

I would like to allow these characters [a-z]+\.+[0-9]*\_* (Must contain one or more lowercase alphabetical characters(a-z) and Must contain one or more periods(.) also can contain zero or more digits(0-9), zero or more underscores(_)) , but no others.
I have tried multiple ways without success:
import re
iStrings = str(input('Enter string? '))
iMatch = re.findall(r'[a-z]+\.+[0-9]*\_*', iStrings)
iiMatch = re.findall(r'[~`!#$%^&*()-+={}\[]|\;:\'"<,>.?/]', iStrings)
iiiMatch = iMatch != iiMatch
if iiiMatch:
    print(':)')
else:
    print(':(')
Another example:
import re
iStrings = str(input('Enter string? '))
iMatch = re.findall(r'[a-z]+\.+[0-9]*\_*', iStrings) not "[~`!#$%^&*()-+={}\[]|\;:\'"<,>.?/]" in iStrings
if iMatch:
    print(':)')
else:
    print(':(')
Any help would be much appreciated.
Edit: added clarification.
Edit: For additional information: https://forums.uberent.com/threads/beta-mod-changes.51520/page-8#post-939265
allow these characters [a-z]+\.+[0-9]*\_*
First off, [a-z]+ is not "a" character. Neither is [0-9]* nor \_*
I am assuming that you mean you want to allow letters, digits, underscores, dots, plusses and asterisks.
Try this:
^[\w*.+]+$
The \w already matches [a-z], [0-9] and _ (it also matches uppercase letters)
The anchors ^ and $ ensure that nothing else is allowed.
From your question I wasn't clear if you wanted to allow a + character to match. If not, remove it from the character class: ^[\w*.]+$. Likewise, remove the * if it isn't needed.
In code:
if re.search(r"^[\w*.+]+$", subject):
    pass  # Successful match
else:
    pass  # Match attempt failed
EDIT following your comment:
For a string that must contain one or more letters, AND one or more dots, AND zero or more _, AND zero or more digits, we need lookaheads to enforce the "one or more" conditions. You can use this:
^(?=.*[a-z])(?=.*\.)[\w_.]+$
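One thing to be aware of: \w also accepts uppercase letters. If only lowercase letters should be allowed, as the question seems to say, a stricter character class may be what you want. A sketch under that assumption (LOOSE, STRICT and is_valid are made-up names):

```python
import re

# The pattern from above: \w lets uppercase letters through.
LOOSE = re.compile(r'^(?=.*[a-z])(?=.*\.)[\w_.]+$')

# Assumed stricter variant: only a-z, 0-9, underscore and dot are allowed.
STRICT = re.compile(r'^(?=.*[a-z])(?=.*\.)[a-z0-9_.]+$')

def is_valid(s):
    """At least one lowercase letter and one dot; only a-z, 0-9, _ and ."""
    return STRICT.match(s) is not None

for s in ('ab.c_1', 'abc', '12._', 'Ab.c', 'ab.c!'):
    print(s, bool(LOOSE.match(s)), is_valid(s))
```

Only 'Ab.c' is treated differently by the two patterns: the loose one accepts it, the strict one rejects it.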