I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems to check only the first group of three letters and ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ matches from the start to the end of the string
[...] matches one of the given characters
...{3} repeats the preceding element three times
(...)* repeats the group in parentheses zero or more times
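As a quick check of the pattern above, here is a minimal sketch in Python. The helper name is_valid and the sample strings are made up for illustration. Note that $ also matches just before a trailing newline, so use \Z instead if that matters.

```python
import re

# Pattern from the answer above: one triple, then zero or more ",triple" groups.
PATTERN = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(text):
    """Return True if text consists only of comma-separated triples of a, b, c."""
    return PATTERN.match(text) is not None

for sample in ("abc,bca,cbb", "bcb", "abc,defx,df", "abc,"):
    print(sample, is_valid(sample))
```

The first two samples should be accepted; the last two should be rejected (a bad group and a trailing comma).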
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
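So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# anchor at the end as well, and only allow a comma when another triple follows
match = re.match(r'^([abc]{3}(,(?!$)|$))+$', data_string)
if match:
    print("data string is correct")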
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, at the start of the string, and does not care what comes after that. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
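To see what the original expression actually consumed, a small sketch like the following can help. The variable names loose and strict are made up for illustration.

```python
import re

text = "abc,defx,df"

# re.match only anchors at the start, so it stops after the first valid "abc,".
loose = re.match(r'([abc][abc][abc],)+', text)
print(loose.group(0))  # the part of the string that actually matched

# Requiring the end of the string after the last triple rejects the input.
strict = re.match(r'([abc]{3},)*[abc]{3}$', text)
print(strict)
```

The first match covers only the leading "abc,", which is why bool() of it is True even though the rest of the string is invalid.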
An alternative without using regex (albeit a brute-force way):
>>> import itertools
>>> def matcher(x):
...     # every possible triple over the letters a, b, c
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
Result:
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
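\Z means the end of the string. Its presence forces the match to extend to the very end of the string. Note that, because the comma is optional, this pattern also accepts a trailing comma (abc,) and triples with no comma between them (abcabc).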
By the way, I like Sonya's form too; in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
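The re.findall effect mentioned above can be reproduced with a short sketch using only the standard library. The sample string and variable names are made up for illustration.

```python
import re

text = "abc,bca,cbb"

# With a capturing group, findall returns only what the group captured
# (for a repeated group, its last repetition).
with_capture = re.findall(r'[abc]{3}(,[abc]{3})*', text)
print(with_capture)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'[abc]{3}(?:,[abc]{3})*', text)
print(without_capture)
```

Both patterns match the same text; only what findall reports differs.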
I want to derive a regular expression pattern from a character string. My goal is to reuse this pattern to find a string in another context while checking it against the pattern.
From the string "1example4whatitry2do",
I want to find a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
So I can reuse this pattern to find this other example string: 2eytmpxe8wsdtmdry1uo
I can loop over each character, but I hope there is a faster way.
Thanks for your help!
You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase
lower_case = set(ascii_lowercase) # set for faster lookup
def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)
    grp = groupby(cum)
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many>1 else f'\\{what}'
                   for what,how_many in ( (g[0],len(list(g[1]))) for g in grp))
pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The ternary in the formatting drops the unneeded {1} from the pattern.
See:
str.isdigit()
If you now replace '\t' with '[a-z]', your regex should fit. You could also replace the isdigit check with a regex r'\d' or with c in set(string.digits).
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
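To tie this back to the original goal of reusing the pattern on a second string, here is a variant sketch that emits [0-9] and [a-z] directly and then checks the other example from the question. The function names char_class and infer_pattern are made up, and the sketch assumes only ASCII digits and lowercase letters need their own classes.

```python
import re
from itertools import groupby
from string import ascii_lowercase, digits

def char_class(c):
    """Map a character to the regex fragment that should stand in for it."""
    if c in digits:
        return "[0-9]"
    if c in ascii_lowercase:
        return "[a-z]"
    return re.escape(c)  # anything else has to match literally

def infer_pattern(sample):
    """Build a regex describing the runs of digits and letters in sample."""
    parts = []
    for cls, run in groupby(sample, key=char_class):
        count = len(list(run))
        parts.append(cls if count == 1 else f"{cls}{{{count}}}")
    return "".join(parts)

pattern = infer_pattern("1example4whatitry2do")
print(pattern)
print(bool(re.fullmatch(pattern, "2eytmpxe8wsdtmdry1uo")))
```

re.fullmatch is used so that the second string has to follow the inferred shape from start to end.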
I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once.
Suppose document.txt consists of "abcd abbcd"
What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
You could check if a character appears more than once and then negate the result (in your source code):
split your document into words
check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
negate the result
Explanation:
([a-z]) matches any single lowercase letter
[a-z]* allows zero or more letters after the one matched above
\1 is a back reference to the character found at ([a-z])
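The three steps above (split, check, negate) can be sketched in Python as follows. The function name words_without_repeats is made up, and the sketch only filters out repeated letters; it does not restrict words to a particular set of pattern letters.

```python
import re

# A letter, any further letters, then the same letter again.
REPEATED = re.compile(r'([a-z])[a-z]*\1')

def words_without_repeats(text):
    """Return the words in text in which no letter occurs twice."""
    return [w for w in text.split() if not REPEATED.search(w)]

print(words_without_repeats("abcd abbcd"))
```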
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle
parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()
for line in sys.stdin:
    for candidate in line.split():
        present = dict(zip(args.p, cycle((0,)))) # initialize dict of letter:count
        for ch in candidate:
            if ch in present:
                present[ch] += 1
        if all(x <= 1 for x in present.values()):
            print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.
The easiest way to define such a regex is to list all possible permutations of the available letters (and use ? to avoid having to list the shorter sequences). For three letters (a,b,c) the regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
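If you still want to go this way, the alternation does not have to be typed by hand. A minimal sketch, assuming the letters are plain alphabetic characters with no special meaning in a regex (the function name at_most_once_regex is made up):

```python
import re
from itertools import permutations

def at_most_once_regex(letters):
    """Join every ordering of the letters, with each letter made optional."""
    alternatives = ("".join(ch + "?" for ch in perm)
                    for perm in permutations(letters))
    return "|".join(alternatives)

pattern = at_most_once_regex("abc")
print(pattern)

# fullmatch makes the whole word fit one of the alternatives
for word in ("abc", "cab", "ab", "aab"):
    print(word, bool(re.fullmatch(pattern, word)))
```

The number of alternatives grows as N!, so this is only practical for a handful of letters.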
The solution without regexps depends on your toolset. I would write a simple program that processes the input text word by word. At the start of each word a BitSet is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word, if the bit that corresponds to the current letter is zero, it is set to one. If an already-set bit is encountered, or the letter is not in the alphabet, the word is skipped. If the word is traversed completely, it is "valid".
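A minimal Python sketch of that bit-set idea, using a plain integer as the bit set. The function name and the sample alphabet are made up for illustration.

```python
def uses_each_letter_at_most_once(word, alphabet):
    """Check word against alphabet using one bit per allowed letter."""
    seen = 0  # bit i is set once alphabet[i] has been encountered
    for ch in word:
        index = alphabet.find(ch)
        if index == -1:
            return False  # letter is not in the alphabet
        bit = 1 << index
        if seen & bit:
            return False  # letter already occurred in this word
        seen |= bit
    return True

text = "abcd abbcd"
valid = [w for w in text.split() if uses_each_letter_at_most_once(w, "abcd")]
print(valid)
```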
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
first grep: a word composed of the pattern letters...
second grep: ... with no repeated letters ...
awk: ...whose length is the number of letters
I am trying to match the following sample:
ZU2A ZS6D-9 ZT0ER-7 ZR6PJH-12
It is a combination of letters and numbers (alphanumeric).
Here is an explanation:
It will always start with a capital (uppercase) Z
Followed always by only ONE(1) of R,S,T or U "[R|S|T|U]"
Followed always by only ONE(1) number "[0-9]"
Followed always by a minimum of ONE(1) and optionally a maximum of THREE(3) capital (uppercase) letters like this [A-Z]{1,3}
Optionally followed by "-" and a minimum of ONE(1) and a maximum of TWO(2) numbers
At the moment I have this:
Z[R|S|T|U][0-9][A-Z]{1,}(\-)?([0-9]{1,3})
But that does not seem to catch all the samples.
EDIT: Here is a sample of a complete string:
ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!
Any help would be appreciated.
Thank You
Danny
Your main problem is that the whole optional part should be surrounded by one set of parentheses marked with ? (=optional). All in all, you want
Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?
A couple of extra notes:
In a character class, you can simply list the characters. So for rule 2 (one of R, S, T or U) you want either [RSTU] or (?:R|S|T|U).
A group in the form of (?:example) instead of (example) prevents the sub-expression from being returned as a match. It has no effect on which inputs are matched.
You don't need to escape - with a backslash outside of a character class.
Here's an example test case script in Python:
import re
s = r'Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?'
rex = re.compile(s)
for test in ('ZU2A', 'ZS6D-9', 'ZT0ER-7', 'ZR6PJH-12'):
    assert rex.match(test), test
long_test = 'ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!'
found = rex.findall(long_test)
assert found == ['ZU0D', 'ZT1ER', 'ZS3PJ-2', 'ZR5STU-12'], found
I would like to allow these characters [a-z]+\.+[0-9]*\_* (the string must contain one or more lowercase letters (a-z) and one or more periods (.), and may contain zero or more digits (0-9) and zero or more underscores (_)), but no others.
I have tried multiple ways without success:
import re
iStrings = str(input('Enter string? '))
iMatch = re.findall(r'[a-z]+\.+[0-9]*\_*', iStrings)
iiMatch = re.findall(r'[~`!#$%^&*()-+={}\[]|\;:\'"<,>.?/]', iStrings)
iiiMatch = iMatch != iiMatch
if iiiMatch:
    print(':)')
else:
    print(':(')
Another example:
import re
iStrings = str(input('Enter string? '))
iMatch = re.findall(r'[a-z]+\.+[0-9]*\_*', iStrings) not "[~`!#$%^&*()-+={}\[]|\;:\'"<,>.?/]" in iStrings
if iMatch:
    print(':)')
else:
    print(':(')
Any help would be much appreciated.
Edit: added clarification.
Edit: For additional information: https://forums.uberent.com/threads/beta-mod-changes.51520/page-8#post-939265
allow these characters [a-z]+\.+[0-9]*\_*
First off, [a-z]+ is not "a" character. Neither is [0-9]* nor \_*
I am assuming that you mean you want to allow letters, digits, underscores, dots, plusses and asterisks.
Try this:
^[\w*.+]+$
The \w already matches [a-z], [0-9] and _ (as well as uppercase letters)
The anchors ^ and $ ensure that nothing else is allowed.
From your question I wasn't clear if you wanted to allow a + character to match. If not, remove it from the character class: ^[\w*.]+$. Likewise, remove the * if it isn't needed.
In code:
if re.search(r"^[\w*.+]+$", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
EDIT following your comment:
For a string that must contain one or more letters, AND one or more dots, AND zero or more underscores, AND zero or more digits, we need lookaheads to enforce the "one or more" conditions. You can use this:
^(?=.*[a-z])(?=.*\.)[\w_.]+$
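A small sketch for trying the lookahead approach in Python. One assumption that differs from the pattern above: because \w also matches uppercase letters, the sketch swaps in the stricter class [a-z0-9_.] to match the "no others" requirement from the question. The sample strings are made up.

```python
import re

# The lookaheads require a lowercase letter and a dot somewhere in the string;
# the character class then limits every character to the allowed set.
# Assumption: only a-z, 0-9, "_" and "." are allowed (stricter than \w).
PATTERN = re.compile(r'^(?=.*[a-z])(?=.*\.)[a-z0-9_.]+$')

for candidate in ("abc.", "a.1_", "abc", "123.", "Abc."):
    print(candidate, bool(PATTERN.match(candidate)))
```

The first two candidates should pass; the others lack a dot, lack a letter, or contain an uppercase character.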