I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.
I want to find the pattern of a regular expression from a character string. My goal is to be able to reuse this pattern to find a string in another context but checking the pattern.
from sting "1example4whatitry2do",
I want to find pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
So I can reuse this pattern to find this other example of sting 2eytmpxe8wsdtmdry1uo
I can do a loop on each caracter, but I hope there is a fast way
Thanks for your help !
You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase
lower_case = set(ascii_lowercase) # set for faster lookup
def find_regex(p):
cum = []
for c in p:
if c.isdigit():
cum.append("d")
elif c in lower_case:
cum.append("t")
else:
cum.append(c)
grp = groupby(cum)
return ''.join(f'\\{what}{{{how_many}}}'
if how_many>1 else f'\\{what}'
for what,how_many in ( (g[0],len(list(g[1]))) for g in grp))
pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The ternary in the formatting removes not needed {1} from the pattern.
See:
str.isdigit()
If you now replace '\t'with '[a-z]' your regex should fit. You could also replace isdigit check using a regex r'\d' or a in set(string.digits) instead.
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
with open(searchfile) as f:
pattern = "\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
for line in f:
match = re.search(pattern, line)
if match != None:
print match.group("sentence")
I am trying to extract every sentence that contains an acronym in parenthesis (essentially 2-4 letter all caps in parenthesis.
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.
Out: Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
But note that it is a particulary inefficient way, since the pattern starts with a very permissive subpattern. You can correct that a little by adding a kind of anchor at the begining:
(?:(?<=.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunatly, even with this anchor, the regex engine must check the two alternatives for the most of the characters of the string.
A better approach might be to find substrings starting with the acronym until the end of the sentence and dots, and then to extract substrings using the end offset of each results:
#!/usr/bin/python
import re
txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.'
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')
offset = 0
result = ''
for m in pattern.finditer(txt):
if (m.group(1)==None):
result += txt[offset:m.end()]
offset = m.end()
print result
Note: you can be sure that a dot stands for the end of a sentence, it can be something else.
a little more efficient pattern
([^.(]++\([^.)]++\)[^.)]++\.)
Demo
I want to split a text into it's single words using regular expressions. The obvious solution would be to use the regex \\b unfortunately this one does split words also on the hyphen.
So I am searching an expression doing exactly the same as the \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
#Matmarbon solution is already quite close, but not 100% fitting it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
Also not you but somebody who needs this for another purpose (i.e. inserting something) this is more of an equivalent to the \b-solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
I'm working on String Calculator code kata with Groovy.
There are a lot of scenarios that solve for achieve the solution:
I have:
//;\n1;2;3
//#\n1#2#3
//+\n1+2+3
//*\n1*2*3
//?\n1?2?3
I want:
1,2,3
My implementation:
String numbers = "//;\n1;2;3"
numbers.find(/\/\/\S[\n]/) { match ->
def delimeter = match[2]
numbers = numbers.minus(match).replaceAll(delimeter, ",")
}
With this solution I solved the first and second expressions, but I don't know how solve the others expressions.
java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 0
The problem is that we must also consider any symbol that match with the sintaxt of regular expressions like +, * or ?
Finally I have the solution:
String numbers = "//+\n1+2+3"
numbers.find(/(?s)\/\/(.*)\n/) { match ->
def delimeter = match[1] // also match[0][2]
numbers = numbers.minus(match[0]).replace(delimeter, ",")
}
An important point (?s):
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
Dotall mode can also be enabled via the embedded flag expression (?s)
But really the problem was here: .replace(delimeter, ",")
//(.)\n(\d)\1(\d)\1(\d)
Need to use links.
(.) - math thiw any character, and \1 - math thiw character on it\
For next example you can apply this: //\[(.*?)\]\\n(\d)\1(\d)\1(\d)
It math thiw
//[*]\n12**3
And last: //\[(.*?)\]\[(.*?)\]\\n(\d)\1(\d)\2(\d)
//[*][%%]\n1*2%%3
And finaly:
//\[(.*?)\](?:\[(.*?)\])?\\n(\d)\1(\d)(?:\2|\1)(\d)
I think it's can work ewerythere
P.S : (\d) you can replace what you want. I think you need (\d*)