Python 2.7 Regex Tokenizer Implementation Not Working [duplicate] - regex

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 5 years ago.
I created a regular expression to match tokens in a german text which is of type string.
My Regular expression is working as expected using regex101.com. Here is a link of my regex with an example sentence: My regex + example on regex101.com
So I implemented it in python 2.7 like this:
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+ # abbrevations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as seperators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''
def tokenize_german_text(text):
'''
Takes a text of type string and
tokenizes the text
'''
matchObject = re.findall(GERMAN_TOKENIZER, text)
pass
tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
Result:
When I was debugging this I found out that the matchObject is only a list containing 11 entries with empty characters. Why is it not working as expected and how can I fix this?

re.findall() collects only the matches in capturing groups (unless there are no capturing groups in your regex, in which case it captures each match).
So your regex matches several times, but every time the match is one where no capturing group is participating. Remove the capturing groups, and you'll see results. Also, place the - at the end of the character class unless you actually want to match the range of characters between : and _ (but not the - itself):
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
(?:[A-ZÄÖÜ]\.)+ # abbrevations including ÄÖÜ
|\d+(?:[.,]\d+)?[€$%]? # numbers, allowing commas as seperators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():_'!-] # matches special characters including !
'''
Result:
['Das', 'ist', 'ein', 'Deutscher', 'Text', '!', 'Er', 'enthält', 'auch', 'Währungen', ',', '10€']

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Regex function to find all and only 6 digit numeric string ignoring spaces if any any between [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I have HTML source page as text file.
I need to read file and find out only those numeric strings which have 6 continous digits and can have a space in between those 6 digits
Eg
209 016 - should be come up in search result and as 400013(space removed)
209016 - should also come up in search and unaltered as 209016
any numeric string more then 6 digits long should not come up in search eg 20901677,209016#223, 29016,
I think this can be achieved by regex but I was not able to
A soln in regex is more desirable but anything else is also welcome
To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or if you want to reject it when it's followed by an #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Regex demo.
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
"209 016\n"
"20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
print (match.group().replace(" ", ""))
Output:
209016
209016
Try it online.
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. it will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you have to replace other char I need to exclude, another one,... by the characters.

Regular Expression (.*)#(.*) [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
What does this regular expression stand for (.*)#(.*). I came to know that (.*) matches any character with period any number of times.
But I couldn't understand the meaning properly. Also, what does two of them separated by # mean?
.*#.* matches any string containing the # character
Example of strings that this pattern would match
#
#qe
asrrd#
qw3e#as112d
(.*)#(.*) would just return whatever is before and after the # character
Example:
for # would return two empty strings '' , ''
for #qe rule will return '' and 'qe'
for asrrd# would return 'asrrd' and ''
for qw3e#as112d would return 'qw3e' and 'as112d'
(.*)#(.*) can match any of the following:
#, .#., ..#., jbkbhjh...#...njbh ...
* means one or many characters.
so this regex means a # symbol enclosed by any number of chars
Also, what does two of them separated by # mean
Answer to this is: the # symbol is required for a text to match this regex

Parse value using Regex [duplicate]

This question already has answers here:
How to capture an arbitrary number of groups in JavaScript Regexp?
(5 answers)
Closed 7 years ago.
I have a long strings taken from a VCF file such as (These are truncated for example purpose):
chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;
chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;
I want to to write a single regex to return all values of FAO on a given line.
The valid format for FAO is: FAO=SomeNumber; or FAO=SomeNumber, SomeNumber, SomeNumber, etc...;
Is there a way to write a REGEX capture group that takes into account both a single value and an infinite number of values separated by a comma until you see a ';'?
I've tried
FAO=((([0-9]+);)|(([0-9]+),([0-9])+))
But it only takes into account up to 2 numbers and I need matcher group 1 to be the first value, matcher group 2 to be the second etc...
You can use a negated character class: [^;]+ This says to match any characters that are not a semicolon. Since it's a greedy match it will continue until it sees the first semicolon.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
alert(str.match(/(FAO=[^;]+)/)[1]);
});
From there you can edit the group match to only grab the values /FAO=([^;]+)/ and then you can split that value on the comma delimiter.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
alert(str.match(/FAO=([^;]+)/)[1].split(','));
});
As stated in this SO answer it's not possible in most languages to have an arbitrary number of group matches.
you could use a regex like this
FAO=([0-9]+(,[0-9]+)*);
the outer parentheses allow you to extract the value or values with the first matching group.
EDIT
considering that you want to capture the individual values with different matching groups this approach won't work (capturing groups inside * will only capture the last match). see the accepted answer to this question for a solution.
EDIT 2
see this demo based on that answer for an example of a pcre regex that will match each number with the same capturing group.
(?:FAO=|\G,)\K(\d+)
note that not all regex flavours support \G and \K. \G matches the end of the previous match (or the start of the string), and \K resets the start of current match.

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement