Regex negative lookaround with optional whitespace

I am trying to find numbers that are not followed by certain words. I do this using regular expressions in Python 3. My guess is that negative lookarounds are needed, but I'm struggling because of the optional whitespace. See the following example:
'200 word1 some 50 foo and 5foo 30word2'
Note that in reality word1 and word2 can be replaced by many different words, which makes it much harder to search for a positive match on these words. It would therefore be easier to exclude the numbers followed by foo. The expected result is:
[200, 30]
My try:
import re
s = '200 foo some 50 bar and 5bar 30foo'
pattern = r"[0-9]+\s?(?!foo)"
re.findall(pattern, s)
Results in
['200', '50 ', '5', '3']

You may use
import re
s = '200 word1 some 50 foo and 5foo 30word2'
pattern = r"\b[0-9]+(?!\s*foo|[0-9])"
print(re.findall(pattern, s))
# => ['200', '30']
See the Python demo.
Details
\b - a word boundary
[0-9]+ - 1+ ASCII digits only
(?!\s*foo|[0-9]) - a negative lookahead: the digits must not be immediately followed by
\s*foo - 0+ whitespaces and foo string
| - or
[0-9] - an ASCII digit.
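Since the question mentions that the excluded word may vary, the same lookahead can be built from a list of words. This is only a sketch: the helper name numbers_not_followed_by and the excluded_words parameter are made up for illustration, and it assumes the list is non-empty.

```python
import re

def numbers_not_followed_by(text, excluded_words):
    # Hypothetical helper: escape each word so it is matched literally,
    # then join the words into one alternation inside the lookahead.
    alternation = "|".join(re.escape(word) for word in excluded_words)
    # The [0-9] alternative stops the engine from backtracking into
    # a shorter prefix of the same number.
    pattern = rf"\b[0-9]+(?!\s*(?:{alternation})|[0-9])"
    return re.findall(pattern, text)

s = '200 word1 some 50 foo and 5foo 30word2'
print(numbers_not_followed_by(s, ["foo"]))           # => ['200', '30']
print(numbers_not_followed_by(s, ["foo", "word1"]))  # => ['30']
```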

You should be using the pattern \b[0-9]+(?!\s*foo\b)(?=\D), which finds all numbers that are not followed by optional whitespace and the word foo. The trailing (?=\D) requires a non-digit after the number, so the regex engine cannot backtrack and match only part of a number.
import re
s = '200 word1 some 50 foo and 5foo 30word2'
matches = re.findall(r'\b[0-9]+(?!\s*foo\b)(?=\D)', s)
print(matches)
This prints:
['200', '30']
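One thing to watch with (?=\D): it needs an actual non-digit character after the number, so a number at the very end of the string is not matched. The sketch below uses a made-up test string and swaps in (?!\d) as one possible alternative; that substitution is not part of the answer above.

```python
import re

# Made-up input where the last number sits at the end of the string.
s_end = 'bar 50 foo 70'

# (?=\D) demands a non-digit character after the number; there is none after 70.
with_lookahead = re.findall(r'\b[0-9]+(?!\s*foo\b)(?=\D)', s_end)

# (?!\d) only demands "no digit next", which is also true at the end of the string.
with_negative = re.findall(r'\b[0-9]+(?!\s*foo\b)(?!\d)', s_end)

print(with_lookahead)  # => []
print(with_negative)   # => ['70']
```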

Related

How to replace spaces in "§ 1", "§s 1, 2, [indeterminate number]" all with underscores in regex?

I have a simple regex pattern that replaces a space between paragraph sign and a number with an underscore:
Find: § (\d+)
Replace: §_$1
It works fine and turns "§ 5" into "§_5", but there are cases where more numbers follow the paragraph sign, so how can I also turn "§s 23, 12" into "§s_1,_2_and_3"? Is it possible with regex?
I have tried modifying the original regex pattern, but (§|§-d) (\d+) finds only some of the cases, and I have no idea how to write the replacement.
Thanks in advance!
You can use
re.sub(r'§\w*\s+\d+(?:,\s*\d+)*', lambda x: re.sub(r'\s+', '_', x.group()), text)
The inner re.sub(r'\s+', '_', x.group()) replaces each chunk of one or more whitespace characters in the match with a single _ char.
See the regex demo, details:
§ - a § char
\w* - zero or more word chars
\s+ - one or more whitespaces
\d+ - one or more digits
(?:,\s*\d+)* - zero or more sequences of a comma, zero or more whitespaces and one or more digits.
See the Python demo:
import re
text = "§s 23, 12"
print(re.sub(r'§\w*\s+\d+(?:,\s*\d+)*', lambda x: re.sub(r'\s+', '_', x.group()), text))
# => §s_23,_12
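The question's expected output also contains a trailing "and 3", which the pattern above does not cover. A possible extension, offered as a sketch only, adds an optional group for that tail. The names SECTION_RE and underscore_sections and the sample sentence are invented for illustration.

```python
import re

# Same pattern as above, plus an optional " and <digits>" tail (an assumption
# about the input format, not something the answer above handles).
SECTION_RE = re.compile(r'§\w*\s+\d+(?:,\s*\d+)*(?:\s+and\s+\d+)?')

def underscore_sections(text):
    # Replace every whitespace run inside each matched section reference with "_".
    return SECTION_RE.sub(lambda m: re.sub(r'\s+', '_', m.group()), text)

sample = "see § 5 and §s 1, 2 and 3 of the act"
print(underscore_sections(sample))
# => see §_5 and §s_1,_2_and_3 of the act
```

Whitespace outside the matched references is left untouched, because only the matched substring is passed to the inner re.sub.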

How to print Hindi character from a string in python using regular expression?

I'm using regex in Python and trying to extract Hindi words from the given string and print them, but I'm not able to do so. I'm trying to extract जनवरी12 and जनवरी22 from the string. The code should search for a phrase that starts with जनवरी (or any Hindi character) and ends with 12 (or any number). Here is the code:
import re
string = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
mo = re.compile(r'[^(^a-zA-Z-0-9)]+\d+')
print(mo.findall(string))
Output:
[' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']
I know that [^abc] matches any character that isn’t between the brackets and tried to achieve the same with [^(^a-zA-Z-0-9)]+ but the output is not what I expected.
Expected output:
जनवरी12, जनवरी22
Can anyone explain how this should be done, and how to match the start and end in Python's regex?
I think you just need a pattern that matches 1+ letters (with 0 or more diacritics after each) and then 1+ digits.
See a Python demo that outputs ['जनवरी12', 'जनवरी22']:
import re
s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
combining_marks = '[\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u07FD\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08D3-\u08E1\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u09FE\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AFA-\u0AFF\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C04\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D00-\u0D03\u0D3B\u0D3C\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u1885\u1886\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C5\uA8E0-\uA8F1\uA8FF\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F\U000101FD\U000102E0\U00010376-\U0001037A\U00010A01-\U00010A03\U00010A05\U00010A06\U00010A0C-\U00010A0F\U00010A38-\U00010A3A\U00010A3F\U00010AE5\U00010AE6\U00010D24-\U00010D27\U00010F46-\U00010F50\U00011000-\U00011002\U00011038-\U00011046\U0001107F-\U00011082\U000110B0-\U000110BA\U00011100-\U00011102\U00011127-\U00011134\U00011145\U00011146\U00011173\U00011180-\U00011182\U000111B3-\U000111C0\U000111C9-\U000111CC\U0001122C-\U00011237\U0001123E\U000112DF-\U000112EA\U00011300-\U00011303\U0001133B\U0001133C\U0001133E-\U00011344\U00011347\U00011348\U0001134B-\U0001134D\U00011357\U00011362\U00011363\U00011366-\U0001136C\U00011370-\U00011374\U00011435-\U00011446\U0001145E\U000114B0-\U000114C3\U000115AF-\U000115B5\U000115B8-\U000115C0\U000115DC\U000115DD\U00011630-\U00011640\U000116AB-\U000116B7\U0001171D-\U0001172B\U0001182C-\U0001183A\U000119D1-\U000119D7\U000119DA-\U000119E0\U000119E4\U00011A01-\U00011A0A\U00011A33-\U00011A39\U00011A3B-\U00011A3E\U00011A47\U00011A51-\U00011A5B\U00011A8A-\U00011A99\U00011C2F-\U00011C36\U00011C38-\U00011C3F\U00011C92-\U00011CA7\U00011CA9-\U00011CB6\U00011D31-\U00011D36\U00011D3A\U00011D3C\U00011D3D\U00011D3F-\U00011D45\U00011D47\U00011D8A-\U00011D8E\U00011D90\U00011D91\U00011D93-\U00011D97\U00011EF3-\U00011EF6\U00016AF0-\U00016AF4\U00016B30-\U00016B36\U00016F4F\U00016F51-\U00016F87\U00016F8F-\U00016F92\U0001BC9D\U0001BC9E\U0001D165-\U0001D169\U0001D16D-\U0001D172\U0001D17B-\U0001D182\U0001D185-\U0001D18B\U0001D1AA-\U0001D1AD\U0001D242-\U0001D244\U0001DA00-\U0001DA36\U0001DA3B-\U0001DA6C\U0001DA75\U0001DA84\U0001DA9B-\U0001DA9F\U0001DAA1-\U0001DAAF\U0001E000-\U0001E006\U0001E008-\U0001E018\U0001E01B-\U0001E021\U0001E023\U0001E024\U0001E026-\U0001E02A\U0001E130-\U0001E136\U0001E2EC-\U0001E2EF\U0001E8D0-\U0001E8D6\U0001E944-\U0001E94A\U000E0100-\U000E01EF]'
mo = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))
print(mo.findall(s))
Note that r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks) creates a pattern that matches
(?:[^\W\d_]{}*)+ - one or more occurrences of
[^\W\d_] - any Unicode base letter (if you want to disallow ASCII letters, add (?![A-Za-z]) right before this pattern)
{}* - zero or more occurrences of combining_marks
\d+ - 1+ digits
So, if you want to avoid matching ASCII letters, in the above code, use
r'(?:(?![A-Za-z])[^\W\d_]{}*)+\d+'
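As an alternative to hardcoding the list of combining marks, the character class can be generated from the standard unicodedata module. This is a sketch under assumptions: the helper name build_combining_marks_class is made up, and the generated set follows the Unicode version bundled with the running Python, so it may differ slightly from the literal list above.

```python
import re
import sys
import unicodedata

def build_combining_marks_class():
    # Hypothetical helper: collect every code point whose general category
    # starts with "M" (Mn, Mc, Me) and wrap them in a character class.
    marks = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("M")
    ]
    return "[" + "".join(re.escape(ch) for ch in marks) + "]"

generated_marks = build_combining_marks_class()
# Same pattern shape as above: letters (each optionally followed by marks), then digits.
mo_generated = re.compile(r"(?:[^\W\d_]{}*)+\d+".format(generated_marks))

text_hi = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
print(mo_generated.findall(text_hi))
```

Scanning the whole code point range takes a moment, so it is worth building the class once and reusing the compiled pattern.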

re.findall() equivalent to a string.split() loop with inner search

Is there a regex string <regex> such that re.findall(r'<regex>', doc) will return the same result as the following code?
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
>>> new_doc
['is', 'if']
Perhaps, your current way of getting the matches is the best.
You can't do that without some additional operation, e.g. list comprehension, because re.findall with a pattern that contains a capturing group outputs the captured substrings in the resulting list.
Thus, you may either add an outer capturing group and use re.findall (taking the first item of each result), or use re.finditer and take the whole match value (group 0) of each match, using
(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+
See this regex demo.
Details
(?<!\S) - a whitespace or start of string must be immediately to the left of the current location
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - immediately to the right of the current location, after any 0+ non-whitespace chars, there cannot be 3 identical non-whitespace chars in a row, nor a char that is a _, a digit or any non-word char other than whitespace
\S+ - 1+ non-whitespace chars.
See the Python demo:
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']
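To check that the single-pattern version agrees with the original loop, both can be wrapped in functions and compared on the sample input. This is only a sanity check on this one string, not a proof of equivalence; the names TOKEN_RE, filter_with_loop and filter_with_regex are invented for the sketch.

```python
import re

TOKEN_RE = re.compile(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+')

def filter_with_loop(doc):
    # The question's approach: split on whitespace, then reject unwanted words.
    kept = []
    for word in re.split(r'\s+', doc.strip()):
        if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
            kept.append(word)
    return kept

def filter_with_regex(doc):
    # The answer's approach: a single pattern, taking group 0 of each match.
    return [m.group(0) for m in TOKEN_RE.finditer(doc)]

sample_doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
print(filter_with_loop(sample_doc))   # => ['is', 'if']
print(filter_with_regex(sample_doc))  # => ['is', 'if']
```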

Regex not able to identify emails with special characters?

Problem:
I wrote a regex to identify email addresses in text, but it does not recognize emails with a special character such as -. So I modified the regex to match emails with special characters. Now it does not match normal emails.
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#kleintoys.com"
NOT_DETECT = "bilgi#klei-ntoys.com"
Modified:
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-+\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#klei-ntoys.com"
NOT_DETECT = "bilgi#kleintoys.com"
Is there a regex that combines these two so that it matches both emails,
like
bilgi#klei-ntoys.com
bilgi#kleintoys.com
You could use a much looser regex.
Here is a suggestion that matches both addresses:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
Let's break it down:
[a-zA-Z\d] Match any alphanumeric character...
+ ... at least once
# Match the # character (the separator used in these addresses)
.+ Match any character at least once...
\. ... before a dot
[a-zA-Z\d]{,3} Then match at most three alphanumeric characters (zero to three)
Checking with Python (the patterns are written as raw strings so that \d and \. are not treated as string escapes):
>>> import re
>>> s = "bilgi#kle-intoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 20), match='bilgi#kle-intoys.com'>
>>> s = "bilgi#kleintoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 19), match='bilgi#kleintoys.com'>
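To make the looseness of this pattern concrete, here is a short sketch using re.fullmatch. The extra test addresses are made up; the point is that {,3} means zero to three characters, so the part after the last dot may even be empty, while a longer ending is only matched partially by re.match.

```python
import re

loose = re.compile(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}")

# fullmatch requires the whole string to fit the pattern.
full_com = bool(loose.fullmatch("bilgi#kleintoys.com"))
full_empty = bool(loose.fullmatch("bilgi#kleintoys."))      # {,3} also allows zero chars
full_info = bool(loose.fullmatch("bilgi#kleintoys.info"))   # four chars after the dot

# match only anchors at the start, so it stops after three chars of "info".
prefix = loose.match("bilgi#kleintoys.info").group()

print(full_com, full_empty, full_info)  # => True True False
print(prefix)                           # => bilgi#kleintoys.inf
```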
To make your pattern work, you need to add a part that will match 0+ sequences of - and then 1 or more word chars, (?:-\w+)*:
"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)"?
                         ^^^^^^^^^
See the regex demo.
Details
"? - an optional "
([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+) - Group 1 (what re.findall will output):
[-a-zA-Z0-9.`?{}]+ - 1 or more chars defined in the character class: -, ASCII letters, digits, ., `, ?, {, } (note you might want to restrict this part to start with any letter and then also match _, like [^\W\d_][-\w.`?{}]*)
# - a #
\w+ - 1 or more letters/digits/_
(?:-\w+)* - 0+ sequences of - and then 1 or more letters/digits/_
\. - a dot
\w+ - 1 or more letters/digits/_
"? - an optional "
Python demo:
import re
rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"
s = """ "bilgi#kleintoys.com" and bilgi#klei-ntoys.com"""
print(re.findall(rx, s))
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com']
Use * instead of +:
r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"
A star after the hyphen matches zero or more occurrences, whereas the plus in your pattern requires at least one hyphen. By the way, instead of \-* you may use [-]*; other special characters can be added inside the square brackets alongside -.
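The two fixes suggested in this thread behave differently once a domain contains more than one hyphen. The comparison below is a sketch with a made-up third address; it shows that \-* allows only one run of hyphens between two word chunks, while (?:-\w+)* allows any number of hyphenated chunks.

```python
import re

star_rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"
group_rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"

# The third address is invented to show the multi-hyphen case.
emails = "bilgi#kleintoys.com bilgi#klei-ntoys.com bilgi#my-toy-shop.com"

star_hits = re.findall(star_rx, emails)
group_hits = re.findall(group_rx, emails)

print(star_hits)
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com']
print(group_hits)
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com', 'bilgi#my-toy-shop.com']
```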

Regex, sub out all alphabets

I have a situation where I need to keep only the numbers and the dash, like
2007-24. I know how to use regular expressions to replace numbers, but
how would you remove all letters while keeping the dash between the numbers?
Input: "CLOSED ORD NO 2007-24"
re.sub("[/-/^0-9/-]", '', self.text, flags=re.M)
You may use
re.sub(r'(\d+-\d+)|.', r'\1', self.text, flags=re.S)
See the regex demo
Regex details
(\d+-\d+) - Group 1: one or more digits, -, 1+ digits
| - or
. - any one char
The \1 is a backreference to the Group 1 value (to keep it in the result).
See Python demo:
import re
s = "CLOSED ORD NO 2007-24"
print( re.sub(r"(\d+-\d+)|.", r'\1', s, flags=re.S) )
# => 2007-24
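If the goal is only to pull out the digits-dash-digits part, extracting it directly may be simpler than deleting everything else. The sketch below shows that alternative next to the re.sub approach; the helper name extract_digit_dash_runs is made up. The re.sub version relies on an unmatched group in the replacement expanding to an empty string, which is the behavior in Python 3.5 and later.

```python
import re

def extract_digit_dash_runs(text):
    # Hypothetical helper: return every "digits-digits" run found in the text.
    return re.findall(r"\d+-\d+", text)

order_text = "CLOSED ORD NO 2007-24"

extracted = extract_digit_dash_runs(order_text)
# Keep-or-drop idiom from the answer: group 1 is kept, any other char is dropped.
substituted = re.sub(r"(\d+-\d+)|.", r"\1", order_text, flags=re.S)

print(extracted)    # => ['2007-24']
print(substituted)  # => 2007-24
```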