Regex not able to identify emails with special characters? - regex

Problem:
I wrote a regex to identify email addresses in text, but it does not recognize addresses that contain a special character such as -. So I modified the regex to match addresses with special characters, and now it no longer matches normal addresses.
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#kleintoys.com"
NOT_DETECT = "bilgi#klei-ntoys.com"
Modified:
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-+\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#klei-ntoys.com"
NOT_DETECT = "bilgi#kleintoys.com"
Is there a single regex that combines these two and matches both addresses, like:
bilgi#klei-ntoys.com
bilgi#kleintoys.com

You could use a much looser regex.
Here is a suggestion that matches both addresses:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
Let's break it down:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
[a-zA-Z\d] Match any alphanumerical character...
+ ... at least once
# Match the literal # separator (it sits where the at sign would be in these addresses)
.+ Match any character at least once...
\. ... before a dot
[a-zA-Z\d]{,3} Then match at most three alphanumerical characters (zero to three)
Checking with Python:
>>> import re
>>> s = "bilgi#kle-intoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 20), match='bilgi#kle-intoys.com'>
>>> s = "bilgi#kleintoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 19), match='bilgi#kleintoys.com'>
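As a side check on the quantifier, here is a minimal sketch showing that {,3} accepts anywhere from zero to three characters. The last two sample strings and the use of re.fullmatch (which requires the whole string to match) are my own additions for illustration, not part of the answer above.

```python
import re

# Loose pattern from the answer above; {,3} means "between 0 and 3 repetitions".
loose = re.compile(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}")

samples = [
    "bilgi#kleintoys.com",   # plain domain
    "bilgi#klei-ntoys.com",  # hyphenated domain
    "bilgi#x.",              # hypothetical input: nothing after the dot
    "bilgi#kleintoys.info",  # hypothetical input: four characters after the dot
]

# fullmatch only succeeds when the entire string matches the pattern.
results = {s: bool(loose.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

With re.match instead of re.fullmatch, the four-character case would still report a match, just a truncated one, so the loose pattern is best treated as a rough filter.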

To make your pattern work, you need to add a part that will match 0+ sequences of - and then 1 or more word chars, (?:-\w+)*:
"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)"?
                         ^^^^^^^^^
See the regex demo.
Details
"? - an optional "
([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+) - Group 1 (what re.findall will output):
[-a-zA-Z0-9.`?{}]+ - 1 or more chars defined in the character class: -, ASCII letters, digits, ., `, ?, {, } (note you might want to restrict this part to start with any letter and then also match _, like [^\W\d_][-\w.`?{}]*)
# - a #
\w+ - 1 or more letters/digits/_
(?:-\w+)* - 0+ sequences of - and then 1 or more letters/digits/_
\. - a dot
\w+ - 1 or more letters/digits/_
"? - an optional "
Python demo:
import re
rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"
s = """ "bilgi#kleintoys.com" and bilgi#klei-ntoys.com"""
print(re.findall(rx, s))
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com']
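Because (?:-\w+)* repeats, the same pattern should also cope with more than one hyphen in the domain. A small sketch, where the helper name find_addresses and the third sample address are hypothetical:

```python
import re

# Same pattern as in the demo above, compiled once for reuse.
RX = re.compile(r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?")


def find_addresses(text):
    """Return every Group 1 capture of RX found in text."""
    return RX.findall(text)


found = find_addresses('bilgi#kleintoys.com, "bilgi#klei-ntoys.com", bilgi#a-b-c.org')
print(found)
```

The optional quotes sit outside Group 1, so quoted and unquoted addresses come back in the same form.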

Use * instead of +:
r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"
A star after the hyphen matches zero or more occurrences, whereas the plus you have requires at least one hyphen. BTW, instead of \-* you may use [-]*. Other special characters besides - can be added between the square brackets.
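To see where the star variant and the (?:-\w+)* variant from the other answer differ, here is a rough comparison. The last two sample addresses are made up for illustration, and re.fullmatch is used so that partial matches do not count.

```python
import re

# Variant from this answer: hyphens allowed at a single spot in the domain.
star = re.compile(r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?")
# Variant with a repeatable "-word" group, for comparison.
group = re.compile(r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?")

samples = [
    "bilgi#kleintoys.com",
    "bilgi#klei-ntoys.com",
    "bilgi#a-b-c.com",  # hypothetical: hyphens in two places
    "bilgi#k.com",      # hypothetical: one-character domain label
]

star_results = {s: bool(star.fullmatch(s)) for s in samples}
group_results = {s: bool(group.fullmatch(s)) for s in samples}
for s in samples:
    print(s, star_results[s], group_results[s])
```

Both variants accept the two addresses from the question. The star variant rejects the last two samples because \w+\-*\w+ needs at least two word characters and allows hyphens in only one position.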

Related

How to use lookahead and $ with Regex

I am trying to get the name of the resource. I will share the regexr URL with you.
My actual regular expression: ([^/]+)(?=\..*)
My example: https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg
I'm trying to get just 5oonz9
I tried to include $, but I don't know why it doesn't work
You can use:
^.+\/(.+)\..+$
^.+ - From the start, match as many characters as possible
\/ - Match a literal /.
(.+) - Match one or more characters and capture them in a group
\. - Match a literal .
.+$ - Match one or more characters at the end of the string (the extension)
Live demo here.
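A minimal Python sketch of this pattern applied to the URL from the question. The slash is left unescaped because Python does not need it escaped, and the variable names are only illustrative.

```python
import re

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

# ^.+/ is greedy, so it runs up to the last slash that still lets the rest match.
m = re.search(r"^.+/(.+)\..+$", url)
name = m.group(1) if m else None
print(name)  # 5oonz9
```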
You don't need a capture group, just a match:
(?<=\/)[^\/.]+(?=\.[^\/.]+$)
We can write the expression in free-spacing mode to make it self-documenting:
(?<=          # begin a positive lookbehind
  \/          # match '/'
)             # end the positive lookbehind
[^\/.]+       # match one or more characters other than '/' and '.'
(?=           # begin a positive lookahead
  \.          # match '.'
  [^\/.]+     # match one or more characters other than '/' and '.'
  $           # match end of string
)             # end the positive lookahead
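In Python, free-spacing mode is enabled with the re.VERBOSE flag. The following is a sketch of the same lookaround pattern under that flag. The slashes are left unescaped, which Python allows, and the variable names are illustrative.

```python
import re

pattern = re.compile(r"""
    (?<=/)          # positive lookbehind: the match must be preceded by '/'
    [^/.]+          # one or more characters other than '/' and '.'
    (?=             # positive lookahead:
        \.[^/.]+    #   a '.' followed by the extension
        $           #   and then the end of the string
    )
""", re.VERBOSE)

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

m = pattern.search(url)
stem = m.group(0) if m else None
print(stem)  # 5oonz9
```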
You should not use a regex for this, however, as Python provides os.path:
import os
# "url" rather than "str", so the built-in str type is not shadowed
url = 'https://res-3.cloudinary.com/ngxcoder/image/'\
      'upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
base = os.path.basename(url)
print(os.path.splitext(base)[0])
#=> "5oonz9"
Here base #=> "5oonz9.jpg".
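If the URL might carry a query string, one possible variation (not from the answer above) is to parse it first with urllib.parse and take the stem of the path with pathlib. The helper name resource_name and the query string in the second call are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resource_name(url):
    """Return the last path segment of url without its final extension."""
    # urlparse(...).path drops the scheme, host, query string and fragment.
    return PurePosixPath(urlparse(url).path).stem


url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

plain = resource_name(url)
with_query = resource_name(url + "?w=100")  # hypothetical query string
print(plain, with_query)
```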
There are many ways. Here are a couple using Python:
# using a regex:
>>> import re
>>> file_name = 'https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
>>> regexpr = r".*/([^\/]+)\.jpg$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9'
# to get the full file name:
>>> regexpr = r".*/([^\/]+)$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9.jpg'
# if interested, here is one using split and taking the last element:
>>> file_name.split("/")[-1].split(".")[0]
'5oonz9'
I found a more straightforward solution thanks to other answers:
([^\/]+)(?=\.[^\/.]+$)
Explanation:
([^\/]+) capture one or more characters other than '/'
(?=\. look ahead (without consuming) for a '.'
[^\/.]+ followed by one or more characters other than '/' and '.' (this was the key!)
$) and then the end of the string
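The two lookahead patterns in this thread behave differently when the file name contains more than one dot. A small sketch to illustrate this, where the helper first_match and the archive.tar.gz sample are made up for the comparison:

```python
import re

# Pattern from this answer: the name part may itself contain dots.
with_group = re.compile(r"([^/]+)(?=\.[^/.]+$)")
# Lookaround-only pattern from the earlier answer: the name may not contain dots.
lookaround_only = re.compile(r"(?<=/)[^/.]+(?=\.[^/.]+$)")


def first_match(rx, s):
    """Return the text of the first match of rx in s, or None."""
    m = rx.search(s)
    return m.group(0) if m else None


simple = "blog-images/5oonz9.jpg"
multi = "blog-images/archive.tar.gz"  # hypothetical multi-dot name

print(first_match(with_group, simple), first_match(lookaround_only, simple))
print(first_match(with_group, multi), first_match(lookaround_only, multi))
```

For the URL in the question both give 5oonz9. For the multi-dot name, the first pattern strips only the last extension and the second finds no match at all.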

why adding group to my regex changes what it catches

I have the line:
[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |
I want to get the first word: asos-qa, so I tried this regex: ^\[\S*?(:|]) and it gets me: [asos-qa:.
So in order to get only the word without the other characters I tried to add a group (python syntax): ^\[(?P<app_id>\S*)?(:|]) but for some reason it returns [asos-qa:2021:5].
What am I doing wrong?
Your ^\[(?P<app_id>\S*)?(:|]) regex returns [asos-qa:2021:5] because \S* greedily matches zero or more non-whitespace chars, up to the last available : or ] in the current chunk of non-whitespace chars. The ? you used applies to the whole (?P<app_id>\S*) group, so it makes the group optional rather than lazy. It is also greedy, i.e. the regex engine tries to match the group pattern once before it tries to skip it.
You need
^\[(?P<app_id>[^]\s:]+)
See the regex demo. Details:
^ - start of string
\[ - a [ char
(?P<app_id>[^]\s:]+) - Group "app_id": any one or more chars other than ], whitespace and :. NOTE: ] does not need to be escaped when it is the first char in the character class.
See the Python demo:
import re
pattern = r"^\[(?P<app_id>[^]\s:]+)"
text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
m = re.search(pattern, text)
if m:
    print(m.group(1))
# => asos-qa
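To make the difference concrete, the sketch below runs the original pattern, a lazy variant, and the negated character class side by side on the same line. The variable names are only illustrative.

```python
import re

text = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"

# Original attempt: the ? makes the group optional, and \S* stays greedy.
greedy = re.search(r"^\[(?P<app_id>\S*)?(:|])", text)
# Lazy quantifier inside the group: stops at the first : or ].
lazy = re.search(r"^\[(?P<app_id>\S*?)(:|])", text)
# Negated character class: cannot run past ], whitespace or :.
negated = re.search(r"^\[(?P<app_id>[^]\s:]+)", text)

print(greedy.group(0))           # [asos-qa:2021:5]
print(greedy.group("app_id"))    # asos-qa:2021:5
print(lazy.group("app_id"))      # asos-qa
print(negated.group("app_id"))   # asos-qa
```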
Your pattern uses a greedy \S*, which matches as many non-whitespace characters as possible.
You can make it non-greedy using \S*?, as in ^\[(?P<app_id>\S*?)(:|]), which puts the value in capture group 1.
Or you can use a negated character class that does not match :, assuming the closing ] will be there.
^\[(?P<app_id>[^:]+)
Example code
import re
pattern = r"\[(?P<app_id>[^:]+)"
s = "[asos-qa:2021:5]#0 Row[info=[ts=-9223372036854775808] ]: 6, 23 |"
match = re.match(pattern, s)
if match:
    print(match.group("app_id"))
Output
asos-qa
Or match only word characters with an optional hyphen in between:
^\[(?P<app_id>\w+(?:-\w+)*)[^]\[]*]

How to print Hindi character from a string in python using regular expression?

I'm using regex in python and trying to extract 'Hindi' character from the given string and then print it but I'm not able to do so. I'm trying to extract जनवरी12 and जनवरी22 from the string. The code should search for a phrase that starts with जनवरी(or any hindi character) and ends with 12( or any number). Here is the code:
import re
string = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
mo = re.compile(r'[^(^a-zA-Z-0-9)]+\d+')
print(mo.findall(string))
Output:
[' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']
I know that [^abc] matches any character that isn’t between the brackets and tried to achieve the same with [^(^a-zA-Z-0-9)]+ but the output is not what I expected.
Expected output:
जनवरी12, जनवरी22
Can anyone explain how this should be done, and how to match the start and end in Python's regex?
I think you just need a pattern that matches 1+ letters (with 0 or more diacritics after each) and then 1+ digits.
See a Python demo that outputs ['जनवरी12', 'जनवरी22']:
import re
s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
combining_marks = '[\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u07FD\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08D3-\u08E1\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u09FE\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AFA-\u0AFF\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C04\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D00-\u0D03\u0D3B\u0D3C\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u1885\u1886\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C5\uA8E0-\uA8F1\uA8FF\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F\U000101FD\U000102E0\U00010376-\U0001037A\U00010A01-\U00010A03\U00010A05\U00010A06\U00010A0C-\U00010A0F\U00010A38-\U00010A3A\U00010A3F\U00010AE5\U00010AE6\U00010D24-\U00010D27\U00010F46-\U00010F50\U00011000-\U00011002\U00011038-\U00011046\U0001107F-\U00011082\U000110B0-\U000110BA\U00011100-\U00011102\U00011127-\U00011134\U00011145\U00011146\U00011173\U00011180-\U00011182\U000111B3-\U000111C0\U000111C9-\U000111CC\U0001122C-\U00011237\U0001123E\U000112DF-\U000112EA\U00011300-\U00011303\U0001133B\U0001133C\U0001133E-\U00011344\U00011347\U00011348\U0001134B-\U0001134D\U00011357\U00011362\U00011363\U00011366-\U0001136C\U00011370-\U00011374\U00011435-\U00011446\U0001145E\U000114B0-\U000114C3\U000115AF-\U000115B5\U000115B8-\U000115C0\U000115DC\U000115DD\U00011630-\U00011640\U000116AB-\U000116B7\U0001171D-\U0001172B\U0001182C-\U0001183A\U000119D1-\U000119D7\U000119DA-\U000119E0\U000119E4\U00011A01-\U00011A0A\U00011A33-\U00011A39\U00011A3B-\U00011A3E\U00011A47\U00011A51-\U00011A5B\U00011A8A-\U00011A99\U00011C2F-\U00011C36\U00011C38-\U00011C3F\U00011C92-\U00011CA7\U00011CA9-\U00011CB6\U00011D31-\U00011D36\U00011D3A\U00011D3C\U00011D3D\U00011D3F-\U00011D45\U00011D47\U00011D8A-\U00011D8E\U00011D90\U00011D91\U00011D93-\U00011D97\U00011EF3-\U00011EF6\U00016AF0-\U00016AF4\U00016B30-\U00016B36\U00016F4F\U00016F51-\U00016F87\U00016F8F-\U00016F92\U0001BC9D\U0001BC9E\U0001D165-\U0001D169\U0001D16D-\U0001D172\U0001D17B-\U0001D182\U0001D185-\U0001D18B\U0001D1AA-\U0001D1AD\U0001D242-\U0001D244\U0001DA00-\U0001DA36\U0001DA3B-\U0001DA6C\U0001DA75\U0001DA84\U0001DA9B-\U0001DA9F\U0001DAA1-\U0001DAAF\U0001E000-\U0001E006\U0001E008-\U0001E018\U0001E01B-\U0001E021\U0001E023\U0001E024\U0001E026-\U0001E02A\U0001E130-\U0001E136\U0001E2EC-\U0001E2EF\U0001E8D0-\U0001E8D6\U0001E944-\U0001E94A\U000E0100-\U000E01EF]'
mo = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))
print(mo.findall(s))
Note that r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks) creates a pattern that matches
(?:[^\W\d_]{}*)+ - one or more occurrences of
[^\W\d_] - any Unicode base letter (if you want to disallow ASCII letters, add (?![A-Za-z]) right before this pattern)
{}* - zero or more occurrences of combining_marks
\d+ - 1+ digits
So, if you want to avoid matching ASCII letters, in the above code, use
r'(?:(?![A-Za-z])[^\W\d_]{}*)+\d+'
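If only Devanagari text needs to be matched, a much shorter pattern may be enough. This sketch assumes, and the assumption is mine rather than the answer's, that every wanted character lies in the Devanagari Unicode block U+0900 to U+097F, which holds both the base letters and the vowel signs used in the sample string.

```python
import re

s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"

# One or more characters from the Devanagari block, then one or more digits.
# Assumption: restricting to this single block is acceptable for the input.
devanagari_then_digits = re.compile(r"[\u0900-\u097F]+\d+")

matches = devanagari_then_digits.findall(s)
print(matches)  # ['जनवरी12', 'जनवरी22']
```

The block also contains Devanagari digits and a few punctuation signs, so this is looser than the letter-plus-combining-marks pattern above. It also never matches ASCII letters, which removes the need for the extra lookahead.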

re.findall() equivalent to a string.split() loop with inner search

Is there a regex string <regex> such that re.findall(r'<regex>', doc) will return the same result as the following code?
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
>>> new_doc
['is', 'if']
Perhaps, your current way of getting the matches is the best.
You can't do that without some additional operation, e.g. list comprehension, because re.findall with a pattern that contains a capturing group outputs the captured substrings in the resulting list.
Thus, you may either add an outer capturing group and use re.findall (taking the first item of each result), or use re.finditer and take the whole match of each result, using
(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+
See this regex demo.
Details
(?<!\S) - a whitespace or start of string must be immediately to the left of the current location
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - immediately to the right of the current location, after any 0+ non-whitespace chars, there cannot be 3 identical consecutive non-whitespace chars, nor a char that is a _, a digit, or any non-word char other than whitespace
\S+ - 1+ non-whitespace chars.
See the Python demo:
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']
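One way to gain confidence in the single-regex version is to run it next to the original loop and compare the results. A sketch, with hypothetical function names:

```python
import re

RX = re.compile(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+')


def filter_with_loop(doc):
    """Original approach: split on whitespace, then reject words one by one."""
    kept = []
    for word in re.split(r'\s+', doc.strip()):
        if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
            kept.append(word)
    return kept


def filter_with_regex(doc):
    """Single-pattern approach: keep the whole match of every hit."""
    return [m.group(0) for m in RX.finditer(doc)]


doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
loop_result = filter_with_loop(doc)
regex_result = filter_with_regex(doc)
print(loop_result, regex_result)
```

The two agree on the sample document. They can still differ at the edges: for an empty string the loop keeps one empty word, while the regex version returns an empty list.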

Regex, sub out all alphabets

I have a situation where I need only the numbers and the dash, like 2007-24. I know how to use regular expressions to replace numbers, but how would you use a regex to remove all the letters while keeping the dash between the numbers?
Input:"CLOSED ORD NO 2007-24"
re.sub("[/-/^0-9/-]", '', self.text, flags=re.M)
You may use
re.sub(r'(\d+-\d+)|.', r'\1', self.text, flags=re.S)
See the regex demo
Regex details
(\d+-\d+) - Group 1: one or more digits, -, 1+ digits
| - or
. - any one char
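The \1 is a backreference to the Group 1 value (to keep it in the result). When the . alternative matches instead, Group 1 does not participate, and \1 is replaced with an empty string (Python 3.5+), so that character is removed.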
See Python demo:
import re
s = "CLOSED ORD NO 2007-24"
print( re.sub(r"(\d+-\d+)|.", r'\1', s, flags=re.S) )
# => 2007-24
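An alternative reading of the same task is to collect the number-dash-number runs directly instead of deleting everything else. The sketch below puts both approaches side by side. The function names and the second sample string are hypothetical, and the re.sub variant relies on Python 3.5+ replacing an unmatched group with an empty string.

```python
import re


def keep_ranges_with_sub(text):
    """Keep digits-dash-digits runs by deleting every other character."""
    return re.sub(r"(\d+-\d+)|.", r"\1", text, flags=re.S)


def keep_ranges_with_findall(text):
    """Collect digits-dash-digits runs and join them."""
    return "".join(re.findall(r"\d+-\d+", text))


s = "CLOSED ORD NO 2007-24"
print(keep_ranges_with_sub(s))      # 2007-24
print(keep_ranges_with_findall(s))  # 2007-24
```

If the runs should stay separate rather than concatenated, the list returned by re.findall can be used directly.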