Regex, sub out all alphabets - regex

I have a situation where I need to keep only the numbers and the dash, like
2007-24. I know how to use regular expressions to replace numbers, but
how would I remove all the letters while keeping the dash between the numbers?
Input:"CLOSED ORD NO 2007-24"
re.sub("[/-/^0-9/-]", '', self.text, flags=re.M)

You may use
re.sub(r'(\d+-\d+)|.', r'\1', self.text, flags=re.S)
See the regex demo
Regex details
(\d+-\d+) - Group 1: one or more digits, -, 1+ digits
| - or
. - any one char
The \1 is a backreference to the Group 1 value (to keep it in the result).
See Python demo:
import re
s = "CLOSED ORD NO 2007-24"
print( re.sub(r"(\d+-\d+)|.", r'\1', s, flags=re.S) )
# => 2007-24
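If the goal is only to pull the number-dash-number token out rather than to clean the whole string, extracting it directly may be simpler than substituting everything else away. A minimal sketch, assuming every wanted token has the form digits-dash-digits (extract_ranges is a hypothetical helper name, not part of the question):

```python
import re

def extract_ranges(text):
    # Return every "digits-digits" token found in the text, in order.
    return re.findall(r"\d+-\d+", text)

print(extract_ranges("CLOSED ORD NO 2007-24"))  # ['2007-24']
```

Unlike the re.sub approach, this keeps multiple tokens separate instead of concatenating them into one string.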

Related

How to use lookahead and $ with Regex

I am trying to get the name of the resource, I will share with you the regexr url
My actual regular expression: ([^/]+)(?=\..*)
My example: https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg
I'm trying to get just 5oonz9
I tried to include $, but I don't know why it doesn't work
You can use:
^.+\/(.+)\..+$
^.+ - From the start, match as many characters as possible
\/ - Match a literal /.
(.+) - Match one or more characters and capture them in a group
\. - Match a literal .
.+$ - Match one or more characters at the end of the string (the extension)
Live demo here.
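A minimal sketch of that pattern in Python, assuming the URL from the question. The slash does not need escaping in Python's re module, so it is written plainly here:

```python
import re

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

# The greedy ^.+ runs to the last '/', so group 1 is the final path segment
# without its extension.
m = re.match(r"^.+/(.+)\..+$", url)
name = m.group(1) if m else None
print(name)  # 5oonz9
```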
You don't need a capture group, just a match:
(?<=\/)[^\/.]+(?=\.[^\/.]+$)
We can write the expression in free-spacing mode to make it self-documenting:
(?<=        # begin a positive lookbehind
  \/        # match '/'
)           # end the positive lookbehind
[^\/.]+     # match one or more characters other than '/' and '.'
(?=         # begin a positive lookahead
  \.        # match '.'
  [^\/.]+   # match one or more characters other than '/' and '.'
  $         # match end of string
)           # end the positive lookahead
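In Python, free-spacing mode is enabled with the re.VERBOSE flag. The following is a sketch of the same expression under that flag, using the URL from the question (the variable names are only illustrative):

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats '#' as a comment start.
pattern = re.compile(r"""
    (?<=/)      # positive lookbehind: the match must be preceded by '/'
    [^/.]+      # one or more characters other than '/' and '.'
    (?=         # positive lookahead:
      \.        #   a literal '.'
      [^/.]+    #   the extension: characters other than '/' and '.'
      $         #   end of string
    )
""", re.VERBOSE)

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")
m = pattern.search(url)
print(m.group(0) if m else None)  # 5oonz9
```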
You should not use a regex for this, however, as Python provides os.path:
import os
# named url rather than str, to avoid shadowing the built-in str type
url = 'https://res-3.cloudinary.com/ngxcoder/image/'\
      'upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
base = os.path.basename(url)
print(os.path.splitext(base)[0])
#=> "5oonz9"
Here base #=> "5oonz9.jpg".
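Since the input is a URL rather than a filesystem path, another option is to parse it first so that a query string or fragment does not end up in the name. A rough sketch using only the standard library (resource_name is a hypothetical helper, and the ?w=100 query is made up for illustration):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def resource_name(url):
    # urlparse separates the path from any query string or fragment;
    # PurePosixPath.stem drops the directory part and the last extension.
    return PurePosixPath(urlparse(url).path).stem

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg?w=100")
print(resource_name(url))  # 5oonz9
```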
There are many ways. Here are a couple using Python:
#using regexp:
>>> import re
>>> file_name='https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
>>> regexpr = r".*/([^\/]+)\.jpg$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9'
>>>
#to get any file name:
>>> regexpr = r".*/([^\/]+)$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9.jpg'
#if interested, here is one using split & take last
>>> (file_name.split("/")[-1]).split(".")[0]
'5oonz9'
>>>
I found a more straightforward solution thanks to other answers:
([^\/]+)(?=\.[^\/.]+$)
Explanation:
([^\/]+) - match 1 or more characters other than '/'
(?=\. ... ) - look ahead for a '.' followed by:
[^\/.]+ - 1 or more characters other than '/' and '.' (This was the key!!)
$ - the end of the string
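The two lookaround variants in this thread behave differently when the file name contains more than one dot. A small sketch to illustrate, using a made-up example.com URL (not from the question):

```python
import re

# Capturing variant: the name itself may contain dots.
wide = r"([^/]+)(?=\.[^/.]+$)"
# Lookbehind variant: the name must not contain dots.
strict = r"(?<=/)[^/.]+(?=\.[^/.]+$)"

url = "https://example.com/files/archive.tar.gz"

m_wide = re.search(wide, url)
m_strict = re.search(strict, url)
print(m_wide.group(1) if m_wide else None)      # archive.tar
print(m_strict.group(0) if m_strict else None)  # None
```

Which behaviour is preferable depends on whether names such as archive.tar.gz are expected in the input.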

Regex negative lookaround with optional whitespace

I am trying to find the digits that are not followed by certain words. I do this using regular expressions in Python 3. My guess is that negative lookarounds have to be used, but I'm struggling due to optional whitespace. See the following example:
'200 word1 some 50 foo and 5foo 30word2'
Note that in reality word1 and word2 can be replaced by a lot of different words, making it much harder to search for a positive match on these words. Therefore it would be easier to exclude the numbers followed by foo. The expected result is:
[200, 30]
My try:
s = '200 foo some 50 bar and 5bar 30foo'
pattern = r"[0-9]+\s?(?!foo)"
re.findall(pattern, s)
Results in
['200', '50 ', '5', '3']
You may use
import re
s = '200 word1 some 50 foo and 5foo 30word2'
pattern = r"\b[0-9]+(?!\s*foo|[0-9])"
print(re.findall(pattern, s))
# => ['200', '30']
Details
\b - a word boundary
[0-9]+ - 1+ ASCII digits only
(?!\s*foo|[0-9]) - not immediately followed with
\s*foo - 0+ whitespaces and foo string
| - or
[0-9] - an ASCII digit.
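If more than one word has to be excluded, the same lookahead can be built from a list. This is a sketch only: numbers_not_followed_by is a hypothetical helper, and it assumes the word list is non-empty.

```python
import re

def numbers_not_followed_by(text, words):
    # Escape each word so regex metacharacters in it are taken literally.
    alternation = "|".join(re.escape(w) for w in words)
    # The [0-9] alternative stops the engine from backtracking to a shorter
    # number such as '5' out of '50'.
    pattern = rf"\b[0-9]+(?!\s*(?:{alternation})|[0-9])"
    return re.findall(pattern, text)

s = '200 word1 some 50 foo and 5foo 30word2'
print(numbers_not_followed_by(s, ["foo"]))           # ['200', '30']
print(numbers_not_followed_by(s, ["foo", "word2"]))  # ['200']
```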
You could also use the pattern \b[0-9]+(?!\s*foo\b)(?=\D), which finds all numbers that are not followed by optional whitespace and the whole word foo, and that are immediately followed by a non-digit character.
import re
s = '200 word1 some 50 foo and 5foo 30word2'
matches = re.findall(r'\b[0-9]+(?!\s*foo\b)(?=\D)', s)
print(matches)
This prints:
['200', '30']
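The two patterns suggested in this thread agree on the sample string but differ at the edges. A short sketch on a made-up input shows where: the first pattern also rejects words that merely start with foo, while the second requires a non-digit character after the number and therefore skips a number at the very end of the string.

```python
import re

first = r"\b[0-9]+(?!\s*foo|[0-9])"
second = r"\b[0-9]+(?!\s*foo\b)(?=\D)"

s = "7 fool 8 foo 9"
print(re.findall(first, s))   # ['9']
print(re.findall(second, s))  # ['7']
```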

Parsing digits and decimals out of string with re

I have a string that looks like this:
'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
I need to parse the last set of numbers, the ones between the last period and the closing paren (in this case, 241384081) out of the string, keeping in mind that there may be one or more sets of parenthesis in the filename "yada_yada.mov."
So far I have this:
mo = re.match('.*([0-9])\)$', data1)
...where data1 is the string. But that is only returning the very last digit.
Any help, please?
Thanks!
You may use
(\d[\d.]*)\)$
See the regex demo.
Details
(\d[\d.]*) - Capturing group 1: a digit and then any amount of . and digits, 0 or more times
\) - a )
$ - end of string.
See the Python demo:
import re
s='Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
m = re.search(r'(\d[\d.]*)\)$', s)
if m:
    print(m.group(1)) # => 22.4338.241384081
    # print(m.group(1).replace(".", "")) # => 224338241384081
Alternative patterns:
(\d+(?:\.\d+)*)\)$ # To match digits and then 0 or more repetitions of . + digits
(\d+(?:\.\d+)*)\)\s*$ # To allow any 0+ trailing whitespaces
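The patterns above capture the whole dotted number. If only the last set of digits is wanted (241384081 in the question), one possible variation is to anchor on the final period instead. This is a sketch, not part of the original answer:

```python
import re

s = 'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'

# A literal '.', then the captured digits, then ')' at the end of the string.
m = re.search(r'\.(\d+)\)\s*$', s)
last_set = m.group(1) if m else None
print(last_set)  # 241384081
```

Because the pattern is anchored at the end of the string, parentheses inside the file name do not affect the result.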

re.findall() equivalent to a string.split() loop with inner search

Is there a regex string <regex> such that re.findall(r'<regex>', doc) will return the same result as the following code?
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = []
for word in re.split(r'\s+', doc.strip()):
    if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
        new_doc.append(word)
>>> new_doc
['is', 'if']
Perhaps, your current way of getting the matches is the best.
You can't do that without some additional operation, e.g. list comprehension, because re.findall with a pattern that contains a capturing group outputs the captured substrings in the resulting list.
Thus, you may either add an outer capturing group and use re.findall, or use re.finditer and take the whole match (group 0), using
(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+
See this regex demo.
Details
(?<!\S) - a whitespace or start of string must be immediately to the left of the current location
(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W]) - after any 0+ non-whitespace chars immediately to the right of the current location, there cannot be 3 identical non-whitespace chars in a row, nor a char that is a _, a digit or any non-word char other than whitespace
\S+ - 1+ non-whitespace chars.
See the Python demo:
import re
doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
new_doc = [x.group(0) for x in re.finditer(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+', doc)]
print(new_doc) # => ['is', 'if']
new_doc2 = re.findall(r'(?<!\S)((?!\S*(\S)\2{2}|\S*(?!\s)[_\d\W])\S+)', doc)
print([x[0] for x in new_doc2]) # => ['is', 'if']
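To check that the single-regex version agrees with the original loop, both can be wrapped in functions and compared. A sketch, with hypothetical function names and one extra made-up test string:

```python
import re

PATTERN = re.compile(r'(?<!\S)(?!\S*(\S)\1{2}|\S*(?!\s)[_\d\W])\S+')

def filter_with_loop(doc):
    # The approach from the question: split on whitespace, then reject words.
    kept = []
    for word in re.split(r'\s+', doc.strip()):
        if not re.search(r'(.)\1{2,}|[_\d\W]+', word):
            kept.append(word)
    return kept

def filter_with_regex(doc):
    # group(0) is the whole match, independent of the inner capturing group.
    return [m.group(0) for m in PATTERN.finditer(doc)]

doc = ' th_is is stuff. and2 3things if y4ou kn-ow ___ whaaaat iii mean)'
print(filter_with_loop(doc) == filter_with_regex(doc))  # True
```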

Regex not able to identify emails with special characters?

Problem:
I wrote a regex to identify email addresses in the text, but it does not recognize emails with special characters like -. So I modified the regex to match emails with special characters. Now it does not match normal emails.
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#kleintoys.com"
NOT_DETECT = "bilgi#klei-ntoys.com"
Modified:
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-+\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#klei-ntoys.com"
NOT_DETECT = "bilgi#kleintoys.com"
Is there a regex that combines these two to match both emails,
like
bilgi#klei-ntoys.com
bilgi#kleintoys.com
You could make a much more loose regex.
Here is a proposition that does match both addresses:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
Let's break it down:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
[a-zA-Z\d] Match any alphanumerical character...
+ ... at least once
# Match the literal # (used here in place of the at sign)
.+ Match any character at least once...
\. ... before a dot
[a-zA-Z\d]{,3} Then match at most three alphanumerical characters (zero to three)
Checking with Python:
>>> import re
>>> s = "bilgi#kle-intoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 20), match='bilgi#kle-intoys.com'>
>>> s = "bilgi#kleintoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 19), match='bilgi#kleintoys.com'>
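Note that {,3} means "zero to three", so the loose pattern also accepts an address with nothing after the final dot. The sketch below compares it with a variant that requires two or three characters. The {2,3} bound is an assumption chosen for illustration, not something the answer specifies:

```python
import re

loose = r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}"
strict = r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{2,3}"  # assumed stricter bound

# fullmatch requires the whole string to match, not just a prefix.
print(bool(re.fullmatch(loose, "bilgi#kleintoys.")))      # True
print(bool(re.fullmatch(strict, "bilgi#kleintoys.")))     # False
print(bool(re.fullmatch(strict, "bilgi#kleintoys.com")))  # True
```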
To make your pattern work, you need to add a part that will match 0+ sequences of - and then 1 or more word chars, (?:-\w+)*:
"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)"?
                         ^^^^^^^^^
See the regex demo.
Details
"? - an optional "
([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+) - Group 1 (what re.findall will output):
[-a-zA-Z0-9.`?{}]+ - 1 or more chars defined in the character class: -, ASCII letters, digits, ., `, ?, {, } (note you might want to restrict this part to start with any letter and then also match _, like [^\W\d_][-\w.`?{}]*)
# - a #
\w+ - 1 or more letters/digits/_
(?:-\w+)* - 0+ sequences of - and then 1 or more letters/digits/_
\. - a dot
\w+ - 1 or more letters/digits/_
"? - an optional "
Python demo:
import re
rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"
s = """ "bilgi#kleintoys.com" and bilgi#klei-ntoys.com"""
print(re.findall(rx, s))
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com']
Use * instead of +:
r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"
A star after the hyphen matches zero or more occurrences, whereas your plus requires at least one hyphen. BTW, instead of \-* you may use [-]*. Any other special characters you want to allow can be added between the square brackets alongside -.
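The two fixes proposed in this thread are not equivalent. A quick sketch comparing them on the question's addresses plus two made-up ones (a domain with several hyphens and a one-letter domain):

```python
import re

group_rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"
star_rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"

samples = [
    "bilgi#kleintoys.com",
    "bilgi#klei-ntoys.com",
    "bilgi#my-toy-shop.com",  # made-up: more than one hyphenated part
    "bilgi#x.com",            # made-up: single-character domain
]
for s in samples:
    print(s, re.findall(group_rx, s), re.findall(star_rx, s))
```

Both patterns match the two addresses from the question. Only the (?:-\w+)* version matches the last two samples, because \w+\-*\w+ allows a single run of hyphens and needs at least two word characters before the dot.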