How to use lookahead and $ with Regex - regex

I am trying to extract the name of the resource from a URL. I will share the regexr URL with you.
My actual regular expression: ([^/]+)(?=\..*)
My example: https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg
I'm trying to get just 5oonz9
I tried to include $, but I don't know why it doesn't work

You can use:
^.+\/(.+)\..+$
^.+ - From the start, match as many characters as possible
\/ - Match a literal /.
(.+) - Match one or more characters and capture them in a group
\. - Match a literal .
.+$ - Match one or more characters at the end of the string (the extension)
Live demo here.
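A minimal Python sketch of the pattern above, run against the URL from the question. Python does not require `/` to be escaped, so `\/` is written as a plain `/`; the variable names are illustrative only.

```python
import re

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

# The greedy ^.+ backtracks to the last '/', the group takes the name,
# and \..+$ consumes the extension.
match = re.match(r"^.+/(.+)\..+$", url)
name = match.group(1) if match else None
print(name)  # 5oonz9
```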

You don't need a capture group, just a match:
(?<=\/)[^\/.]+(?=\.[^\/.]+$)
We can write the expression in free-spacing mode to make it self-documenting:
(?<= # begin a positive lookbehind
\/ # match '/'
) # end the positive lookbehind
[^\/.]+ # match one or more characters other than '/' and '.'
(?= # begin a positive lookahead
\. # match '.'
[^\/.]+ # match one or more characters other than '/' and '.'
$ # match end of string
) # end the positive lookahead
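The free-spacing form can be tried in Python with the `re.VERBOSE` flag. This is a sketch under the assumption that the intended pattern is the one-line version given earlier in this answer; `NAME_RE` is just an illustrative name.

```python
import re

NAME_RE = re.compile(r"""
    (?<= / )     # positive lookbehind: the previous character is '/'
    [^/.]+       # one or more characters other than '/' and '.'
    (?=          # positive lookahead:
        \.       #   a literal '.'
        [^/.]+   #   the extension, with no '/' and no '.'
        $        #   end of string
    )
""", re.VERBOSE)

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

found = NAME_RE.search(url)
resource = found.group(0) if found else None
print(resource)  # 5oonz9
```

No capture group is involved here: the whole match is the resource name, because both the lookbehind and the lookahead are zero-width.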
You should not use a regex for this, however, as Python provides os.path:
import os
# named url rather than str, so the built-in str type is not shadowed
url = 'https://res-3.cloudinary.com/ngxcoder/image/'\
'upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
base = os.path.basename(url)
print(os.path.splitext(base)[0])
#=> "5oonz9"
Here base #=> "5oonz9.jpg".
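If the URL might carry a query string or fragment, one possible variant (an assumption beyond what the question shows) is to split the URL first and work on its path only. The helper name `resource_name` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def resource_name(url):
    """Return the last path segment of `url` without its extension."""
    path = urlsplit(url).path  # drops any ?query and #fragment
    return posixpath.splitext(posixpath.basename(path))[0]

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")
print(resource_name(url))              # 5oonz9
print(resource_name(url + "?v=2#top")) # 5oonz9
```

`posixpath` is used instead of `os.path` so that the result does not depend on the operating system's path separator.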

There are many ways.
A couple are shown below, using Python:
#using regexp:
>>> import re
>>> file_name='https://res-3.cloudinary.com/ngxcoder/image/upload/f_auto,q_auto/v1/blog-images/5oonz9.jpg'
>>> regexpr = r".*/([^\/]+)\.jpg$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9'
>>>
#to get any file name:
>>> regexpr = r".*/([^\/]+)$"
>>> re.match(regexpr, file_name).group(1)
'5oonz9.jpg'
#if interested, here is one using split & take last
>>> (file_name.split("/")[-1]).split(".")[0]
'5oonz9'
>>>

I found a more straightforward solution thanks to other answers:
([^\/]+)(?=\.[^\/.]+$)
Explanation:
([^\/]+) match 1 or more characters that are not '/'
(?=\. look ahead for '.'
[^\/.]+ then 1 or more characters that are neither '/' nor '.' (This was the key!!)
$ end of the string
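A short Python comparison of the original expression from the question and this final one may make the difference concrete. It is only a sketch; the variable names are illustrative.

```python
import re

url = ("https://res-3.cloudinary.com/ngxcoder/image/upload/"
       "f_auto,q_auto/v1/blog-images/5oonz9.jpg")

# Original: the lookahead accepts ANY dot followed by anything,
# so the first dot in the host name already satisfies it.
original = re.search(r"([^/]+)(?=\..*)", url).group(1)

# Final: the dot must start the last extension, right before the end.
fixed = re.search(r"([^/]+)(?=\.[^/.]+$)", url).group(1)

print(original)  # res-3.cloudinary
print(fixed)     # 5oonz9
```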

Related

Regex Expression to remove "autoplay" parameter in url

I'm trying to match the url https://youtube.com/embed/id and its parameters i.e ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more YouTube URLs in it, so I don't think I can easily use Go's net/url package to parse the URL.
You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
cleanURL := u.String()
In real life you'd want to handle errors from url.Parse of course.
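Since the question says the URLs are embedded in HTML, the two ideas can be combined: a regex locates each URL, and a URL parser edits the query. The sketch below shows this in Python rather than Go, using the standard `urllib.parse` module. The names `EMBED_URL`, `strip_autoplay` and `clean_html` are hypothetical, and the character class used to delimit the URL is an assumption about how the URLs are quoted in the HTML.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a URL ends at whitespace, a quote, or an angle bracket.
EMBED_URL = re.compile(
    r"https?://www\.youtube\.com/embed/[A-Za-z0-9_-]+(?:\?[^\s\"'<>]*)?"
)

def strip_autoplay(match):
    """Rebuild one matched URL without its autoplay parameter."""
    parts = urlsplit(match.group(0))
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "autoplay"]
    return urlunsplit(parts._replace(query=urlencode(query)))

def clean_html(html):
    """Apply strip_autoplay to every embed URL found in `html`."""
    return EMBED_URL.sub(strip_autoplay, html)

html = ('<iframe src="http://www.youtube.com/embed/JW5meKfy3fY'
        '?start=10&autoplay=1"></iframe> '
        '<a href="http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1">x</a>')
cleaned = clean_html(html)
print(cleaned)
```

Two limitations of this sketch: `urlencode` may re-encode the remaining parameter values, and an ampersand written as `&amp;` in the HTML source is not decoded before parsing.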
Here's a solution that ensures a valid page URI. Simply match this and replace each match with capture groups 1 and 3 only.
Edit: The pattern is not elegant, but it is meant to ensure that no stale ampersands remain. The previous solution was more elegant and wouldn't have broken anything, but it isn't worth the tradeoff in my opinion.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.
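A quick Python check of this pattern on the question's examples, keeping groups 1 and 3. With Python's `re`, the greedy `.*` in group 1 appears to reach the `autoplay=[01]&?` alternative first, so the `&` in front of `autoplay` ends up in group 1. Making that quantifier lazy (`.*?`) is one possible adjustment; it is an assumption of this sketch, not part of the answer. The names `GREEDY`, `LAZY` and `drop_autoplay` are illustrative.

```python
import re

GREEDY = re.compile(
    r"(https?://www\.youtube\.com/embed/[^?]+\?.*)"
    r"(&autoplay=[01]|autoplay=[01]&?)(.*)"
)
# Same pattern, but group 1 now ends with a lazy .*?
LAZY = re.compile(
    r"(https?://www\.youtube\.com/embed/[^?]+\?.*?)"
    r"(&autoplay=[01]|autoplay=[01]&?)(.*)"
)

def drop_autoplay(pattern, url):
    """Keep capture groups 1 and 3, dropping the autoplay part."""
    return pattern.sub(r"\1\3", url)

base = "http://www.youtube.com/embed/JW5meKfy3fY"
print(drop_autoplay(GREEDY, base + "?start=10&autoplay=1"))  # ...?start=10&
print(drop_autoplay(LAZY, base + "?start=10&autoplay=1"))    # ...?start=10
print(drop_autoplay(LAZY, base + "?autoplay=1"))             # ...JW5meKfy3fY?
```

Both variants leave a trailing `?` when `autoplay` is the only parameter, and neither matches a URL that has no query string at all.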
As the OP has linked to a regex tester that employs the PCRE (PHP) engine, I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with perl=TRUE and possibly other engines).
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www.youtube.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '0' or '1' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '0' or '1', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '0' or '1',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '0' or '1',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it does not match every instance of autoplay=... when more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.

Parsing digits and decimals out of string with re

I have a string that looks like this:
'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
I need to parse the last set of numbers, the ones between the last period and the closing paren (in this case, 241384081), out of the string, keeping in mind that there may be one or more sets of parentheses in the filename "yada_yada.mov".
So far I have this:
mo = re.match('.*([0-9])\)$', data1)
...where data1 is the string. But that is only returning the very last digit.
Any help, please?
Thanks!
You may use
(\d[\d.]*)\)$
See the regex demo.
Details
(\d[\d.]*) - Capturing group 1: a digit and then any amount of . and digits, 0 or more times
\) - a )
$ - end of string.
See the Python demo:
import re
s='Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
m = re.search(r'(\d[\d.]*)\)$', s)
if m:
    print(m.group(1)) # => 22.4338.241384081
    # print(m.group(1).replace(".", "")) # => 224338241384081
Alternative patterns:
(\d+(?:\.\d+)*)\)$ # To match digits and then 0 or more repetitions of . + digits
(\d+(?:\.\d+)*)\)\s*$ # To allow any 0+ trailing whitespaces
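The question asks only for the digits after the last period. Two possible ways to get there are sketched below: split the full dotted group on its last `.`, or capture just the final digit run directly. The variable names are illustrative.

```python
import re

s = 'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'

# Full dotted group, allowing trailing whitespace before the end.
full = re.search(r"(\d+(?:\.\d+)*)\)\s*$", s).group(1)

# Option 1: keep only what follows the last '.'.
last_by_split = full.rsplit(".", 1)[-1]

# Option 2: capture only the digit run that sits right before ')'.
last_direct = re.search(r"(\d+)\)\s*$", s).group(1)

print(full)           # 22.4338.241384081
print(last_by_split)  # 241384081
print(last_direct)    # 241384081
```

Because both patterns are anchored with `$`, parentheses inside the filename do not interfere: only the group that closes the string is considered.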

Regex not able to identify emails with special characters?

Problem:
I wrote a regex to identify email addresses in text, but it does not recognize emails with special characters such as -. So I modified the regex to match emails with special characters, and now it does not match normal emails.
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#kleintoys.com"
NOT_DETECT = "bilgi#klei-ntoys.com"
Modified:
regex = r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-+\w+\.\w+)\"?"
TEXT = "To address parsed is bilgi "
DETECT = "bilgi#klei-ntoys.com"
NOT_DETECT = "bilgi#kleintoys.com"
Is there a regex that combines these two so that it matches both emails,
like
bilgi#klei-ntoys.com
bilgi#kleintoys.com
You could use a much looser regex.
Here is a proposition that does match both addresses:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
Let's break it down:
[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}
[a-zA-Z\d] Match any alphanumerical character...
+ ... at least once
# Match the separator (the arobase, written # in these examples)
.+ Match any character at least once...
\. ... before a dot
[a-zA-Z\d]{,3} Then match up to three alphanumerical characters (possibly none)
Checking with Python:
>>> import re
>>> s = "bilgi#kle-intoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 20), match='bilgi#kle-intoys.com'>
>>> s = "bilgi#kleintoys.com"
>>> re.match(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}", s)
<_sre.SRE_Match object; span=(0, 19), match='bilgi#kleintoys.com'>
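To see how loose this pattern is, here is a small sketch. In Python's `re`, `{,3}` means "between zero and three", so an address that ends right after the dot is also accepted. The third test string is a made-up example, not from the question.

```python
import re

LOOSE = re.compile(r"[a-zA-Z\d]+#.+\.[a-zA-Z\d]{,3}")

results = {}
for candidate in ("bilgi#kleintoys.com",
                  "bilgi#kle-intoys.com",
                  "bilgi#kleintoys."):      # hypothetical: nothing after the dot
    m = LOOSE.match(candidate)
    results[candidate] = m.group(0) if m else None
    print(candidate, "->", results[candidate])
```

If at least two characters should follow the dot, a quantifier such as `{2,3}` could be used instead; that is a suggestion of this sketch rather than part of the answer.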
To make your pattern work, you need to add a part that will match 0+ sequences of - and then 1 or more word chars, (?:-\w+)*:
"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)"?
^^^^^^^^^
See the regex demo.
Details
"? - an optional "
([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+) - Group 1 (what re.findall will output):
[-a-zA-Z0-9.`?{}]+ - 1 or more chars defined in the character class (-, ASCII letters, digits, ., `, ?, {, }). Note that you might want to restrict this part to start with any letter and then also match _, like [^\W\d_][-\w.`?{}]*
# - a #
\w+ - 1 or more letters/digits/_
(?:-\w+)* - 0+ sequences of - and then 1 or more letters/digits/_
\. - a dot
\w+ - 1 or more letters/digits/_
"? - an optional "
Python demo:
import re
rx = r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?"
s = """ "bilgi#kleintoys.com" and bilgi#klei-ntoys.com"""
print(re.findall(rx, s))
# => ['bilgi#kleintoys.com', 'bilgi#klei-ntoys.com']
Use * instead of +:
r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?"
A star after the hyphen matches zero or more occurrences. You have a plus which matches at least one hyphen. BTW, instead of \-* you may use [-]*. Between the square brackets any other special characters, besides -, can be inserted.
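A small Python comparison of the `\-*` variant with the `(?:-\w+)*` variant from the earlier answer. Both handle the two addresses from the question; they differ once the domain contains more than one hyphen. The third address is a made-up example, and the names `STAR` and `GROUPED` are illustrative.

```python
import re

STAR = re.compile(r"\"?([-a-zA-Z0-9.`?{}]+#\w+\-*\w+\.\w+)\"?")
GROUPED = re.compile(r"\"?([-a-zA-Z0-9.`?{}]+#\w+(?:-\w+)*\.\w+)\"?")

text = '"bilgi#kleintoys.com" and bilgi#klei-ntoys.com'
two_hyphens = "bilgi#my-klei-ntoys.com"  # hypothetical address

print(STAR.findall(text))            # both addresses
print(GROUPED.findall(text))         # both addresses
print(STAR.findall(two_hyphens))     # []
print(GROUPED.findall(two_hyphens))  # ['bilgi#my-klei-ntoys.com']
```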

Selecting if no delimiter, and no selecting if it is

I have strings like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and to select nothing if the string does contain "|"?
I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means that if data comes immediately after that position, the RegEx will not match there.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
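Method 1 can be tried in Python as follows. The sketch also includes a made-up third string to show that the lookahead only inspects the character right after `2sg.`, not the rest of the string. `METHOD_1` and `pick` are illustrative names.

```python
import re

METHOD_1 = re.compile(r"(\dsg\.)(?!\|)")

def pick(text):
    """Return the captured token, or None when the pattern does not match."""
    m = METHOD_1.search(text)
    return m.group(1) if m else None

print(pick("smth 2sg. smth"))    # 2sg.
print(pick("smth 2sg.| smth."))  # None
print(pick("smth 2sg. smth|"))   # 2sg.  (the '|' is not adjacent)
```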
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with
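A Python sketch of Method 2. Because the whole string is anchored, a `|` anywhere after the token also prevents a match here, which is what makes it the stricter of the two methods. The last test string is a made-up example; `METHOD_2` and `pick_strict` are illustrative names.

```python
import re

METHOD_2 = re.compile(r"^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$")

def pick_strict(text):
    """Return the captured token only if the whole string fits the shape."""
    m = METHOD_2.match(text)
    return m.group(1) if m else None

print(pick_strict("smth 2sg. smth"))    # 2sg.
print(pick_strict("smth 2sg. smth."))   # 2sg.
print(pick_strict("smth 2sg.| smth."))  # None
print(pick_strict("smth 2sg. smth|"))   # None
```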
Something like this might work for you:
(\d*sg\.)(?!\|)
It assumes an optional number followed by sg. that is not immediately followed by |.
^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
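The same idea in Python, as a sketch. Inside a character class `|` needs no escape, so `[^\|]` is written `[^|]`. The names `ANCHORED` and `first_group` are illustrative.

```python
import re

ANCHORED = re.compile(r"^.*(\dsg\.)[^|]*$")

def first_group(text):
    """Return group 1, or None if a '|' appears after the token."""
    m = ANCHORED.match(text)
    return m.group(1) if m else None

print(first_group("smth 2sg. smth"))    # 2sg.
print(first_group("smth 2sg.| smth."))  # None
```

Note that this pattern uses no lookahead at all: the negated character class between the token and `$` is what rules out a `|` in the remainder of the string.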
Try:
(\d+sg\.(?!\|))
Depending on your programming environment the syntax may differ slightly, but this should give you the result.
For more information see Negative Lookahead

How to match similar groups that repeat, but aren't the same

I have a number of strings relating to products. Each of these have reference numbers and I want to create a regex that picks up if different reference numbers are mentioned more than one time. So, given the following example:
"AB01 MyProduct" >>> No match - because there is only one ID
"AB02 MyOtherProduct" >>> No match - because there is only one ID
"AB03 YetAnotherProduct" >>> No match - because there is only one ID
"AnAccessory for AB01, AB02, AB03 or AB101" >>> Matches!!
"AB02 MyOtherProduct, MyOtherProduct called the AB02" >>> No match - because the codes are the same
Can anyone give me a clue?
If your regex engine supports negative lookaheads, this would do the trick:
(AB\d+).*?(?!\1)AB\d+
It matches if there are two sequences matching AB\d+ and the second one is not the same as the first one (ensured by the negative lookahead).
Explained:
( # start capture group 1
AB # match `AB` literally
\d+ # match one or more digits
) # end capture group one
.*? # match any sequence of characters, non-greedy
(?! # start negative lookahead, match this position if it does not match
\1 # whatever was captured in capture group 1
) # end lookahead
AB # match `AB` literally
\d+ # match one or more digits
Tests (JavaScript):
> var pattern = /(AB\d+).*?(?!\1)AB\d+/;
> pattern.test("AB01 MyProduct")
false
> pattern.test("AnAccessory for AB01, AB02, AB03 or AB101")
true
> pattern.test("AB02 MyOtherProduct, MyOtherProduct called the AB02")
false
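The same tests can be reproduced in Python, whose `re` module also allows a backreference inside a lookahead. The sketch adds one made-up case, `"AB1 and AB12"`, where one ID is a prefix of the other: there the lookahead `(?!\1)` rejects `AB12` because it starts with `AB1`. Collecting the IDs into a set, shown as `has_distinct_ids`, is an alternative check and not the method from the answer.

```python
import re

DIFFERENT_IDS = re.compile(r"(AB\d+).*?(?!\1)AB\d+")

def has_distinct_ids(text):
    """Alternative check: collect every ID and count the unique ones."""
    return len(set(re.findall(r"AB\d+", text))) > 1

samples = [
    "AB01 MyProduct",
    "AnAccessory for AB01, AB02, AB03 or AB101",
    "AB02 MyOtherProduct, MyOtherProduct called the AB02",
    "AB1 and AB12",  # hypothetical: one ID is a prefix of the other
]
for sample in samples:
    by_regex = DIFFERENT_IDS.search(sample) is not None
    print(by_regex, has_distinct_ids(sample), sample)
```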