Ignore everything after slash if it's a number - regex

I'm trying to ignore everything after a slash if it's a number -
http://www.example.com/123abc/456/ABC/789/
Desired output is
http://www.example.com/123abc/
I have tried the following so far -
(https?:\/\/.*)(?=/\d+).*
which gives me -
http://www.example.com/123abc/456/ABC/
Many Thanks!

I think you want
(https?:\/\/.*?)(?=/\d+\/).*
//            ^        ^^
Making the repetition non-greedy, and enforcing the whole directory to be a number (otherwise /123abc… would already match it). Maybe you also want to move the first slash from the lookahead into the matching group, so that your result has the trailing slash.
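A minimal sketch of this suggestion, assuming Python's `re` module as the engine (the question does not name one). The function name `strip_numeric_tail` is made up for illustration. It uses the variant with the slash moved into the capturing group, so the result keeps its trailing slash.

```python
import re

# Reluctant .*? stops at the first slash that is followed by an all-digit
# directory; that slash sits inside group 1, so it is kept in the result.
PATTERN = re.compile(r"(https?://.*?/)(?=\d+/).*")


def strip_numeric_tail(url: str) -> str:
    # Replace the whole match with group 1. URLs without a numeric
    # directory do not match and are returned unchanged.
    return PATTERN.sub(r"\1", url)


print(strip_numeric_tail("http://www.example.com/123abc/456/ABC/789/"))
```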

The .* is greedy and will try to match as much as possible. Because /789 appears late in the URL, the lookahead can still succeed there, so everything up to it is matched. Instead you can use:
(https?:\/\/.*?)(?=/\d+).*
The ? makes the .* reluctant, so it will match as little as possible to satisfy the expression.
However, this doesn't fulfill the requirement you described which is actually "Ignore everything after the second slash if it is a number." You can use (in your specific case):
(https?:\/\/.*?\/.*?\/)(?=\d+).*
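To see how the three variants in this thread differ, here is a small sketch that prints what group 1 captures for each. It assumes Python's `re` module; other engines should behave the same for these constructs, but that is not verified here. The dictionary keys are labels invented for this example.

```python
import re

url = "http://www.example.com/123abc/456/ABC/789/"

patterns = {
    # Greedy: backtracks from the end, so it stops at the LAST "/digits".
    "greedy": r"(https?://.*)(?=/\d+).*",
    # Reluctant: stops at the FIRST "/digits", which is already "/123abc".
    "reluctant": r"(https?://.*?)(?=/\d+).*",
    # Two reluctant path segments, then digits must follow.
    "second slash": r"(https?://.*?/.*?/)(?=\d+).*",
}

results = {name: re.match(p, url).group(1) for name, p in patterns.items()}
for name, captured in results.items():
    print(f"{name:>12}: {captured}")
```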

Match specific pattern that does not contain other pattern in one expression

I'm looking for a regex to use in nginx location matching, that would match a specified end pattern not being anywhere preceded by a specified other pattern.
Like, I have files:
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb
I want to match all \.unityweb except those that are anywhere preceded by dev. Basically, I need to match last two lines. I cannot hardcode it, as the files/directories might be named arbitrary.
The usual ((?!dev\/).)*$ doesn't suffice, because it still gets the ends. (?<!dev) also cannot be added anywhere, as it will only match directly before.
I am out of clues and also out of regex fu!
The solution does not have to be strictly regex, might be nginx based too.
It might have been asked before, but I cannot seem to know the correct keywords to find it.
Try
^(?!.*?dev\/.*).+\.unityweb$
Description:
^ From the start of the line
(?! _______ ) Negative Lookahead
.*?dev\/ Match any character any amount of times, until you reach dev followed by a slash
.* Match any characters any amount of times
Negative lookahead closes
.+ Match any character, more than once
\.unityweb - until you reach .unityweb
$ End of the line
Use the full match for what you need
EDIT
Just realised that you also state a contradiction in your question, as you say you don't want to match anything preceded by dev/ but you also want to match the first two examples you gave.
That can be done by changing the negative lookahead to a positive lookahead:
^(?=.*?dev\/.*).+\.unityweb$
You can use this
^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$
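A quick check of the two negative-lookahead patterns against the four file names from the question. This is only a sketch: nginx compiles location regexes with PCRE, and Python's `re` is used here as a stand-in, which should be equivalent for these constructs. The variable names are made up for the example.

```python
import re

paths = [
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb",
]

# Rejects any path that contains "dev/" somewhere.
lookahead_at_start = re.compile(r"^(?!.*?dev/.*).+\.unityweb$")
# Rejects any path with "dev" anywhere before ".unityweb".
two_lookaheads = re.compile(r"^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$")

kept_first = [p for p in paths if lookahead_at_start.search(p)]
kept_second = [p for p in paths if two_lookaheads.search(p)]

for p in kept_first:
    print(p)
```

Note that the second pattern is stricter: it also rejects a file whose name merely contains `dev` without a following slash.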

Having difficulty understanding regex backtracking

I was browsing through the regex-tagged questions on SO when I came across this problem:
A regex for a url was needed, the url begins with domain.com/advertorials/
The regex should match the following scenarios,
domain.com/advertorials
domain.com/advertorials?test=true
domain.com/advertorials/
domain.com/advertorials/?test=true
but not this,
domain.com/advertorials/version1?test=true
I came up with this regex: advertorials\/?(?:(?!version)(.*))
This should work, but it doesn't for the last case. Looking at the debugger on regex101.com,
I see that after matching 's/' it tries 'version' character by character and ultimately matches it, but since this is a negative lookahead, the condition fails. And this is the part I don't understand: after failing, it backtracks to before the '/' in 's/' and not after 's/'.
Is this how it's supposed to work? Can anyone help me understand?
(here's the demo link: https://regex101.com/r/ww3HR8/1).
Thanks,
Note: People already gave their solutions to that problem; I just want to know why my regex fails.
The backtracking mechanism is in charge of this phenomenon, as you have already pointed out.
The ? quantifier, matching 1 or 0 repetitions of the quantified subpattern, lets the regex engine match the string in two ways: either match the quantified subpattern, or skip it and go on matching the string with the subsequent subpattern.
So, advertorials/?(?!version)(.*) (I removed the redundant (?:...) non-capturing group), when applied to domain.com/advertorials/version1?test=true, matches advertorials, then matches /, and then the negative lookahead checks if, immediately to the right of the current position, there is a version substring. Since there is version after /, the regex engine goes back and sees that the /? pattern can match an empty string. So, the lookahead check is re-applied straight after advertorials. There is no version after advertorials (the next character is /), and the match is returned.
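The backtracking step described above can be observed directly by looking at what the capture group ends up holding. A minimal sketch, assuming Python's `re` module (the question used regex101, whose default PCRE flavor behaves the same way for this pattern):

```python
import re

original = re.compile(r"advertorials/?(?!version)(.*)")
url = "domain.com/advertorials/version1?test=true"

m = original.search(url)
# /? gave the slash back, so the lookahead was checked in front of "/"
# and group 1 swallowed the slash together with "version1...".
print(m.group(0))
print(m.group(1))
```

The leading slash inside group 1 is the visible trace of `/?` having matched the empty string.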
The usual solution is using possessive quantifiers or atomic groups, but there are other approaches, too.
E.g.
advertorials\/?+(?!version)(.*)
              ^^
See the regex demo. Here, \/?+ matches 1 or 0 / chars, but once it matches, the engine cannot go back and re-match a part of a string with this pattern.
Or, you may include the /? in the lookahead and place it before /? pattern:
advertorials(?!\/?version)\/?(.*)
See another regex demo.
If you plan to disallow version anywhere after advertorials use
advertorials(?!.*version)\/?(.*)
See yet another demo.
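The two lookahead-based fixes can be checked against the URLs from the question with a short sketch. It assumes Python's `re` module; the possessive `\/?+` form is left out because Python only accepts possessive quantifiers from version 3.11 on. The helper name `accepted_by` is invented for this example.

```python
import re

should_match = [
    "domain.com/advertorials",
    "domain.com/advertorials?test=true",
    "domain.com/advertorials/",
    "domain.com/advertorials/?test=true",
]
should_not_match = ["domain.com/advertorials/version1?test=true"]

# The optional slash is part of the lookahead, so skipping it cannot help.
slash_in_lookahead = re.compile(r"advertorials(?!/?version)/?(.*)")
# Forbids "version" anywhere after "advertorials".
no_version_anywhere = re.compile(r"advertorials(?!.*version)/?(.*)")


def accepted_by(pattern):
    # Return the URLs, in order, for which the pattern finds a match.
    return [u for u in should_match + should_not_match if pattern.search(u)]


for pattern in (slash_in_lookahead, no_version_anywhere):
    print(pattern.pattern, accepted_by(pattern))
```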
Making the slash optional means there is a way to match without violating the constraint. If there is a way to match, the regex engine will find it, always.
Make the slash non-optional when it's followed by anything at all.
advertorials(?:/(?!version).*)?$
Incidentally, regex itself doesn't require the slash to be backslash-escaped (though some host languages use slashes as regex delimiters, so maybe you need to put it back). I also removed some redundant parentheses.
The reason:
The \/? part of this pattern is optional:
advertorials\/?(?:(?!version)(.*))
Therefore it can also be advertorials(?:(?!version)(.*))
which matches advertorials/version
Essentially, (?!version)(.*) matches /version
Btw, this is normal backtracking by 1 character.
If you have already fixed it, then we're done !

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be 10 characters, alphanumeric, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
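One consequence of not fixing the length is that the greedy (.*) runs up to the last slash that follows dp/. The sketch below, assuming Python's `re` module, compares the two forms. The second URL is hypothetical and was made up to show the difference; it does not come from the question.

```python
import re

url = ("https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/"
       "ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop")
# Hypothetical URL with a second slash after the ID (an assumption).
url_extra_slash = ("https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/"
                   "B00VU2BZRO/ref=sr_1_3/extra")

any_length = re.compile(r"(?<=dp/)(.*)(?=/)")
ten_chars = re.compile(r"(?<=dp/)(.{10})(?=/)")

print(any_length.search(url).group(1))
print(any_length.search(url_extra_slash).group(1))  # runs past the ID
print(ten_chars.search(url_extra_slash).group(1))
```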
Depending on your language/method of application, you have a couple of options.
Positive lookbehind. This will make your regex more complicated, but will make it match exactly what you want:
(?<=dp/)[0-9A-Z]{10}
The construct (?<=...) is called a positive lookbehind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens matches immediately before the current position.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.
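Both options can be tried side by side. This is a sketch assuming Python's `re` module; other languages expose the same ideas through their own match and group APIs.

```python
import re

url = ("https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/"
       "ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop")

# Option 1: lookbehind, the full match is already just the ID.
lookbehind = re.search(r"(?<=dp/)[0-9A-Z]{10}", url)
# Option 2: capture group, the full match includes "dp/", group 1 does not.
capture = re.search(r"dp/([0-9A-Z]{10})", url)

print(lookbehind.group(0))
print(capture.group(0))
print(capture.group(1))
```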

Google Analytics Regex - Alternative to no negative lookahead

Google Analytics does not allow negative lookahead anymore within its filters. This is proving to be very difficult to create a custom report only including the links I would like it to include.
The regex that includes negative lookahead that would work if it was enabled is:
test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
This matches:
test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23
test.com/?ref=23&e=35
and does not match (as it should):
test.com/ambassadors
test.com/admin/?signup=true
test.com/randomtext/
I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.
Thank you!
Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.
That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:
test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)
...which I'm pretty sure you don't want. :P
Try this regex:
test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$
or more readably:
test\.com
(?:
/
(?:index_\w+\.php)?
(?:
\?ref=\d+
(?:
&e=\d+
)?
)?
)?
\s*$
For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:
^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
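The lookahead-free patterns above can be checked against the URLs listed in the question. This is only a sketch: it uses Python's `re` module as a stand-in, and Google Analytics' own regex engine may differ in details. The variable names are invented for the example.

```python
import re

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]

full_url = re.compile(
    r"test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$"
)
# Path-only variant: assumes the input starts right after the domain.
path_only = re.compile(r"^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$")

matched = [u for u in should_match + should_not_match if full_url.search(u)]
for u in matched:
    print(u)
```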
Firstly I think your regex needs some fixing. Let's look at what you have:
test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
The case where you use the optional ? at the start of index... is already taken care of by the second alternative:
test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:
test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:
test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now the first and second and third option can be collapsed into one, if we make the file name optional, too:
test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)
Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
Seeing your input examples, this seems to be closer to what you actually want to match.
Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
So, if you use singleline mode (which probably means you have only one URL per string), use this:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z
If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
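As a sanity check of the simplified pattern, here is a sketch that runs the \Z variant over the example URLs from the question. It assumes Python's `re` module, where \Z matches only at the very end of the string; Google Analytics may use a different engine.

```python
import re

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]

simplified = re.compile(r"test.com(/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z")

accepted = [u for u in should_match + should_not_match if simplified.search(u)]
for u in accepted:
    print(u)
```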

Regex - not in the list but matched anyway

This is a bit hard to sum up in a title, but here is my problem:
(?:(?:http|https):\\/\\/)?(?:\\/\\/www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
Given the expression above, I really don't understand why ftp://www.youtube.com/watch?v=F5eScJmYZZ8 matches. I unsuccessfully tried adding ^ to the beginning of the expression, but then it no longer matches anything (this is done in Java, which explains the doubled backslashes).
How can ftp be accepted when it is clearly not listed in (http|https)?
EDIT
To be accurate, here is what is allowed:
http(s)://www.[...]
http(s)://[...]
www.[...]
[...]
and nothing else.
Because the ? after the http part means that it is optional. Use + instead of ?.
Also, you are checking for // after http twice.
\s* allows whitespace at the beginning. If you don't want to allow whitespace (i.e., the input text will contain only 1 match), use ^ instead.
Here is the working regex that meets all of your added requirements:
\s*(?:(http|https)\:\/\/)?(?:www\.)?youtube.com\/watch\?(?:.*)v=(\w{11}).*
Because the leading (?:(?:http|https):\\/\\/)? is optional. That's what the question mark at the end of the group signifies (match at most one, i.e. match only if it exists).
A leading ^ should prevent the match with ftp though. Can you post the failing regex you tried (with the ^)?
UPDATE:
Aha! It matches without the ^ since the http group is optional, and anything can come before the match (e.g. cheeseyoutube.com/... would match). Adding a ^ to the beginning of the regex fixes this, but there's another problem with your regex: the www group is trying to match two slashes (as first pointed out in Justin's answer), which it can't once the http group has already matched those slashes. So the www group fails to match (fine, since it's optional), but then the youtube part can't match since there's an unmatched www in the way!
This should fix your problem:
^(?:(?:http|https):\\/\\/)?(?:www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
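The difference between the original and the anchored pattern can be reproduced with a short sketch. It uses Python's `re` module rather than Java, so the backslashes are written once inside raw strings instead of doubled. The `allowed` and `rejected` lists are built from the forms listed in the question's EDIT and from the cheeseyoutube.com example above.

```python
import re

# Original pattern from the question: no anchor, "//" repeated in the www group.
unanchored = re.compile(
    r"(?:(?:http|https)://)?(?://www\.)?youtube.com/watch\?(?:.*)v=(\w{11}).*"
)
# Fixed pattern: anchored at the start, www group without the slashes.
anchored = re.compile(
    r"^(?:(?:http|https)://)?(?:www\.)?youtube.com/watch\?(?:.*)v=(\w{11}).*"
)

allowed = [
    "http://www.youtube.com/watch?v=F5eScJmYZZ8",
    "https://www.youtube.com/watch?v=F5eScJmYZZ8",
    "http://youtube.com/watch?v=F5eScJmYZZ8",
    "www.youtube.com/watch?v=F5eScJmYZZ8",
    "youtube.com/watch?v=F5eScJmYZZ8",
]
rejected = [
    "ftp://www.youtube.com/watch?v=F5eScJmYZZ8",
    "cheeseyoutube.com/watch?v=F5eScJmYZZ8",
]

# Without the anchor, the match simply starts later in the string.
print(unanchored.search(rejected[0]).group(0))
print([anchored.search(u).group(1) for u in allowed])
print([anchored.search(u) for u in rejected])
```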