Regexp to extract URL components

I am trying to write a regexp to extract the components of a URL. The syntax can be found here: RFC 3986.
Some of the components are optional. So far I have:
(.+)://((.*)@)?(.+?)(:(\d*))?/((.*)\?)?((.*)#)?(.*)
The decomposition is:
(.+):// matches the scheme followed by ://. Not optional.
((.*)@)? matches the user information part of the authority. Optional.
(.+?) matches the host. Not optional. There is an issue here where this group will also match the optional port.
(:(\d*))? should match the port.
/ this and all that follows should be made optional.
((.*)\?)? matches the path part. Optional.
((.*)#)? matches the query part. Optional.
(.*) matches the fragment part. Optional.
How can I improve this regexp so that it is RFC 3986-valid?
Fun fact: this regexp matches itself.
Example URL (taken from the RFC): foo://example.com:8042/over/there?name=ferret#nose
Edit: I forgot to escape d. Now all that's left to do is to make everything that follows the host optional, including the leading /.

Your regular expression works fine if you just escape the slashes and, preferably, the colon as well. The result is (.+)\:\/\/(.*@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*). The script at the end of this answer shows how it can be used to filter out invalid URIs.
Update
Following the comments, I have made the following modification:
I have added (\:((\d*)\/))?(\/)*. Explanation:
\:(\d*) matches a colon followed by any string of digits.
The \/ after it matches the slash that must follow the digits. The port may contain only digits, so no other character can appear between the colon and that slash.
The entire port-matching expression is optional, hence the ?.
The final (\/)* allows zero or more slashes after the port, whether or not a port is present.
Final regExp:
(.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)
// A regex literal is used here: inside a string passed to new RegExp,
// backslashes such as the one in \d would be consumed by the string parser.
const myRegEx = /^(.+)\:\/\/(.*\@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)$/;
const allUris = [
/*Valid*/ "https://me@data.www.example.com:5050/page?query=value#element",
/*Valid*/ "foo://example.com:8042/over/there?name=ferret#nose",
/*Valid*/ "foo://example.com",
/*Not valid*/ "www.example.com"];
// Keep only the URIs that the regexp matches
const allowedUris = allUris.filter(uri => myRegEx.test(uri));
console.log("Here are the valid URIs:");
console.log(allowedUris.join("\n\n")); // Should print only the first three URIs from the array.

I have found a better way to handle this, while also not breaking the capture groups.
(\w[\w\d\+\.-]*)://(.+@)?([^:/\?#]+)(:\d+)?(/[^\?#]*)?(\?[^#]+)?(#.*)?
Decomposition:
(\w[\w\d\+\.-]*):// matches the scheme, approximately as per RFC 3986 (\w also admits underscores and a leading digit, which the RFC does not allow).
(.+@)? matches user information; i.e. everything up to and including @, optionally.
([^:/\?#]+) matches host; i.e. everything until : or / or ? or # is encountered.
(:\d+)? matches port; i.e. : plus all digits, optionally.
(/[^\?#]*)? matches path; i.e. / plus every character until ? or # is encountered, optionally.
(\?[^#]+)? matches query; i.e. ? plus every character until # is encountered, optionally.
(#.*)? matches fragment; i.e. # plus everything after, optionally.
EDIT: final version I'm using (capture groups added for better extraction + discarding new-line characters):
(\w[\w\d\+\.-]*)://((.+)@)?([^:/\?#\r\n]+)(:(\d+))?(/([^\?#\r\n]*))?(\?([^#\r\n]*))?(#(.*))?
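To see which capture group holds which component, the final pattern can be wrapped in a small helper. This is only a sketch: parseUri and URI_PATTERN are hypothetical names, the ^ and $ anchors are an addition, and the userinfo delimiter is assumed to be the standard @ from RFC 3986.

```javascript
// Final pattern from the answer above, anchored with ^ and $ (an addition).
// Assumption: "@" is the userinfo delimiter, as RFC 3986 specifies.
const URI_PATTERN =
  /^(\w[\w\d+.-]*):\/\/((.+)@)?([^:\/?#\r\n]+)(:(\d+))?(\/([^?#\r\n]*))?(\?([^#\r\n]*))?(#(.*))?$/;

// parseUri is a hypothetical helper name; it returns null when nothing matches.
function parseUri(uri) {
  const m = URI_PATTERN.exec(uri);
  if (m === null) return null;
  return {
    scheme: m[1],
    userinfo: m[3], // undefined when absent
    host: m[4],
    port: m[6],     // digits only, undefined when absent
    path: m[7],     // includes the leading "/"
    query: m[10],   // without the leading "?"
    fragment: m[12] // without the leading "#"
  };
}

console.log(parseUri("foo://example.com:8042/over/there?name=ferret#nose"));
```

One limitation worth knowing: because (.+)@ is greedy, an @ that appears later in the URL (for example in the query) is treated as the end of the user information, so the host group ends up wrong for such inputs.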

Related

Regex to match string that does not contain slash

I am trying to set up a route using vue-router in a web app using regex to match the pattern. The pattern I am looking to match is any string that contains alphanumeric characters (and underscore) without slashes. Here are some examples (the first slash is just to show the string after the domain e.g. example.com/):
/codestack
/demo45
/i_am_long
Strings that should not match would be:
/data/files.xml
/share/home.html
/demo45/photos
The only regex I came up with so far is:
path: '/:Username([a-zA-Z0-9]+)'
That is not quite right because it matches all the characters except for the slash. Whereas I want to only match on the first set of alphanumeric characters (including underscore) before the first forward slash is encountered.
If a route contains a forward slash e.g. /data/files.xml then that should be a different regex route match. Therefore I also need a regex pattern to match the examples above containing slashes. Theoretically, they could contain any number of slashes e.g. /demo45/photos/holiday/2015/bahamas.

For the first part, you can match 1 or more word characters which will also match an underscore.
The anchors ^ and $ assert the start and end of the string.
^\w+$
For the second one, you can start the match with word characters followed by /.
In case of more forward slashes, you can optionally repeat that first pattern in a group.
The last part after the pattern can be 1 or more word characters with an optional part matching a dot and word characters.
^\w+/(?:\w+/)*\w+(?:\.\w+)?$
If you want to match any char except / you can use [^/]
^(?:[^/\s]+/)+[^/\s]+$
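The three patterns can be checked in plain JavaScript, outside vue-router, against the paths from the question. This is a rough sketch; classifyPath is a hypothetical helper, and the leading slash is stripped first because the question says it is only there to show where the domain ends.

```javascript
const singleSegment = /^\w+$/;                           // e.g. codestack
const nestedWord = /^\w+\/(?:\w+\/)*\w+(?:\.\w+)?$/;     // e.g. data/files.xml
const nestedAny = /^(?:[^\/\s]+\/)+[^\/\s]+$/;           // any non-slash, non-space chars

// Hypothetical helper: reports which kind of route a path would fall under.
function classifyPath(path) {
  const p = path.replace(/^\//, ""); // drop the illustrative leading slash
  if (singleSegment.test(p)) return "single";
  if (nestedWord.test(p)) return "nested";
  return "none";
}

for (const path of ["/codestack", "/i_am_long", "/data/files.xml",
                    "/demo45/photos/holiday/2015/bahamas"]) {
  console.log(path, classifyPath(path));
}
```

The [^/\s] variant is the more permissive one: a path such as my-site/a-b.html contains hyphens, so it fails the \w-based pattern but passes nestedAny.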

Regex - find the param in a url in any position in the string

I am trying to match a URL param whose position in the URI is not fixed. It can show up right after the ? or after an &. I need to match the vr=359821 param in the URIs below. How can I do this?
Example urls:
/br/col/aon/11631?vr=359821&cId=9113
/br/col/aon/11631?cId=9113&vr=359821
/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something
Some things I tried:
I tried to use backreferencing (not sure if this is the right approach) but was not successful.
I was trying to group them and maybe backreference to find the string within that group.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)) # this matches second url above but not others.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).+?\3 # this is wrong I know.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).*?\2[vr=359821] # this is wrong
The above regexes are wrong, but my idea was to make it a group and match vr=359821 in that group. I don't know if this is even possible in regex.
Why I am doing this:
The final goal is to redirect this URL to a different URL with all the params from the original request in nginx.

In the last 2 patterns that you tried, you are using a backreference like \2 and \3. But a backreference will match the same data that was already captured in the corresponding group.
In this case, that is not the desired behaviour. Instead, you want to match a key value pair in the uri, which does not have to exist in the content before.
Therefore you can match the start of the pattern followed by a non greedy quantifier (as it can also occur right after the question mark) to match the first occurrence of vr= followed by 1 or more digits.
In the comments I suggested this pattern \/br\/col\/aon\/11631\b.*?[?&](vr=\d+), but (depending on the regex delimiters) you don't have to escape the forward slash.
The pattern could be
/br/col/aon/11631\b.*?[?&](vr=\d+)
The pattern matches
/br/col/aon/11631\b Match the start of the pattern followed by a word boundary
.*? Match any char, as few times as possible
[?&] Match either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
From what I read, nginx uses PCRE. To get a more specific pattern, one option could be:
/br/col/aon/11631\?.*?(?<=[?&])(vr=\d+)(?=\&|$)
This pattern matches
/br/col/aon/11631\? Match the start of the pattern followed by the question mark
.*? Match any char, as few times as possible
(?<=[?&]) Positive lookbehind, assert what is directly to the left is either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
(?=\&|$) Positive lookahead, assert what is directly to the right is & or the end of the string to prevent a partial match
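The difference between the two patterns shows up on a value that is only partly numeric. The sketch below uses JavaScript rather than nginx, on the assumption that the constructs involved behave the same way in both engines; firstGroup and the extra test URI ending in vr=12abc are made up for illustration.

```javascript
const loosePattern = /\/br\/col\/aon\/11631\b.*?[?&](vr=\d+)/;
const strictPattern = /\/br\/col\/aon\/11631\?.*?(?<=[?&])(vr=\d+)(?=&|$)/;

// Hypothetical helper: returns capture group 1, or null when there is no match.
function firstGroup(pattern, uri) {
  const m = pattern.exec(uri);
  return m ? m[1] : null;
}

const sampleUris = [
  "/br/col/aon/11631?vr=359821&cId=9113",
  "/br/col/aon/11631?cId=9113&vr=359821",
  "/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something",
  "/br/col/aon/11631?vr=12abc" // made-up case: value is not purely digits
];

for (const uri of sampleUris) {
  console.log(uri, firstGroup(loosePattern, uri), firstGroup(strictPattern, uri));
}
```

Both patterns capture vr=359821 from the three URIs in the question. For the made-up fourth URI the first pattern still captures vr=12, while the lookahead in the second one rejects the partial match.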

Regex about url encoded string

I would like to write a regex to get the URL-encoded string in the line below:
<topicref href="%E4%BA%B0.txt"/>
When I used a regex like (%[A-Z][0-9])+\.txt it only got %B0.txt. What can I do if I want to get the whole URL-encoded string, such as %E4%BA%B0.txt?
Thanks a lot.

Proper URL encoding uses hex digits only, A-F not A-Z. The encoded URL could contain non-encoded characters anywhere. Also, you should escape the full stop.
((%[0-9A-F]{2}|[^<>'" %])+)\.txt
is a quick ad-hoc fix for your regex, though obviously for any production code, probably don't use a regex for this at all, or at the very least try a well-defined and properly tested URL regex like the one you can find in the HTTP RFC.
Putting the + quantifier outside the capturing parentheses will only return the last repetition. I added a second set of parentheses to put the quantifier inside the first capture group, which assumes you are doing something to extract the first capture group in particular. (If your regex dialect has non-capturing groups, you could change the second opening parenthesis to non-capturing, i.e. (?:.)
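A short JavaScript sketch makes the difference visible on the line from the question. The variable names are made up, and the inner group is written as non-capturing, as the note about regex dialects suggests.

```javascript
const topicLine = '<topicref href="%E4%BA%B0.txt"/>';

// Pattern from the question: only %B0 has the shape "% letter digit".
const askedPattern = /(%[A-Z][0-9])+\.txt/;

// Pattern from this answer, with the inner alternation made non-capturing.
const hexPattern = /((?:%[0-9A-F]{2}|[^<>'" %])+)\.txt/;

console.log(askedPattern.exec(topicLine)[0]); // the partial match the asker saw

const hexMatch = hexPattern.exec(topicLine);
console.log(hexMatch[1]);                     // the full encoded name, without .txt
console.log(decodeURIComponent(hexMatch[1])); // the decoded character
```

Group 1 holds the encoded part without the extension, so it can be passed straight to decodeURIComponent.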

You need to change your regex to
([%\dA-Z]+)\.txt
([%\dA-Z]+) - Match %, digits and uppercase letters one or more times
\.txt - Match .txt
whereas your regex means
(%[A-Z][0-9])+.txt
(%[A-Z][0-9])+
% - Match %
[A-Z] - Match A to Z one time
[0-9] - Match any digit one time
+ - Match the captured group one or more times
.txt - Match a single character (anything except a newline) followed by txt
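As a quick check in JavaScript (variable names are made up), the character-class pattern does capture the whole encoded name. It is also quite permissive, since it does not require the % plus two hex digits shape at all.

```javascript
const classPattern = /([%\dA-Z]+)\.txt/;

// Captures the full percent-encoded name from the question's line.
console.log(classPattern.exec('<topicref href="%E4%BA%B0.txt"/>')[1]);

// Also accepts names that contain no percent-encoding at all.
console.log(classPattern.exec("README.txt")[1]);
```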

how to code correct URI regex

I have different URI patterns and am trying to find a regex that covers all of them, for example:
1) href="http://site.example.com/category/
and
2) href="http://site.example.com/en/page/
Using href=".+..+..+/(.+?)" works for the first URL, but for the second URL it skips en/page.
How can I read everything after href="http://site.example.com/ ?

This should do it:
[^\./]+\.[^\./]+\.[^\./]+(?:/(.*))?
That is:
[^\./]+ = (anything but . and /)
\. = dot
...? = Zero or one occurrence(s) of ...
(?:...)? = Zero or one of ..., which is more than one character, but without capturing ....
(?:/(.*))? = Capture everything after the first / that follows the host, if there is one.
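Applied to the two href strings from the question, the pattern can be tried out like this in JavaScript. It is a minimal sketch; pathAfterHost is a hypothetical helper that returns an empty string when the host is not followed by a slash.

```javascript
const hostThenPath = /[^\.\/]+\.[^\.\/]+\.[^\.\/]+(?:\/(.*))?/;

// Hypothetical helper: the part after "sub.domain.tld/", "" if there is
// no slash after the host, or null if no three-label host is found.
function pathAfterHost(text) {
  const m = hostThenPath.exec(text);
  if (m === null) return null;
  return m[1] || "";
}

console.log(pathAfterHost('href="http://site.example.com/category/'));
console.log(pathAfterHost('href="http://site.example.com/en/page/'));
```

Note that (.*) runs to the end of the input, so a closing quote after the URL would be captured as well.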

. in regex means any character (except \n newline), + means one or more of the previous expression, ? means 0 or 1 of previous expression; also forces minimal matching when an expression might match several strings within a search string (e.g. http://regexlib.com/CheatSheet.aspx).
A literal dot is matched by \..
So your regex boils down to: at least five arbitrary characters, a slash, and then at least one more character, matched as lazily as possible.
Meaning it matches even http:/. And it does match both of your examples (tested with egrep and grep -P), but only if you replace href=" by href=\" and leave the last " out. Otherwise it will match none.
What you probably wanted was something like:
.+\..+\..+/.*
Or, if you want to be sure to match only urls, you might consider
http[s]?://([a-z]+\.)?[a-z]+\.[a-z]+/?[a-z/]*
The http[s]?: as a fixed part starts the expression (the s in case the ref comes from a secure connection). [a-z] means match only lowercase letters. As you might stumble upon sites that don't have a subdomain in the name, like stackoverflow.com, the first [a-z]+\. is made optional with a question mark. The slash at the end of the URL is optional, too. [a-z/] means match only lowercase letters and slashes.
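The quantifier on the final [a-z/] class decides how much of the path ends up in the match. The JavaScript sketch below compares a trailing [a-z/]? with a trailing [a-z/]* on one of the strings from the question; the variable names are made up.

```javascript
const onePathChar = /http[s]?:\/\/([a-z]+\.)?[a-z]+\.[a-z]+\/?[a-z\/]?/;
const wholePath = /http[s]?:\/\/([a-z]+\.)?[a-z]+\.[a-z]+\/?[a-z\/]*/;

const hrefSample = 'href="http://site.example.com/en/page/';

console.log(onePathChar.exec(hrefSample)[0]); // stops one character into the path
console.log(wholePath.exec(hrefSample)[0]);   // runs to the end of the path

// A host without a subdomain still matches, because the first group is optional.
console.log(wholePath.exec("http://stackoverflow.com/")[0]);
```

Either way, the class only covers lowercase letters and slashes, so digits, dots or hyphens in the path end the match.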

regex match till a character from a second occurrence of a different character

My question is pretty similar to this question and the answer is almost fine. Only I need a regexp not just for character-to-character, but from the second occurrence of one character up to another character.
My purpose is to get the password from a URI, for example:
http://mylogin:mypassword@mywebpage.com
So in fact I need the text from the second ":" up to the "@".

You could give the following regex a go:
(?<=:)[^:]+?(?=@)
It matches any consecutive string not containing any : character, prefixed by a : and suffixed by a @.
Depending on your flavour of regex you might need something like:
:([^:]+?)@
This doesn't use lookarounds; it includes the : and @ in the match, but the password will be in the first capturing group.
The ? makes the quantifier lazy in case there should be any further @ characters in the actual URL string, and as such it is optional. Please note that this will match any character between : and @, even newlines and so on.
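Both variants can be compared side by side in JavaScript. This sketch assumes the standard @ separator between the credentials and the host, and the variable names are made up.

```javascript
const credentialUri = "http://mylogin:mypassword@mywebpage.com";

// Lookaround version: the match itself is the password.
const lookaroundPattern = /(?<=:)[^:]+?(?=@)/;

// Capture-group version: the match includes ":" and "@", group 1 is the password.
const groupPattern = /:([^:]+?)@/;

console.log(lookaroundPattern.exec(credentialUri)[0]);
console.log(groupPattern.exec(credentialUri)[0]);
console.log(groupPattern.exec(credentialUri)[1]);
```

Neither pattern starts at the colon after the scheme, because [^:] cannot cross the second colon to reach the @.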

Here's an easy one that does not need look-aheads or look-behinds:
.*:.*:([^@]+)@
Explanation:
.*:.*: matches everything up to (and including) the second colon (:)
([^@]+) matches the longest possible series of non-@ characters
@ matches the @ character.
If you run this regex, the first capturing group (the expression between parentheses) will contain the password.
Here it is in action: http://regex101.com/r/fT6rI0
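For a quick local check without the online tester, the same idea can be run in JavaScript. This is a sketch that assumes the standard @ separator; passwordOf is a hypothetical helper name.

```javascript
const secondColonPattern = /.*:.*:([^@]+)@/;

// Hypothetical helper: the password, or null when the URI has no
// "second colon ... @" section.
function passwordOf(uri) {
  const m = secondColonPattern.exec(uri);
  return m ? m[1] : null;
}

console.log(passwordOf("http://mylogin:mypassword@mywebpage.com"));
console.log(passwordOf("http://mywebpage.com"));
```

The greedy .* parts rely on backtracking to settle on the right colons, which also lets the pattern cope with a port after the host.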
