How to write a correct URI regex

I have several different URI patterns and am trying to find one regex that covers all of them, for example:
1) href="http://site.example.com/category/
and
2) href="http://site.example.com/en/page/
Using href=".+..+..+/(.+?)" handles the first URL, but in the second URL it skips en/page.
How can I read everything after href="http://site.example.com/ ?

This should do it:
[^\./]+\.[^\./]+\.[^\./]+(?:/(.*))?
That is:
[^\./]+ = one or more characters that are neither . nor /
\. = a literal dot
...? = zero or one occurrence of ...
(?:...)? = zero or one occurrence of ..., where ... is longer than one character, grouped without capturing it
(?:/(.*))? = if a / follows the host name, capture everything after it
Tested here.
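As a quick check of the pattern above, here is a minimal Python sketch. The helper name path_after_host is made up for illustration, and the two sample strings are the ones from the question, without a closing quote.

```python
import re

# Pattern from the answer above: three dot-separated labels, then an optional "/rest".
HOST_AND_PATH = re.compile(r'[^\./]+\.[^\./]+\.[^\./]+(?:/(.*))?')


def path_after_host(text):
    """Return what follows the first "/" after the host, or None (hypothetical helper)."""
    match = HOST_AND_PATH.search(text)
    return match.group(1) if match else None


samples = [
    'href="http://site.example.com/category/',
    'href="http://site.example.com/en/page/',
]
for sample in samples:
    print(path_after_host(sample))
```

If the input still carries the closing quote of the attribute, that quote ends up in the captured group as well, so it may need to be stripped first.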

. in a regex means any character (except the newline \n), + means one or more of the previous expression, and ? means 0 or 1 of the previous expression; placed after a quantifier, ? instead forces minimal matching when an expression might match several strings within a search string (see e.g. http://regexlib.com/CheatSheet.aspx).
A literal dot is matched by \..
So your regex boils down to: at least five arbitrary characters, a slash, and then at least one more character, matched as lazily as possible.
That means it even matches http:/. It does match both of your examples (tested with egrep and grep -P), but only if you replace href=" with href=\" and leave out the final ". Otherwise it matches neither.
What you probably wanted was something like:
.+\..+\..+/.*
Or, if you want to be sure to match only URLs, you might consider
http[s]?://([a-z]+\.)?[a-z]+\.[a-z]+/?[a-z/]*
The fixed part http[s]?: starts the expression (the s covers links that use a secure connection). [a-z] matches only lowercase letters. Since you might stumble upon sites that don't have a subdomain in the name, like stackoverflow.com, the first [a-z]+\. is made optional with a question mark, and so is the slash at the end of the host. [a-z/]* matches any run of lowercase letters and slashes.
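To make the difference between an unescaped and an escaped dot concrete, here is a small Python sketch. It compares the asker's pattern with a stricter one; the variable names are illustrative, and the trailing * on the last character class is an assumption so that a path of any length is accepted.

```python
import re

# The asker's pattern: every "." is a wildcard, so almost anything with a slash matches.
loose = re.compile(r'.+..+..+/(.+?)')
# Stricter sketch with escaped dots; the final "*" (an assumption) allows a path of any length.
strict = re.compile(r'https?://([a-z]+\.)?[a-z]+\.[a-z]+/?[a-z/]*')

candidates = [
    'http:/.',
    'http://stackoverflow.com',
    'http://site.example.com/en/page/',
]
for text in candidates:
    # Print whether each pattern accepts the whole string.
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```

The loose pattern accepts all three strings, including http:/. which is not a URL at all, while the strict one rejects it.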

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract the occurrences of the clause match-this that are enclosed by anything other than tildes (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches in the string above. They are explained below:
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the occurrences enclosed by tildes, but when I tried to negate it with (?!(~match-this)), it matched nothing but empty strings.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match above). When I added an asterisk to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with a single regex, or does it need more than one?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
See a regex demo and a Python demo.
Example
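import re
# Alternative 1 consumes the ~...~ spans to skip; alternative 2 captures what to keep in group 1
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 for every match; it is empty whenever alternative 1 matched
res = [m for m in re.findall(pattern, s) if m]
print(res)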
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
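The "discard what is not captured" idea can be sketched in Python as follows. This is only an illustration; the variable names are made up, and the string is the one from the question.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# Alternative 1 swallows "~match-this~"; only alternative 2 fills capture group 1.
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep the start offset of every match that actually captured something.
kept = [m.start(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)
```

Five offsets remain; the two occurrences wrapped in tildes on both sides are consumed by the first alternative and dropped.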
For this you need both kinds of lookarounds. The following matches the 5 spots you want. There is a reason why it only works this way and why the prefix and/or suffix can't be included in the match:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem at the start of the text. The latter is a problem because the engine has then advanced too far: a character needed to judge the next candidate has already been consumed.
(^|[^~]) would solve the first problem: either the text starts here or there must be a character other than ~. We could do the same for the end of the text, but this is a dead end anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
By the nature of lookarounds, the character in front or behind cannot be captured. Additionally, if you also want to match a leading or a trailing character, this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use this. For me it works fine in TextPad 8 and it should also work in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, just extract it from the input when processing the matches, since they also come with starting positions and lengths: a first match at position 0 with a length of 10 characters means you look at input positions -1 and 10.
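Here is a rough Python sketch of that last suggestion: match with the lookarounds only, then read the neighbouring characters from the input by position. Python's re module is used as an assumption (the answer was tested in TextPad and PCRE), and the helper neighbours is hypothetical.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
lookarounds = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)


def neighbours(text, match):
    """Return the characters just before and after a match ('' at the text boundaries)."""
    before = text[match.start() - 1] if match.start() > 0 else ""
    after = text[match.end()] if match.end() < len(text) else ""
    return before, after


# One (start offset, char before, char after) tuple per match.
found = [(m.start(), *neighbours(s, m)) for m in lookarounds.finditer(s)]
for item in found:
    print(item)
```

Because the lookarounds consume nothing, every match is exactly match-this, and the surrounding characters are still available in the input for inspection.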

Regexp to extract the components of a URL

I am trying to write a regexp to extract an URL components. The syntax can be found here: RFC 3986.
Some of the components are optional. So far I have:
(.+)://((.*)@)?(.+?)(:(\d*))?/((.*)\?)?((.*)#)?(.*)
The decomposition is:
(.+):// matches the scheme followed by ://. Not optional.
((.*)@)? matches the user information part of authority. Optional.
(.+?) matches the host. Not optional. There is an issue here where this group will also match the optional port.
(:(\d*))? should match the port.
/ this and all that follows should be made optional.
((.*)\?)? matches the path part. Optional.
((.*)#)? matches the query part. Optional.
(.*) matches the fragment part. Optional.
How can I improve this regexp so that it is RFC3986-valid ?
Fun fact: this regexp matches itself.
Example URL (taken from the RFC): foo://example.com:8042/over/there?name=ferret#nose
Edit: I forgot to escape d. Now all that's left to do is to make everything that follows the host optional, including the leading /.
Your regular expression works fine if you just escape the slashes and preferably the colon as well. The result is (.+)\:\/\/(.*@)?(.+?)(:(\d*))?\/((.*)\?)?((.*)#)?(.*). Here is a simple script to show how it can be used to filter out invalid URIs:
Update
Following the comments I have made the following modification:
I have added (\:((\d*)\/))?(\/)*. Explanation:
\:(\d*) matches a colon followed by any string of digits.
The \/ after it matches the slash that must follow that string of digits. The port must not contain anything but digits, so no other character can appear in the port portion of the URI.
The entire port-matching expression is optional, hence the ?.
The last part, (\/)*, allows any number of slashes (including none) to follow, whether or not a port is present.
Final regExp:
(.+)\:\/\/(.*@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)
// A regex literal keeps the backslashes; in a string passed to new RegExp, "\d" would collapse to "d"
const myRegEx = /(.+)\:\/\/(.*@)?(.+?)(\:((\d*)\/))?(\/)*((.*)\?)?((.*)\#)?(.*)/;
const allUris = [
/*Valid*/ "https://me@data.www.example.com:5050/page?query=value#element",
/*Valid*/ "foo://example.com:8042/over/there?name=ferret#nose",
/*Valid*/ "foo://example.com",
/*Not valid*/ "www.example.com"];
// Keep only the URIs that the regexp matches
const allowedUris = allUris.filter(uri => myRegEx.test(uri));
console.log("Here are the valid URIs:");
console.log(allowedUris.join("\n\n")); // Should print only the first three URIs from the array.
I have found a better way to handle this, while also not breaking the capture groups.
(\w[\w\d\+\.-]*)://(.+@)?([^:/\?#]+)(:\d+)?(/[^\?#]*)?(\?[^#]+)?(#.*)?
Decomposition:
(\w[\w\d\+\.-]*):// matches a valid scheme, as per RFC 3986.
(.+@)? matches user information; i.e. everything until @, optionally.
([^:/\?#]+) matches host; i.e. everything until : or / or ? or # is encountered.
(:\d+)? matches port; i.e. all digits, optionally.
(/[^\?#]*)? matches path; i.e. / plus optionally every character until ? or # is encountered, optionally.
(\?[^#]+)? matches query; i.e. ? plus every character until # is encountered, optionally.
(#.*)? matches fragment; i.e. # plus everything after, optionally.
EDIT: final version I'm using (capture groups added for better extraction + discarding new-line characters):
(\w[\w\d\+\.-]*)://((.+)@)?([^:/\?#\r\n]+)(:(\d+))?(/([^\?#\r\n]*))?(\?([^#\r\n]*))?(#(.*))?
Try it online against randomly generated URLs.
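For illustration, the final pattern can be exercised from Python roughly like this. The sketch assumes that @ terminates the user information, as RFC 3986 specifies; the function split_uri and the dictionary keys are hypothetical names, not part of the answer.

```python
import re

# Assumption: "@" ends the user information, as in RFC 3986.
URI = re.compile(
    r'(\w[\w\d\+\.-]*)://((.+)@)?([^:/\?#\r\n]+)(:(\d+))?'
    r'(/([^\?#\r\n]*))?(\?([^#\r\n]*))?(#(.*))?'
)


def split_uri(uri):
    """Split a URI into named components, or return None if it does not match."""
    match = URI.fullmatch(uri)
    if match is None:
        return None
    # Group numbers follow the opening parentheses of the pattern, left to right.
    return {
        "scheme": match.group(1),
        "userinfo": match.group(3),
        "host": match.group(4),
        "port": match.group(6),
        "path": match.group(8),
        "query": match.group(10),
        "fragment": match.group(12),
    }


parts = split_uri("foo://example.com:8042/over/there?name=ferret#nose")
print(parts)
```

Components that are absent come back as None, and a string without a scheme such as www.example.com does not match at all.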

Regex about url encoded string

I would like to write a regex that gets the URL-encoded string in the line below:
<topicref href="%E4%BA%B0.txt"/>
When I used a regex like (%[A-Z][0-9])+\.txt it only got %B0.txt. What can I do if I want to get the whole URL-encoded string, i.e. %E4%BA%B0.txt?
Thanks a lot.
Proper URL encoding uses hex digits only, A-F not A-Z. The encoded URL could contain non-encoded characters anywhere. Also, you should escape the full stop.
((%[0-9A-F]{2}|[^<>'" %])+)\.txt
is a quick ad-hoc fix for your regex, though obviously for any production code, probably don't use a regex for this at all, or at the very least try a well-defined and properly tested URL regex like the one you can find in the HTTP RFC.
Putting the + quantifier outside the capturing parentheses will only return the last repetition. I added a second set of parentheses to put the quantifier inside the first capture group, which assumes you are doing something to extract the first capture group in particular. (If your regex dialect has non-capturing groups, you could make the second opening parenthesis non-capturing, i.e. write (?: instead of (.)
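The effect of moving the quantifier inside the group can be tried out with a short Python sketch. The inner group is written as non-capturing here, which is the optional variant mentioned above; the final decoding step via urllib.parse.unquote is an addition for illustration and assumes the escapes encode UTF-8.

```python
import re
from urllib.parse import unquote

line = '<topicref href="%E4%BA%B0.txt"/>'

# The asker's pattern: the group is repeated, and %BA breaks the run, so only %B0 is left.
print(re.search(r'(%[A-Z][0-9])+\.txt', line).group(0))

# Quantifier moved inside group 1; the inner group is non-capturing.
encoded = re.search(r'''((?:%[0-9A-F]{2}|[^<>'" %])+)\.txt''', line).group(1)
print(encoded)
print(unquote(encoded))  # decode the percent-escapes as UTF-8
```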
You need to change your regex to
([%\dA-Z]+)\.txt
([%\dA-Z]+) - Match %, digits and uppercase letters, one or more times
\.txt - Match .txt
whereas your regex means
(%[A-Z][0-9])+.txt
(%[A-Z][0-9])+
% - Match %
[A-Z] - Match A to Z one time
[0-9] - Match any digit one time
+ - Match the captured group one or more times
.txt - Match a single character (anything except a newline) followed by txt

Regex match up to a character, starting from the second occurrence of a different character

My question is pretty similar to this question and the answer is almost fine. However, I need a regexp that does not go from one character to another, but from the second occurrence of a character up to another character.
My purpose is to get the password from a URI, for example:
http://mylogin:mypassword#mywebpage.com
So in fact I need the text between the second ":" and the "#".
You could give the following regex a go:
(?<=:)[^:]+?(?=#)
It matches any consecutive string not containing any : character, prefixed by a : and suffixed by a #.
Depending on your flavour of regex you might need something like:
:([^:]+?)#
This doesn't use lookarounds, so the match includes the : and the #, but the password will be in the first capturing group.
The ? makes the quantifier lazy in case there are any further # characters in the actual URL string; as such it is optional. Please note that this will match any character between : and #, even newlines and so on.
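Both variants can be compared side by side in a small Python sketch (Python's re module supports the fixed-width lookbehind used here). The variable names are illustrative and the URI is the one from the question.

```python
import re

uri = "http://mylogin:mypassword#mywebpage.com"

# Lookaround version: the match itself is the password.
with_lookarounds = re.search(r'(?<=:)[^:]+?(?=#)', uri)
# Capturing version: the whole match includes ":" and "#", group 1 holds the password.
with_group = re.search(r':([^:]+?)#', uri)

print(with_lookarounds.group(0))
print(with_group.group(0), with_group.group(1))
```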
Here's an easy one that does not need look-aheads or look-behinds:
.*:.*:([^#]+)#
Explanation:
.*:.*: matches everything up to (and including) the second colon (:)
([^#]+) matches the longest possible series of non-# characters
# - matches the # character.
If you run this regex, the first capturing group (the expression between parentheses) will contain the password.
Here it is in action: http://regex101.com/r/fT6rI0
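A minimal Python sketch of the same pattern, in case the linked demo is unavailable. The helper name extract_password is hypothetical, and the second input is an invented example of a URI without credentials.

```python
import re


def extract_password(uri):
    """Return the text between the second ":" and the "#", or None (hypothetical helper)."""
    match = re.match(r'.*:.*:([^#]+)#', uri)
    return match.group(1) if match else None


print(extract_password("http://mylogin:mypassword#mywebpage.com"))
print(extract_password("http://mywebpage.com"))  # only one colon, so no match
```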

I can't find a regex for this Regular Expression

I am not very good with regular expressions, and I just have a simple question here.
I have a list of links in this way:
http://domain.com/andrei/sometext
http://domain2.com/someothertext/sometextyouknow/whoknows
http://domain341.com/text/thisisit/haha
I just want two regular expressions, to take this out:
http://domain.com/andrei/
http://domain2.com/someothertext/
http://domain341.com/text/
This is the first regex that I need, and I need another regex only to take out the domain, but I guess I'll figure that out if somebody could tell me the regex to take out only what I wrote.
This is what you (most likely) need:
[a-z]+://([^/ ]+)(?:/[^/ ]*/?)?
Here's how it works:
[a-z]+ part is for protocol name (this means, "1 or more letters" - it will match http/https/file/ftp/gopher/foo/whatever protocol, but if you want to match only "http" you can write it explicitly)
:// is literally what it says ;)
[^/ ]+ is one or more non-slash, non-space characters. It can be "a", an FQDN, an IP address, whatever.
(?:/[^/ ]*/?)? - this one is more complicated. The ? at the end means that the whole thing in parentheses may or may not be there (it is optional). ?: immediately inside the parentheses makes the group non-capturing (it is not assigned a number and cannot be referred to later by that number). [^/ ]* means 0 or more non-slash, non-space characters, and the question mark after the trailing slash, again, states that the slash is optional.
Overall, this ensures matches for things like this:
http://foo/bar/baz/something -> http://foo/bar/
http://hello.world.example.com/ -> http://hello.world.example.com/
http://foo.net -> http://foo.net
ftp://ftp.mozilla.org/pub -> ftp://ftp.mozilla.org/pub
NOTE #1: I did not use escaping for forward slashes intentionally to make the expression more readable, so make sure you use some other character as a delimiter, OR escape all the appearances of / - use \/ instead.
NOTE #2: Add i modifier if you want the expression to be case-insensitive (a-z will not match caps), and g modifier if you want to make multiple matches in one big block of text.
In the matches, subpattern 0 will be the whole matched thing, and subpattern 1 - only hostname
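Applied to the three links from the question, the pattern behaves as in the following Python sketch. The list and variable names are illustrative only.

```python
import re

LINK = re.compile(r'[a-z]+://([^/ ]+)(?:/[^/ ]*/?)?')

links = [
    "http://domain.com/andrei/sometext",
    "http://domain2.com/someothertext/sometextyouknow/whoknows",
    "http://domain341.com/text/thisisit/haha",
]
results = []
for link in links:
    m = LINK.match(link)
    # Subpattern 0 is the trimmed link, subpattern 1 is the hostname.
    results.append((m.group(0), m.group(1)))
    print(m.group(0), m.group(1))
```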
This is probably what you are looking for:
([a-zA-Z]+://([\w.]*)/(?:.*?/)?)
The whole match is in group 1 and just the domain is in group 2. No need for 2 regular expressions. :)
Use the regex https?:\/\/[^\/]+\/[^\/]+\/(.*) for your first task: replace $1 with the empty string ''.
Use the regex https?:\/\/([^\/]+) for your second task: the match $1 is the domain name.
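One possible way to apply these two patterns from Python is sketched below. Since re.sub replaces the whole match rather than a single group, the sketch captures the part to keep and drops the tail, which is a slight variation on the approach described above; both function names are hypothetical.

```python
import re


def keep_first_segment(url):
    """Drop everything after the first path segment (hypothetical helper)."""
    # re.sub replaces the whole match, so the prefix to keep is captured and re-emitted.
    return re.sub(r'(https?://[^/]+/[^/]+/).*', r'\1', url)


def domain_of(url):
    """Return the domain name, or None if the URL does not start with http(s)://."""
    m = re.match(r'https?://([^/]+)', url)
    return m.group(1) if m else None


url = "http://domain341.com/text/thisisit/haha"
print(keep_first_segment(url))
print(domain_of(url))
```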