Fixing incorrect server scheme in URL input - regex

I'm accepting web address input from a form, but some URLs are formatted incorrectly, like this (don't ask):
http:/www.foo.com
ftp:/ftp.bar.com
scp:/meh.foobar.com
What I want to do is detect if only one forward slash is present and add a second one.
I've never been any good at regular expressions, so I tried a combination of substr and parse_url to pick out the scheme and strip it. But it turned into a mess: it mangled valid URLs, producing a triple /// after the scheme and dropping letters from the hostname in some cases.
Help please :)

Use this regular expression: '.*:/[a-zA-Z].*'
It will match:
http:/www.foo.com
ftp:/ftp.bar.com
scp:/meh.foobar.com
and will not match:
http://www.foo.com
ftp://ftp.bar.com
scp://meh.foobar.com
Then fix the invalid scheme with a string replace (I don't know what language you are using):
url = url.replace(':/', '://')
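If you are in PHP (which the mention of parse_url suggests, though that's an assumption), a single preg_replace can do the detection and the fix in one pass. A minimal sketch:

<?php
// Add the missing slash only when the scheme is followed by a single "/".
// The negative lookahead (?!/) leaves already-valid "://" URLs untouched.
function fixScheme(string $url): string {
    return preg_replace('#^([a-z][a-z0-9+.-]*):/(?!/)#i', '$1://', $url);
}

echo fixScheme('http:/www.foo.com'), "\n";   // http://www.foo.com
echo fixScheme('ftp://ftp.bar.com'), "\n";   // ftp://ftp.bar.com (unchanged)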

Related

regular expressions: catch any URLs of the domain example.com

I'm trying to get regexp code for the case below. I've made multiple attempts, but in vain.
I need to catch any URLs of the domain site.com. I tried using the regexp ^site.com/*$,
but it does not recognize them.
I'm just looking for regexp code which matches site.com/*
With your expression ^site.com/*$ you match all strings that start with site.com and have zero or more trailing / characters (/*).
If you want to match any strings starting with site.com/ you might want to try ^site\.com/.*$:
There are already a lot of other regex questions about domain names on SO, but it's not clear to me in what context you are trying to do this or what goal you actually want to achieve. If you describe your needs more precisely, you can probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), and if you want to capture something like site.com/foo you need a pattern that isn't limited in how many characters can follow. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
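The question doesn't say which language this is for, so purely as an assumed illustration, here is how that grouped pattern behaves in PHP:

<?php
// Capture the domain prefix and whatever follows it, per ^(site\.com\/)(.+)$
$url = 'site.com/foo/bar?x=1';   // hypothetical test input
if (preg_match('#^(site\.com/)(.+)$#', $url, $m)) {
    echo "domain part: {$m[1]}\n";   // site.com/
    echo "rest:        {$m[2]}\n";   // foo/bar?x=1
}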
Your regex ^site.com/*$ only matches strings like the following:
e.g. site.com/, site.com////////, site.com
because the * asterisk in regex means "match 0 or more of the preceding token".
So this should work:
^site\.com\/.*$

RegEx for URL ending with a query string

I've been using Ensighten for tag management. The way it manages conditions (which pages the tracking scripts are deployed onto) is by using RegEx for the protocol, host and path separately.
Right now, my current condition looks a bit like this:
Protocol: ^(https?)$
Host: ^((www|www-qa)\.example\.com)$
Path: ^(/section-one/page/?|/section-two/page/?|/section-three/page/?)$
This works fine. However, I've been asked to add a URL ending with a query string, and that's where I'm having an issue.
Essentially, I need to also target a URL with the following format:
http://www.example.com/section-one/page?&var123=456
How do I edit my RegEx for the URL path to include this path?
/section-one/page?&var[any numbers, letters, symbols]
Note that for this /section-one/, I only want to target /page or /page + a query string, no subpages. I don't want to target a specific query string. I also want the other pages already in my RegEx to remain included.
How do I write this expression? I have to stick to the "must match this RegEx" single-expression format.
Thanks!
An improvement on what you've got:
^/section-(one|two|three)/page/?$
Backslash can generally escape the special meaning of a question mark, though it depends on flavor.
^/section-(one|two|three)/page/?($|\?)
This assumes that the $ above is more than just a formality, i.e. that it matches the end of the URL.
If you need to use capturing parentheses to actually store the query string, the greedy .* will capture everything after the ?, giving you:
^/section-(one|two|three)/page/?($|\?.*)
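Ensighten applies these patterns itself, but if you want to sanity-check the path expression outside the tool, here is a quick PHP sketch (the sample paths below are just assumptions):

<?php
$pattern = '#^/section-(one|two|three)/page/?($|\?.*)#';
$paths = [
    '/section-one/page',               // should match
    '/section-one/page?&var123=456',   // should match (query string allowed)
    '/section-one/page/subpage',       // should not match (subpage)
];
foreach ($paths as $p) {
    printf("%-32s %s\n", $p, preg_match($pattern, $p) ? 'match' : 'no match');
}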

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (e.g. http://www.link.com/folder/file.html) from a document with Notepad++. Actually, I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all the links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all the links?
If I replace the matches with nothing, obviously the links themselves get deleted, but I want the exact opposite: everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
The answer previously given by @psxls was a great help to me when I wanted to perform a similar process.
However, that regex rule was written six years ago now, so I had to adjust/complete/update it so that it properly handles more recent links, because:
a lot of URLs now use the HTTPS instead of the HTTP protocol
many websites no longer use www as the main subdomain
some links contain punctuation marks (which have to be preserved)
I finally reworked the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in Notepad++. The regex you would have to construct would be... horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
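If you'd rather script the extraction than run it through Notepad++ or grep, the same idea is a few lines of PHP; the input filename below is hypothetical:

<?php
// Pull http/https URLs out of a file and print one per line.
$text = file_get_contents('input.txt');              // hypothetical input file
preg_match_all('#https?://[a-zA-Z0-9./-]+#', $text, $matches);
echo implode("\n", $matches[0]), "\n";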
I did this a different way.
Find everything up to the first/next occurrence of (https or http), then everything that follows up to (html or htm), and output just the '(https or http)(everything in between)(html or htm)' part with a line feed/carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
This saves looking for all possible (including non-generic) URL matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove the text around the links, as Toto mentioned in the comments.
But at least it works if there is a nice pattern common to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of the same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether they match your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only alphanumeric characters are legal for name and value.
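If it helps to verify these against your example URLs, here is a small PHP check (just an illustration; the tracking system will have its own matcher):

<?php
$products = '#^.*\/products$#';
$posters  = '#^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$#';

var_dump((bool) preg_match($products, '/products'));                                        // true
var_dump((bool) preg_match($posters, '/products/18/indoor-posters?someParam=someValue'));   // true
var_dump((bool) preg_match($posters, '/products/18/indoor-postersssss'));                   // false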

Adding http:// to all links without a protocol

I use VB.NET and would like to add http:// to all links that don't already start with http://, https://, ftp:// and so on.
"I want to add http here Google,
but not here Google."
It was easy when I just had the links, but I can't find a good solution for an entire string containing multiple links. I guess RegEx is the way to go, but I wouldn't even know where to start.
I can find the RegEx myself, it's the parsing and prepending I'm having problems with. Could anyone give me an example with Regex.Replace() in C# or VB.NET?
Any help appreciated!
Quote RFC 1738:
"Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http")."
Excellent! A regex to match:
/^[a-zA-Z0-9+.-]+:\/\//
If that matches your href string, continue on. If not, prepend "http://". Remaining sanity checks are yours unless you ask for specific details. Do note the other commenters' thoughts about relative links.
EDIT: I'm starting to suspect that you've asked the wrong question... that you perhaps don't have anything that splits the text up into the individual tokens you need to handle it. See Looking for C# HTML parser
EDIT: As a blind try at ignoring all and just attacking the text, using case insensitive matching,
/(<a +href *= *")(.*?)(" *>)/
If the second back-reference matches /^[a-zA-Z0-9+.-]+:\/\//, do nothing. If it does not match, replace it with
$1 + "http://" + $2 + $3
This isn't C# syntax, but it should translate across without too much effort.
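As an assumed, illustrative rendering of that recipe in PHP (the question asks for VB.NET or C#, but this ports readily to .NET with Regex.Replace and a MatchEvaluator):

<?php
// Prepend "http://" to href values that carry no scheme (per the RFC 1738
// scheme characters quoted above); hrefs that already have one are left alone.
function addHttp(string $html): string {
    return preg_replace_callback(
        '/(<a\s+href\s*=\s*")([^"]*)(")/i',
        function (array $m): string {
            $hasScheme = preg_match('#^[a-zA-Z0-9+.-]+://#', $m[2]);
            return $m[1] . ($hasScheme ? '' : 'http://') . $m[2] . $m[3];
        },
        $html
    );
}

echo addHttp('<a href="www.example.com">x</a> and <a href="https://example.com">y</a>');
// <a href="http://www.example.com">x</a> and <a href="https://example.com">y</a>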
In PHP (this should translate somewhat easily):
$text = preg_replace('/href="(?!(?:http|ftp|https):\/\/)([^"]*)"/i', 'href="http://$1"', $text);
C#
result = new Regex("(href=\")(?!(?:http|https|ftp)://)", RegexOptions.IgnoreCase).Replace(input, "$1//");
If you aren't concerned with potentially messing up local links, and you can always guarantee that the strings will be fully qualified domain names, then you can simply use the contains method:
Dim myUrl As String = "someUrlString".ToLower()
If Not myUrl.Contains("http://") AndAlso Not myUrl.Contains("https://") AndAlso Not myUrl.Contains("ftp://") Then
'Execute your logic to prepend the proper protocol
myUrl = "http://" & myUrl
End If
Keep in mind this leaves a lot of holes: it doesn't check which protocol should actually be prepended, or whether the URL is relative.
Edit: I chose specifically not to offer a RegEx solution since this is a simple check and RegEx is a little heavy for it (IMO).