Transforming URLs to active links with regex

I have this PHP code that transforms URLs inside a text into active HTML links.
For example, the string
Hey check this cool link http://www.example.com
is transformed to:
Hey check this cool link <a href="http://www.example.com">http://www.example.com</a>
As you can see, it just adds the correct <a> HTML tag.
The code is this:
$active_links_text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "<a href=\"\\0\">\\0</a>", $original_text);
My question is: how can I make this work EXCEPT when the URL is a YouTube URL?
So I want this result. The string
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than http://www.example.com !!!
should be transformed to
Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ its even better than <a href="http://www.example.com">http://www.example.com</a> !!!
As you can see, the <a> HTML tag was added to the example.com URL but NOT to the YouTube URL.
How can I make this happen?
I hope I described my problem well enough, and I hope it's easy to implement. Last note: I am using this code in PHP 5.2.14.
Thank you guys!

[EDIT : Wow, I had gotten your question completely wrong! Below's a better attempt at helping you.]
I gave it a go in JS, since I'm not a PHP coder. Here is the regex: /(http:\/\/(?!www.youtube)[^<>\s]+)\b/g. The negative lookahead prevents a literal www.youtube match (the lookahead content can be adapted if you need a more complex pattern).
There's nothing JS-specific here to my knowledge, but I don't know the ereg regex syntax. With the preg functions you don't need to escape the slashes if you pick a delimiter other than /; the word boundary \b and the negative lookahead (?!pattern) are the same. The /g flag is for a global replacement, that is, not stopping at the first match.
PHP has no global flag as far as I know; preg_replace should already replace every match by default, so it plays the role of a replaceAll function.
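To make the negative-lookahead idea concrete, here is a minimal sketch in Python rather than PHP or JS. The function name linkify and the exact host check (youtube.com with an optional www.) are assumptions for illustration, not part of the original answer, and the sketch does no HTML escaping of the URL.

```python
import re

# http:// or https:// URLs, skipping any whose host is youtube.com
# (with or without a leading "www."). The lookahead runs right after "://".
URL_RE = re.compile(r'\bhttps?://(?!(?:www\.)?youtube\.com\b)[^<>\s]+[\w/]')

def linkify(text):
    """Wrap every non-YouTube URL in an <a> tag (hypothetical helper)."""
    return URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)

sample = ("Wow have you checked http://www.youtube.com/watch?v=dQw4w9WgXcQ "
          "its even better than http://www.example.com !!!")
print(linkify(sample))
```

The trailing [\w/] mirrors the [[:alnum:]/] in the question's pattern, so punctuation right after a URL is not swallowed into the link.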

You've made several mistakes about valid URI components. The scheme is defined as ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), not [[:alpha:]]+.
The part after the : of the scheme need not start with //; that's particular to http: and a few other file-oriented schemes. But the [[:alpha:]]+: start of your regex shows you weren't aiming to restrict yourself to http:. In that case, all printable ASCII characters are valid, i.e. everything from ! to ~, or [\x21-\x7E]* as a regex.
To summarize: [[:alpha:]][A-Za-z0-9+.-]*:[\x21-\x7E]* (note that the hyphen goes last in the class so it is not read as a range).
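As a quick check of that summary pattern, here is a small Python sketch. The helper name looks_like_uri is made up for illustration, and Python's re has no [[:alpha:]], so the first character is written as [A-Za-z].

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":" and
# any run of printable ASCII characters (0x21 to 0x7E).
URI_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:[\x21-\x7E]*')

def looks_like_uri(candidate):
    """True if the whole string has the loose scheme:rest shape (hypothetical helper)."""
    return URI_RE.fullmatch(candidate) is not None

for candidate in ["http://www.example.com", "mailto:user@example.com",
                  "svn+ssh://host/repo", "1http://x", "no scheme here"]:
    print(candidate, looks_like_uri(candidate))
```

This only checks the overall shape; it does not validate the scheme-specific part.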

Related

Azure frontdoor compliant RegEx rule for /users/*

I'm trying to create an "Azure Front Door" compliant regex that matches /users/* URL pathname patterns, but not /users/*/ or /users/*/profile or /users/*/<anything at all>.
I've tried without escaping as it looks like front door escapes for you.
^/users/([^/]+?)(?:/)?$
and this
^/users/[^/]+?$
But this doesn't work, I'm assuming because of the "?" which would count as a back reference? Any ideas on how to create a compliant regex, happy to try anything for anyone who doesn't have front door to test.
docs:
https://learn.microsoft.com/en-us/azure/frontdoor/rules-match-conditions?pivots=front-door-standard-premium&tabs=portal#regular-expressions
=== EDIT ===
I've accepted the answer below, but I think the culprit is the regex matching a preceding "/": every other operator in Front Door rules seems to want a preceding "/" except RegEx. Without it, this seems to work as expected in some basic testing.
With this in mind, I have a feeling (without trying) that some of the original formats will work.
You need to use
^/users/[^/\s]+$
Details
^ - start of string
/users/ - a fixed string
[^/\s]+ - one or more chars other than / and whitespace
$ - end of string.
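The pattern's behaviour can be sanity-checked locally. The sketch below uses Python's re module, which is an assumption: it is not Front Door's regex engine, so it only confirms the logic of the pattern, not Front Door compliance. The helper name is hypothetical.

```python
import re

USERS_RE = re.compile(r'^/users/[^/\s]+$')

def matches_users_path(path):
    """True for /users/<one segment> with nothing after it (hypothetical helper)."""
    return USERS_RE.search(path) is not None

for path in ["/users/42", "/users/42/", "/users/42/profile", "/users/"]:
    print(path, matches_users_path(path))
```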

How can I use regular expression to match urls starting with https and ending with #?

Very much a newb with regex and having a hard time figuring this one out. I have an HTML document and I want to clear out a ton of URLs that are inside of it. All of the URLs begin with https:// and they all end with a pound sign #.
Any help would be greatly appreciated. I'm using Sublime Text as my editor, in case that matters.
A basic way to do it:
\bhttps://[^\s#]+#
free-spaced:
\b //word start
https://
[^\s#]+ //followed by anything but whitespace and '#'
#
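Outside the editor, the same pattern can be applied in a script. This is a minimal Python sketch; the function name strip_urls and the sample markup are assumptions for illustration.

```python
import re

# https:// followed by anything except whitespace and '#', then the closing '#'.
URL_TO_HASH = re.compile(r'\bhttps://[^\s#]+#')

def strip_urls(html):
    """Delete every https://...# URL from the text (hypothetical helper)."""
    return URL_TO_HASH.sub('', html)

print(strip_urls('<a href="https://example.com/page#">x</a> keep'))
```

URLs that do not end in # are left alone, because the pattern requires the closing #.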
If you truly want to clear a whole line that starts with https and ends with #, including everything in between, you can use:
^(https)+(.)*(#)+$
But you may want to be more specific about what you are filtering out. If this comes from a database query you should be OK, since you can assume the URL will be the whole content of the field(s) returned, and you will be running the regex through a code loop of some kind.
BTW you can hone your scripts using something like http://regexpal.com/

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will return you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
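If a script is an option, collecting the matches directly avoids both the manual cleanup and the replace trick. Here is a minimal Python sketch using the question's own pattern; the function name extract_links and the sample text are assumptions.

```python
import re

# Same character class as the question's pattern; inside [...] the
# dot and slash need no escaping, and the hyphen goes last.
LINK_RE = re.compile(r'http://www\.[a-zA-Z0-9./-]+')

def extract_links(text):
    """Return every matching link, in order of appearance (hypothetical helper)."""
    return LINK_RE.findall(text)

text = "see http://www.link.com/folder/file.html and http://www.other.org/a-b, done"
print("\n".join(extract_links(text)))
```

As with the Notepad++ version, the pattern is not generic: a full stop directly after a URL would be captured as part of it.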
The earlier answer by @psxls was a great help when I wanted to do something similar.
However, that regex was written six years ago, so I had to adjust and update it to work properly with more recent links, because:
many URLs now use HTTPS instead of HTTP
many websites no longer use www as their main subdomain
some links contain punctuation marks (which have to be preserved)
I finally changed the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www\.[a-zA-Z0-9./-]\+' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it like this:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's default syntax, annoyingly, differs from most regex implementations in that some of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. That is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy, and grep's default syntax has no lazy quantifiers (appending \? to \+ does not make it lazy). The way you have your URL match formulated, that shouldn't cause problems, because the match stops at the first character outside the class. But if you change your match to something much broader, say . instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the next http or https, then everything that follows up to html or htm, and output just the URL itself (scheme, middle part and extension) with a carriage return/line feed after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create URL links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer isn't regex related, but here is another efficient way to get the lines containing URLs.
As Toto mentioned in the comments, this won't remove the text around the links.
It works as long as all links share a recognizable pattern, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Adding http:// to all links without a protocol

I use VB.NET and would like to add http:// to all links that don't already start with http://, https://, ftp:// and so on.
"I want to add http here Google,
but not here Google."
It was easy when I just had the links, but I can't find a good solution for an entire string containing multiple links. I guess RegEx is the way to go, but I wouldn't even know where to start.
I can find the RegEx myself, it's the parsing and prepending I'm having problems with. Could anyone give me an example with Regex.Replace() in C# or VB.NET?
Any help appreciated!
Quote RFC 1738:
"Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http")."
Excellent! A regex to match:
/^[a-zA-Z0-9+.-]+:\/\//
If that matches your href string, continue on. If not, prepend "http://". Remaining sanity checks are yours unless you ask for specific details. Do note the other commenters' thoughts about relative links.
EDIT: I'm starting to suspect that you've asked the wrong question... that you perhaps don't have anything that splits the text up into the individual tokens you need to handle it. See Looking for C# HTML parser
EDIT: As a blind try at ignoring all and just attacking the text, using case insensitive matching,
/(<a +href *= *")(.*?)(" *>)/
If the second back-reference matches /^[a-zA-Z0-9+.-]+:\/\//, do nothing. If it does not match, replace it with
$1 + "http://" + $2 + $3
This isn't C# syntax, but it should translate across without too much effort.
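The two-step approach above (find each href, then test it against the scheme regex) can be sketched in Python as follows. The names fix_links and add_scheme and the sample www.google.com URLs are assumptions for illustration; a real HTML parser would be more robust than this regex.

```python
import re

HREF_RE = re.compile(r'(<a +href *= *")(.*?)(" *>)', re.IGNORECASE)
SCHEME_RE = re.compile(r'^[a-zA-Z0-9+.-]+://')

def add_scheme(match):
    """Callback: prepend http:// unless the href already has a scheme."""
    prefix, url, suffix = match.groups()
    if SCHEME_RE.match(url):
        return match.group(0)  # already has a scheme, leave untouched
    return prefix + "http://" + url + suffix

def fix_links(html):
    return HREF_RE.sub(add_scheme, html)

html = ('I want to add http here <a href="www.google.com">Google</a>, '
        'but not here <a href="https://www.google.com">Google</a>.')
print(fix_links(html))
```

Relative links such as href="/about" would also get http:// prepended, which is the caveat other commenters raised.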
In PHP (should translate somewhat easily)
$text = preg_replace('/href="(?!(?:https?|ftp):\/\/)([^"]*)"/i', 'href="http://$1"', $text);
C#
result = new Regex("(href=\")(?!(?:https?|ftp)://)", RegexOptions.IgnoreCase).Replace(input, "${1}http://");
If you aren't concerned with potentially messing up local links, and you can always guarantee that the strings will be fully qualified domain names, then you can simply use the contains method:
Dim myUrl As String = "someUrlString".ToLower()
If Not myUrl.Contains("http://") AndAlso Not myUrl.Contains("https://") AndAlso Not myUrl.Contains("ftp://") Then
    'Execute your logic to prepend the proper protocol
    myUrl = "http://" & myUrl
End If
Keep in mind this leaves a lot of holes: it doesn't check which protocol should be added, or whether the URL is relative.
Edit: I chose specifically not to offer a RegEx solution since this is a simple check and RegEx is a little heavy for it (IMO).

Regular expression to add base domain to directory

Ten websites need to be cached. When reading the cached pages, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. Examples below.
Base domain: http://www.example.com
The problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
They would display correctly if it was img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
The regex should do something like: if (src=") isn't followed by the base domain, then add the base domain.
Without knowing the language, you can use the (maybe most portable) substitute operator:
s/(src=")\/?([^"]+")/$1http:\/\/www.example.com\/$2/g
This should do the following:
1. match the string 'src="' (and capture it in variable $1)
2. skip an optional leading /, then match one or more non-double-quote (") characters followed by " (and capture them in variable $2)
3. substitute 'http://www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes only if it isn't found.
To check for the domain, /www\.example\.com/i should do.
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
$regex_match = '~(src="|href=")(?!https?://)/?([^"]+")~i';
$regex_substitute = '$1http://www.example.com/$2';
$html = preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it, it's just off the cuff. I would be concerned about three things. First, I use ~ as the delimiter so that the / characters don't need escaping. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!https?://) negative lookahead. This should restrict the match to any src or href that isn't already an absolute URL, but it depends on the type of regex being used (PCRE supports lookaheads, POSIX does not).
The rest of the changes should be fine: I added href=" and made it case-insensitive (the i modifier). PHP has no g modifier; preg_replace replaces every match by default.
I hope that helps.
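For comparison, here is a minimal Python sketch of the same idea. It swaps in the standard library's urllib.parse.urljoin to resolve each src/href against the base instead of doing the prepending in the regex itself, so already-absolute URLs and leading slashes are handled by urljoin. The function name absolutize and the sample markup are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.example.com/"
ATTR_RE = re.compile(r'((?:src|href)=")([^"]+)(")', re.IGNORECASE)

def absolutize(html, base=BASE):
    """Resolve every src/href value against the base URL (hypothetical helper)."""
    def repl(match):
        return match.group(1) + urljoin(base, match.group(2)) + match.group(3)
    return ATTR_RE.sub(repl, html)

page = '<img src="thumb/123.jpg"><script src="/inc/123.js"></script>'
print(absolutize(page))
```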
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+