Replace URLs in text with HTML links, UNLESS already link - regex

I am trying to replace URLs in text with actual HTML links... UNLESS the URL is already an HTML link. I am working on redesigning a forum that allowed users to enter HTML, so savvy users have already typed in full links...
So this should get replaced.
I have found this cool link: http://www.cool-link.com
But this one should be left alone:
I have found this cool link: <a href="http://www.cool-link.com">http://www.cool-link.com</a>
So is there a way to replace "http://..." with a hyperlink, UNLESS said URL is preceded by href=" or sits between < and >?
EDIT: The various solutions I found online (including on Stack Overflow) would replace both of the above, and I cannot for the life of me change the pattern to meet my needs.
Cheers :)
Alix

The following regex should do it:
/(?<!(href=\")|(\>))(https?:\/\/.+\b)/gi
See it working here: https://regexr.com/4hnns
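As a rough sketch of the same idea in Python (the name linkify and the URL character class are my own assumptions, not part of the answer above): Python's re module only accepts fixed-width lookbehinds, so the two conditions are written as two separate lookbehinds. The URL body also stops at whitespace, quotes and angle brackets, because a greedy .+ can swallow text that follows the URL on the same line.

```python
import re

# Bare URL that is NOT directly preceded by href=" or by ">".
# Two lookbehinds are used because Python requires each one to be fixed-width.
BARE_URL = re.compile(r'(?<!href=")(?<!>)\bhttps?://[^\s<>"]+', re.IGNORECASE)


def linkify(text):
    """Wrap bare http(s) URLs in an anchor tag, leaving existing links alone."""
    return BARE_URL.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)),
        text,
    )


if __name__ == "__main__":
    print(linkify("I have found this cool link: http://www.cool-link.com"))
    print(linkify('Kept as is: <a href="http://www.cool-link.com">http://www.cool-link.com</a>'))
```

This is still a heuristic: trailing punctuation such as a full stop right after the URL ends up inside the link, and a URL placed in some other attribute (src=", for instance) is not protected.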

Related

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, processing url should become <url>, but processing an existing <url> shouldn't produce <<url>>. Also, the link in the markdown can be in the form of (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text URL in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of URL:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?
What you have here is a parsing problem. Regexes are fine, but using only regexes here will turn it into a mess (supposing you manage it at all). After you fix this problem, you'll probably find yourself facing other ones, like URLs in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split the text into lines and then:
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL to link change) only on lines that don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven Markdown parser; there are many of them.
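The line-by-line approach described above could be sketched in Python roughly as follows. The function name autolink, the code-line check and the exact URL character class are assumptions made for illustration; only the reference-definition pattern ^\[\d+\]:\s+ comes from the answer.

```python
import re

# Lines that must be left untouched.
REFERENCE_DEF = re.compile(r'^\[\d+\]:\s+')   # e.g. "[1]: http://slashdot.org"
CODE_LINE = re.compile(r'^(?: {4}|\t)')       # indented code block line

# A URL that is not already wrapped in <...> or (...).
BARE_URL = re.compile(r'(?<![<(])\b(?:https?|ftp|file)://[^\s<>()]+')


def autolink(markdown):
    """Wrap bare URLs in <...>, skipping reference definitions and code lines."""
    out = []
    for line in markdown.split('\n'):
        if REFERENCE_DEF.match(line) or CODE_LINE.match(line):
            out.append(line)  # incompatible pattern: keep the line as is
        else:
            out.append(BARE_URL.sub(lambda m: '<' + m.group(0) + '>', line))
    return '\n'.join(out)


if __name__ == "__main__":
    text = ("Dude, look at this url http://www.google.com .. it's a great search engine\n"
            "[1]: http://slashdot.org")
    print(autolink(text))
```

Inline code between backticks is not handled here; that would need one more rule, which is the kind of growth that makes a real Markdown parser attractive.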

Regex: Match any string but one ending with thanks/

I am attempting to set up a goal funnel in Google Analytics. It is for an online quote request system that we want to track. Basically all the pages that contain the quote request form have unique dynamically generated urls that are similar. The form of the URL is:
/quoterequest/categoryone/categorytwo/productname/
I have regex that works for tracking that:
^/quoterequest/([A-Za-z0-9/-]+)?
Today we added a thank you page after the user submits the form. The URL is always the same for that:
/quoterequest/thanks/
I would like to modify the above regex so that it continues to match any of the Quote Request URLs, but NOT that thank you URL. I have been trying different variations, including the negative lookahead, but unfortunately I am not very experienced with regex and I think I've been doing it completely incorrectly. Can anyone give me some insight into the correct way to do this?
You can use:
^\/quoterequest\/(?!thanks\/?$)(?:([A-Za-z0-9\-]+)\/?)*$
See it
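To see what the negative lookahead does, here is a small check of that pattern using Python's re module. The helper name matches_quote_url and the /quoterequest/thanksgiving/ sample are made up for illustration, and whether a given analytics tool accepts lookaheads is not something this sketch verifies.

```python
import re

# Same pattern as above; the slashes need no escaping in a Python raw string.
QUOTE_URL = re.compile(r'^/quoterequest/(?!thanks/?$)(?:([A-Za-z0-9\-]+)/?)*$')


def matches_quote_url(path):
    """Return True for quote request paths, False for the thank-you page."""
    return QUOTE_URL.match(path) is not None


if __name__ == "__main__":
    for path in ("/quoterequest/categoryone/categorytwo/productname/",
                 "/quoterequest/thanks/",
                 "/quoterequest/thanksgiving/"):
        print(path, matches_quote_url(path))
```

The $ inside the lookahead is what keeps a path segment that merely starts with "thanks" from being excluded.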

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is that he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search and replace using regular expressions, but I am unfamiliar with regex myself. How might I catch the URL in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny
Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to http://www.youtube.com/watch?v=JaNH56Vpg-A?
This works if there's a white space between different URLs.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to replace every service at once, since their shortcodes seem to differ. For Vimeo this would work. It allows any amount of whitespace between "vimeo" and the URL, and it again needs whitespace after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
Maybe there's a more robust way to write the expression (one which validates the correct syntax). This one's pretty straightforward, though.
The actual regex syntax depends on the language used. Hope this helps.
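For comparison, here is a sketch of the same two replacements in Python. The name strip_shortcodes and the Vimeo sample URL are assumptions. Unlike the patterns above, the capture group uses [^\s\]]* instead of \S*, so it stops at the first closing bracket and does not depend on whitespace between adjacent shortcodes.

```python
import re

# Capture everything up to the first "]" or whitespace.
YOUTUBE = re.compile(r'\[youtube=([^\s\]]*)\]')
VIMEO = re.compile(r'\[vimeo\s+([^\s\]]*)\]')


def strip_shortcodes(text):
    """Replace [youtube=URL] and [vimeo URL] shortcodes with the bare URL."""
    text = YOUTUBE.sub(r'\1', text)
    return VIMEO.sub(r'\1', text)


if __name__ == "__main__":
    print(strip_shortcodes("[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]"))
    print(strip_shortcodes("[vimeo   http://vimeo.com/12345]"))
```

In a search-and-replace plugin the backreference is usually written $1 rather than \1, as in the answer above.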

Why isn't DownThemAll able to recognize my reddit URL regular expression?

So I'm trying to download all my old reddit posts using a combination of AutoPagerize and DownThemAll.
Here are two sample URLs I want to distinguish between:
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2med97
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2meana?context=3
The regexp I'm trying to use is this: (\b)http://www.reddit.com/([^?\s]*)?
I want all my reddit posts downloaded, but I don't want any redundancy, so I want to match all of my reddit posts except for anything with a question mark (after which there's a "context=3" string).
I've used RegEx Buddy to show that the regexp fits the first URL but not the second one. However, DownThemAll does not recognize this. Is DownThemAll's ability to parse regexp limited, or am I doing something wrong?
For now, I've just decided to download them all, but to use a renaming mask of *subdirs*.*text*.*html* so that I can later mass remove anything containing the word "context" in its filename.
Reddit does have an API, you might want to take a look at that instead, might be easier.
https://github.com/reddit/reddit/wiki/API
EDIT: Looks like http://www.reddit.com/user/USERNAME/.json might be what you want
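Separately from the API route, one thing worth checking is whether the tool searches for the pattern anywhere in the URL or requires the whole URL to match. The Python sketch below illustrates the difference with the two sample URLs; the helper names are made up, and how DownThemAll applies its filters is not something verified here.

```python
import re

# Dots escaped; the URL may not contain "?" or whitespace after the host.
REDDIT_URL = re.compile(r'http://www\.reddit\.com/[^?\s]*')


def found_somewhere(url):
    """Search semantics: succeeds if any part of the URL matches."""
    return REDDIT_URL.search(url) is not None


def whole_url_matches(url):
    """Full-match semantics: the pattern must cover the entire URL."""
    return REDDIT_URL.fullmatch(url) is not None


if __name__ == "__main__":
    base = "http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/"
    for url in (base + "c2med97", base + "c2meana?context=3"):
        print(found_somewhere(url), whole_url_matches(url))
```

With search semantics the second URL still matches, because the pattern is satisfied by the part before the question mark. Only a full match, or an explicit $ at the end of the pattern, rejects it.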

I am trying to create an expression that will extract URLs

I want to extract URLs from a webpage. These are just URLs by themselves, not hyperlinks etc.; they are just text. Some examples would be http://www.example.com, http://example.com, www.example.com etc. I am extremely new to regex, so I have copied and pasted about 20 expressions from online, and all of them failed to work. I don't know if I am doing it right or not. Any help would be really appreciated.
I wrote a post on using regex to locate links within an HTML page (the intent was to use JavaScript to open external links or links to documents such as PDFs etc. in a popup window).
The final regex was:
^(?:[./]+)?(?:Assets|https?://(?!(?:www.)?integralist))
The full post is here:
http://www.integralist.co.uk/javascript/regular-expression-to-open-external-links-in-popup-window/
The solution won't be perfect but might help point you in the right direction.
Mark
You're probably not escaping your .s. You need to use \. for each one.
Take a look at strfriend.com. It has a URL example, and represents it graphically.
The example it suggests is:
^((ht|f)tp(s?)://|~/|/)?(\w+:\w+@)?([a-zA-Z]{1}([\w-]+.)+(\w{2,5}))(:\d{1,5})?((/?\w+/)+|/?)(\w+.\w{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
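For plain-text URLs like the ones in the question, a much looser pattern is often enough. The following Python sketch is only one possible approach, not the expression from either answer: the name extract_urls, the character class and the trailing-punctuation list are all assumptions. Note the escaped dot in www\., as recommended above.

```python
import re

# A URL starts with http://, https:// or www. and runs until whitespace,
# a quote or an angle bracket.
URL_RE = re.compile(r'\b(?:https?://|www\.)[^\s<>"\']+', re.IGNORECASE)


def extract_urls(text):
    """Return the URLs found in text, minus trailing sentence punctuation."""
    return [url.rstrip('.,;:!?)') for url in URL_RE.findall(text)]


if __name__ == "__main__":
    sample = ("Some examples would be http://www.example.com, "
              "http://example.com, www.example.com etc.")
    print(extract_urls(sample))
```

This does not validate the URLs in any way; it only finds candidates, which may be all that is needed when scraping text.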