How to match plain text URL in a markdown? - regex

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, a plain url should become <url>, but an existing <url> shouldn't become <<url>>. Also, a link in the markdown can be in the form of (url), so we have to avoid matching the normal brackets too.
So my working regex for matching the plain text url in java is :
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But one problem remains: I also do not want to match this kind of url:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?

What you have here is a parsing problem. Regexes are fine, but using only regexes here will make a mess (supposing you manage it at all). After you fix this problem, you'll probably find yourself facing other ones, like URLs in code (between ` characters, or in lines starting with a tab or four spaces) that you don't want to replace.
A solution would be to split the text into lines and then:
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL-to-link change) only on lines which don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
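The line-by-line logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the parser linked in the answer: the function name wrap_bare_urls and the URL pattern are hypothetical, the reference-definition pattern is the one quoted above, and the scheme list is taken from the question. Inline code between backticks is not handled.

```python
import re

# Lines that must be left alone (patterns are assumptions for illustration).
REFERENCE_DEF = re.compile(r'^\[\d+\]:\s+')   # e.g. "[1]: http://slashdot.org"
CODE_LINE = re.compile(r'^(?:\t| {4})')       # indented code block line

# A bare URL: not directly preceded by "<" or "(", and not ending
# in whitespace, a bracket, or trailing sentence punctuation.
BARE_URL = re.compile(
    r'(?<![<(])\b((?:https?|ftp|file)://[^\s<>()]*[^\s<>().,;:!?])'
)

def wrap_bare_urls(markdown: str) -> str:
    """Wrap plain-text URLs in <...>, skipping incompatible lines."""
    out = []
    for line in markdown.split('\n'):
        if REFERENCE_DEF.match(line) or CODE_LINE.match(line):
            out.append(line)                          # leave untouched
        else:
            out.append(BARE_URL.sub(r'<\1>', line))   # url -> <url>
    return '\n'.join(out)

text = ("Dude, look at this url http://www.google.com .. it's a great search engine\n"
        "[1]: http://slashdot.org")
print(wrap_bare_urls(text))
```

Keeping the "skip this line" checks separate from the replacement keeps each regex small, and new exclusions can be added without touching the URL pattern.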

Related

Regular expression for finding embedded javascript urls

First of all, sorry for the question name. Regex problems are hard to name.
I'm building a program for code-reviewing javascript files. The approach is black-box, so all we get is the html code from a web page for example.
The idea is to find all the javascript files present in the code and then analyze them with some tool.
I'm having some issues with finding the javascript files, mainly because each webpage is different, so finding something that works for every webpage is complicated.
I have found the following problems with solutions.
Case I
text = '"somenameforafile.js"'
js_found = re.findall('"(.+?).js"', text)
Case II
text = '"https://somenameforafile.js"'
js_found_2 = re.findall('"https://(.+?).js"', text)
In case II I can catch things like s3.amazonaws.bucketname with some further filtering
The problem is that I'm finding things like the following (the js file is at the end):
setTimeout(ld,100)}a.P(1);var j="appendChild",h="createElement",k="src",n=d[h]("div"),v=n[j](d[h](z)),b=d[h]("iframe"),g="document",e="domain",o;n.style.display="none";m.insertBefore(n,m.firstChild).id=z;b.frameBorder="0";b.id=z+"-loader";if(/MSIE[ ]+6/.test(navigator.userAgent)){b.src="javascript:false"}b.allowTransparency="true";v[j](b);try{b.contentWindow[g].open()}catch(w){c[e]=d[e];o="javascript:var d="+g+".open();d.domain='"+d.domain+"';";b[k]=o+"void(0);"}try{var t=b.contentWindow[g];t.write(p());t.close()}catch(x){b[k]=o+'d.write("'+p().replace(/"/g,String.fromCharCode(92)+'"')+'");d.close();'}a.P(2)};ld()};nt()})({loader: "static.olark.com/jsclient/loader0.js",name:"olark",methods:["configure","extend","declare","identify"]});
Expected Output:
static.olark.com/jsclient/loader0.js
This could be handled with approach I, but the problem is that I get basically all the text with that approach. Is there any easy way to get urls embedded this deeply in random text?
You could use a negated character class [^\s"]+ to match, 1 or more times, any character that is not whitespace or a double quote, and capture that in group 1.
Then match the js part with \.js\b, escaping the dot and adding a word boundary after js to prevent it from being part of a larger word.
([^\s"]+)\.js\b
Regex demo
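A minimal Python sketch of applying this pattern, assuming the page source is already in a string. The helper name find_js_paths is hypothetical, and the sample is a shortened excerpt of the text from the question. It returns the whole match rather than group 1 so that the .js suffix is kept.

```python
import re

# Pattern from the answer: non-space, non-quote run, then ".js" at a word boundary.
JS_PATH = re.compile(r'([^\s"]+)\.js\b')

def find_js_paths(text):
    """Return every quoted-or-bare path ending in .js found in text."""
    return [m.group(0) for m in JS_PATH.finditer(text)]

# Shortened excerpt of the minified script shown in the question.
snippet = ('b.src="javascript:false"}})({loader: '
           '"static.olark.com/jsclient/loader0.js",name:"olark"});')
print(find_js_paths(snippet))
```

Because of the word boundary, a name such as data.json is not reported as a .js file.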

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my vb program told me I was getting too many matches.
Is there something wrong with the regEx expression?
The circle-brackets are meant to get 'groups' that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All that is left to do is go over each match and extract the substrings using regular string manipulation.
Check here (although regexr follows the JavaScript regex implementation, it is still useful in our scenario).
That said, I often see people point out that regexes are not suited to parsing HTML, so you might need an HTML parser for this. The ones I know of to recommend are HtmlAgilityPack (which, as far as I know, is no longer actively maintained) and AngleSharp.
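To illustrate the parser-based alternative, here is a minimal sketch in Python using the standard library's html.parser, swapped in here for the .NET libraries named above. The class and function names (AnchorCollector, extract_links) are hypothetical, and nested anchors are not handled.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._href = None        # href of the anchor currently open
        self._text = []          # text fragments seen inside that anchor
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_anchor:
            self.links.append((self._href, ''.join(self._text)))
            self._in_anchor = False

def extract_links(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links

print(extract_links("<a href='www.la.com/magic.htm'>magicians of los angeles</a>"))
```

The parser deals with quoting styles and extra attributes on the tag, which is where a hand-written regex tends to break.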
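I tried the following pattern and it worked.
\<a href=(.*?)\>(.*?)\<\/a\s*?\>
I also found three errors in your original string:
the / in </a was not escaped
the attribute name 'href' was captured in the first group
the trailing | adds an empty alternative, which matches at every position and inflates the match count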
Finally, I would like to recommend a great site for testing regex strings. It will help you debug really fast. Refer to this (it also demonstrates the result you want):
REGEX101

R regular expression: http matching

I'm having problems using regular expressions to match http links. I have a pattern that I would like to extract from a website's source code. The source code has 200+ lines with lots of HTML gibberish like </html><body..., useless links and useless images.
The http links that I need fall under this pattern:
<a href"http:www.google.com/....1,1">
<a href"http:www.google.com/....2,2">
<a href"http:www.google.com/....3,3">
I just want to get the http links, and the unique pattern to them is the ending. Please help, I've been stuck for hours experimenting with gsub, regexpr and grep.
It is difficult to write a regular expression that matches a generic URL (URL Matching); however, if you are always looking to match that exact pattern, you can try this:
`http:www\.google\.com/.*?(\d+),\1`
This will search for http:www.google.com followed by anything and ending with the same number on each side of a comma, which is what it appears you want from the pattern you displayed.
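The backreference can be checked with a short sketch. It is written in Python rather than R purely for illustration; the helper name find_links is hypothetical, the first two sample lines are copied from the question, and the third is an invented non-matching line.

```python
import re

# Pattern from the answer: \1 must repeat whatever (\d+) captured.
LINK = re.compile(r'http:www\.google\.com/.*?(\d+),\1')

def find_links(source):
    """Return the full text of every link whose ending is N,N."""
    return [m.group(0) for m in LINK.finditer(source)]

source = (
    '<a href"http:www.google.com/....1,1">\n'
    '<a href"http:www.google.com/....2,2">\n'
    '<a href"http:www.google.com/other/4,5">'   # invented: numbers differ
)
print(find_links(source))
```

Since . does not match a newline by default, each match stays within a single line of the source.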

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is, he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search & replace using regular expressions, but I am unfamiliar with regex myself. How might I catch the url in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny
Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to http://www.youtube.com/watch?v=JaNH56Vpg-A?
This works if there's a white space between different URLs.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to replace every service at once, since their shortcodes seem to differ. For Vimeo this would work. It allows one or more whitespace characters between "vimeo" and the URL, and it again needs whitespace after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
There may be a more robust way to write the expression (one that validates the syntax). This one's pretty straightforward, though.
The actual regex syntax depends on the language used. Hope this helps.
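As an example of how the syntax varies by language, here is a minimal Python sketch of the same two replacements. The function name strip_shortcodes and the Vimeo URL are hypothetical. Note that Python's re.sub writes the captured group as \1 where the plugin uses $1.

```python
import re

# Same patterns as above; (\S*) captures the URL up to the closing bracket.
YOUTUBE = re.compile(r'\[youtube=(\S*)\]')
VIMEO = re.compile(r'\[vimeo\s+(\S*)\]')

def strip_shortcodes(text):
    """Replace each video shortcode with the bare URL it contains."""
    text = YOUTUBE.sub(r'\1', text)
    return VIMEO.sub(r'\1', text)

post = ('Intro [youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A] '
        'and [vimeo http://vimeo.com/12345] end')
print(strip_shortcodes(post))
```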

Program to generate regex easily?

Let's say I have a url such as...
http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439
Is there a program where I can just put this test string in, highlight/select the parts I want to keep, then get rid of the rest and turn it into a regex expression to use? I just can't figure out regex for the life of me.
I am trying to scrape URLs on a website but they are all unique except for a few consistent characteristics. The consistent characteristics are highlighted in bold above that I want to keep, while ignoring all the non-bold...that way when I'm crawling the website it will follow URLs that are similar to the bolded parts.
The following code worked for me in TCL
% regexp -- {http://www.example.com/[a-zA-Z0-9-]*/video[0-9]*} http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439
1
%
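For comparison, a rough Python equivalent of the same check might look like the sketch below. The function name is_video_url is hypothetical. Two things differ from the Tcl one-liner and are assumptions of this sketch: the dots in the host name are escaped so they match only literal dots, and re.fullmatch requires the whole URL to fit the pattern, whereas Tcl's regexp searches anywhere in the string.

```python
import re

# Fixed host, any slug of letters/digits/hyphens, then "video" plus digits.
VIDEO_URL = re.compile(r'http://www\.example\.com/[a-zA-Z0-9-]*/video[0-9]*')

def is_video_url(url):
    """True if the whole URL has the slug/videoNNN shape."""
    return VIDEO_URL.fullmatch(url) is not None

print(is_video_url(
    'http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439'
))
```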