I am trying to create an expression that will extract URLs - regex

I want to extract URLs from a webpage. These are plain-text URLs, not hyperlinks. Some examples would be http://www.example.com, http://example.com, www.example.com, etc. I am extremely new to regex, so I have copied and pasted about 20 expressions I found online, and all of them failed to work. I don't know if I am doing it right or not. Any help would be really appreciated.

I wrote a post on using regex to locate links within an HTML page (the intent was to use JavaScript to open external links, or links to documents such as PDFs, in a popup window).
The final regex was:
^(?:[./]+)?(?:Assets|https?://(?!(?:www.)?integralist))
The full post is here:
http://www.integralist.co.uk/javascript/regular-expression-to-open-external-links-in-popup-window/
The solution won't be perfect, but it might help point you in the right direction.
Mark

You're probably not escaping your .s. You need to use \. for each one.
Take a look at strfriend.com. It has a URL example, and represents it graphically.
The example it suggests is:
^((ht|f)tp(s?)://|~/|/)?(\w+:\w+@)?([a-zA-Z]{1}([\w-]+\.)+(\w{2,5}))(:\d{1,5})?((/?\w+/)+|/?)(\w+\.\w{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
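If the goal is only to pull URLs out of running text (rather than validate them), a much looser pattern is often enough. Here is a minimal Python sketch, assuming a URL starts with http://, https:// or www. and runs until the next whitespace, quote or angle bracket. The names URL_RE and extract_urls are made up for this example, and the pattern is an illustration, not the strfriend one.

```python
import re

# Assumption: a URL starts with http://, https:// or www. and ends at the
# next whitespace, quote or angle bracket. Note the escaped dot in "www\.".
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)

def extract_urls(text):
    """Return the plain-text URLs in text, trimming trailing punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_RE.findall(text)]

sample = "See http://www.example.com, http://example.com and www.example.com."
print(extract_urls(sample))
# ['http://www.example.com', 'http://example.com', 'www.example.com']
```

The trailing-punctuation trim is a design choice for prose input; it would also cut a URL that legitimately ends in one of those characters.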

Replace URLs in text with HTML links, UNLESS already link

I am trying to replace URLs in text with actual HTML links... UNLESS they are already HTML links. I am working on redesigning a forum that allowed users to enter HTML, so savvy users have already typed in full links...
So this should get replaced.
I have found this cool link: http://www.cool-link.com
But this one should be left alone:
I have found this cool link: <a href="http://www.cool-link.com">http://www.cool-link.com</a>
So is there a way to replace "http://..." with a hyperlink, UNLESS said URL is preceded by href=" or sits in between < and >?
EDIT: The various solutions I found online (including on stackoverflow) would replace both of the above, and I cannot for the life of me change the pattern to meet my needs.
Cheers :)
Alix
The following regex should do it:
/(?<!(href=\")|(\>))(https?:\/\/.+\b)/gi
See it working here: https://regexr.com/4hnns
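For anyone trying the same idea in Python, here is a rough sketch. Two things differ from the pattern above and are my own assumptions: Python's re module needs fixed-width lookbehinds, so the alternation is split into two separate lookbehinds, and the greedy .+ is swapped for a character class that stops at whitespace, quotes and angle brackets. The names BARE_URL and linkify are invented for the example.

```python
import re

# Skip a URL that directly follows href=" or a closing ">" (i.e. link text).
# Assumption: a URL ends at whitespace, a double quote or an angle bracket.
BARE_URL = re.compile(r'(?<!href=")(?<!>)https?://[^\s<>"]+', re.IGNORECASE)

def linkify(text):
    """Wrap bare URLs in <a> tags, leaving existing links untouched."""
    return BARE_URL.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )

plain = "I have found this cool link: http://www.cool-link.com"
linked = linkify(plain)
print(linked)
print(linkify(linked) == linked)  # True: a second pass changes nothing
```

This only covers double-quoted href attributes and link text that starts right after the tag; anything messier is probably a job for an HTML parser.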

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a robot against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into a category, you can see the product list in a table format, and you can click the next-page link to loop through all the pages inside that category. Actually, you can only see links to pages 1, 2, 3, 4, 5 and the last page.
The first page in the category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on.
I personally don't have much Java programming experience, and I am wondering whether I can crawl all the product list pages using Nutch and store the HTML for now, and maybe later figure out a way to parse and index the HTML correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how
+^http://([a-z0-9]*\.)*nutch.apache.org/
restricts the crawl to URLs inside the Nutch domain. I would read that regular expression as: between the double slash and nutch, there can be any characters that are alphanumeric, or an asterisk, backslash or dot.)
How can I build the regular expression so that it only scrapes http://www.example.com/.../.../_/N-../...?
(2) I can see that the HTML is stored in the content folder inside the segment. However, when I open that file in vi, it looks like nonsense to me, and I am wondering whether that is the so-called Java serialization, which I would need to deserialize in Java to read it.
Forgive me if those questions are too basic, and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about the regexes in this file: they are matched partially. So if you write a rule like "+ab", it means: accept all URLs that contain "ab". It therefore matches these URLs:
ab
abc
http://ab.com/c.html
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!@=]
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318
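To get a feel for the partial-match behaviour without running a crawl, you can imitate it in a few lines of Python. This is only a sketch of the semantics as I understand them (rules are tried top to bottom, the first one found anywhere in the URL decides, and a URL that hits no rule is dropped); Nutch itself is Java, and RULES and accepts are names I made up.

```python
import re

# The two "+" rules suggested above, as (sign, pattern) pairs.
RULES = [
    ("+", r"www\.example\.com/allproducts"),
    ("+", r"www\.example\.com/level1/level2/_/N-"),
]

def accepts(url, rules=RULES):
    """First rule whose pattern is found anywhere in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):  # search = partial match, not full match
            return sign == "+"
    return False  # assumption: URLs matching no rule are ignored

print(accepts("http://www.example.com/level1/level2/_/N-1/?No=100"))  # True
print(accepts("http://www.example.com/about"))                        # False

# The pattern from the question: ([a-z0-9]*\.)* means "zero or more
# dot-terminated labels", i.e. optional subdomains in front of nutch.
NUTCH = r"^http://([a-z0-9]*\.)*nutch.apache.org/"
print(bool(re.search(NUTCH, "http://wiki.nutch.apache.org/page")))    # True
print(bool(re.search(NUTCH, "http://example.com/")))                  # False
```

So the asterisks in that pattern are quantifiers, not literal characters, and the leading ^ is what anchors the rule to the start of the URL.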

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, a plain url should become <url>, but an existing <url> shouldn't become <<url>>. Also, a link in the markdown can be in the form of (url), so we have to avoid matching URLs wrapped in normal brackets too.
So my working regex for matching the plain text URL in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of URL:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
I wonder what pattern would meet these criteria?
What you have here is a parsing problem. Regexes are fine, but just using regexes here will make it a mess (supposing you achieve it). After you fix this problem, you'll probably find yourself facing other ones, like URL in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split the text into lines and then:
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL-to-link change) only on lines that don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
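The line-by-line idea can be sketched roughly like this in Python (not the parser mentioned above, just an illustration). The reference-definition pattern is the one given in this answer; the code-line check, the URL pattern and the names wrap_urls, REFERENCE_DEF, CODE_LINE and BARE_URL are assumptions for the example.

```python
import re

REFERENCE_DEF = re.compile(r"^\[\d+\]:\s+")  # lines like "[1]: http://..."
CODE_LINE = re.compile(r"^(?: {4}|\t)")      # indented code lines
# A URL not directly preceded by "<" or "(" (i.e. not already wrapped).
BARE_URL = re.compile(r"(?<![<(])\b((?:https?|ftp|file)://[^\s<>()]+)")

def wrap_urls(markdown):
    """Wrap bare URLs in <...>, skipping lines where that would be wrong."""
    out = []
    for line in markdown.split("\n"):
        if REFERENCE_DEF.match(line) or CODE_LINE.match(line):
            out.append(line)  # incompatible pattern: leave the line alone
        else:
            out.append(BARE_URL.sub(r"<\1>", line))
    return "\n".join(out)

text = (
    "Dude, look at this url http://www.google.com .. it's a great search engine\n"
    "[1]: http://slashdot.org"
)
print(wrap_urls(text))
```

Inline code between backticks and URLs followed directly by punctuation are not handled here, which is the kind of follow-up problem that makes a real parser attractive.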

Regex: Match any string but one ending with thanks/

I am attempting to set up a goal funnel in Google Analytics. It is for an online quote request system that we want to track. Basically all the pages that contain the quote request form have unique dynamically generated urls that are similar. The form of the URL is:
/quoterequest/categoryone/categorytwo/productname/
I have a regex that works for tracking that:
^/quoterequest/([A-Za-z0-9/-]+)?
Today we added a thank you page after the user submits the form. The URL is always the same for that:
/quoterequest/thanks/
I would like to modify the above regex so that it continues to match any of the Quote Request URLs, but NOT that thank you URL. I have been trying different variations, including the negative lookahead, but unfortunately I am not very experienced with regex and I think I've been doing it completely incorrectly. Can anyone give me some insight into the correct way of doing this?
You can use:
^\/quoterequest\/(?!thanks\/?$)(?:([A-Za-z0-9\-]+)\/?)*$
See it
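A quick way to check what the negative lookahead does is to run the pattern against a few paths. Below is a small Python sketch; the pattern is the one above with the unnecessary escapes and the inner capture group dropped, and QUOTE_URL and is_quote_request are names invented for the test.

```python
import re

# (?!thanks/?$) rejects the path only when "thanks" (with an optional
# trailing slash) is everything that follows /quoterequest/.
QUOTE_URL = re.compile(r"^/quoterequest/(?!thanks/?$)(?:[A-Za-z0-9-]+/?)*$")

def is_quote_request(path):
    """True for quote request pages, False for the thank-you page."""
    return QUOTE_URL.match(path) is not None

for path in (
    "/quoterequest/categoryone/categorytwo/productname/",
    "/quoterequest/thanks/",
    "/quoterequest/thanks",
    "/quoterequest/thanks/more/",
):
    print(path, is_quote_request(path))
```

Only the two thank-you variants come back False; /quoterequest/thanks/more/ still matches because the lookahead is anchored with $.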

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is, he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search and replace using regular expressions, but I am unfamiliar with regex myself. How might I catch the URL in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny
Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to http://www.youtube.com/watch?v=JaNH56Vpg-A?
This works as long as there is whitespace between different shortcodes.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to handle every service at once, since their shortcodes seem to differ. For Vimeo this would work. It allows any amount of whitespace between "vimeo" and the URL. And again it needs whitespace after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
Maybe there's a more robust way to write the expression (one which validates the syntax). This one's pretty straightforward, though.
The actual regex syntax depends on the language used. Hope this helps.
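As an example of the syntax difference, here are the same two patterns tried out in Python, where the replacement is written \1 instead of $1. This is just a sketch to test the expressions outside the plugin; strip_shortcodes and the sample Vimeo URL are made up.

```python
import re

YOUTUBE = re.compile(r"\[youtube=(\S*)\]")
VIMEO = re.compile(r"\[vimeo\s+(\S*)\]")

def strip_shortcodes(text):
    """Replace [youtube=URL] and [vimeo URL] shortcodes with the bare URL."""
    text = YOUTUBE.sub(r"\1", text)  # \1 is Python's spelling of $1
    return VIMEO.sub(r"\1", text)

post = "Intro [youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A] outro"
print(strip_shortcodes(post))
# Intro http://www.youtube.com/watch?v=JaNH56Vpg-A outro

# Without whitespace between two shortcodes the greedy \S* runs to the last "]":
print(strip_shortcodes("[youtube=a][youtube=b]"))
# a][youtube=b
```

Swapping (\S*) for ([^\]\s]*) should remove the whitespace requirement, since the capture can then no longer run past a closing bracket.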