Regex to find a web address - regex

I'm trying to isolate links from html using a regex and the one I found that is suppose to do it doesn't seem to work.
/^(http?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Am I missing something? I'm using Brackets as my text editor

^(?:http|https):\/\/(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%#!\-\/\(\)]+))?$
Messy, but works.
Also, you might want to look at a similar question: Regex expression for valid website link
Hope this helps :)

It is hard to make it 100% accurate.
A url could also be a IP address for example.
http://ip/
It can contain query strings.
http://www.google.com/?a=1&b=2
It can contain spaces.
http://www.google.com/this is my url/
It depends on what need you have for accuracy.

Related

Regex: How to parse images not only the <img> tags

For the last whole week I have been busting my head to find a Regular Expression which could parse all the images in a html source file. I know that there are many out there but mainly they parse the tags. The tricky part is sometimes the images are in javascript and sometimes they have weird long formats such as :
http://pinterest.com/pin/create/button/?url=http://www.designscene.net/2015/07/binx-walton-josephine-le-tutour-vera-wang.html&media=http://www.designscene.net/wp-content/uploads/2015/07/Vera-Wang-Fall-Winter-2015-Patrick-Demarchelier-03-620x806.jpg&description=Binx Walton and Josephine Le Tutour for Vera Wang FW15
I have tried negative look heads and booleans but could not find a good solution. Please give me a perspective.
Well as you said there are many ways to do that and to be honest there is not a regex solution which could parse all the html files out there.. I have tried it in the past as well. For me the below worked the best :
/(?:.(?!http|\,))+(\.jpg|\.png)
A bit of an explanation:
/......(.jpg|.png) starts from the first slash it finds until finds an image ext
. any char between the slash and the ext
(?:.(?!http|\,))+ omit if there is http or , in it (works like a charm for the example link you have given
Hope it helps, regex is a very complex world. You can write the same exp in so many different ways. May be there is a better solution then I suggest.
Would this help?
https://regex101.com/r/jP4tV7/4
(http[^&"']+(?:jpg|gif|jpeg|png))(?:\&|'|")
You should be able to search for any url that ends in an image-extension. This quick and dirty expression should do it
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})[\/\w \.-]*(jpg|png|gif|jpeg|tif|tiff)
Available at: http://regexr.com/3bg8o

Regular expression to exclude local addresses

I'm trying to configure my Foxy Proxy program and one of the features is to provide a regular expression for an exclusion list.
I'm trying to blacklist the local sites (ending in .local), but it doesn't seem to work.
This is what I attempted:
^(?:https?://)?\d+\.(?!local)+/.*$
^(?:https?://)?\d+\.(?!local)(\d)+/.*$
I also researched on Google and Stack Exchange with no success.
Since you indicate in the comments that you actually need a whitelist solution, I went with that:
Try: ^(?:https?://)?[\w.-]+\\.(?!local)\w+/.*$
http://regex101.com/r/xV4gS0
Your regex expressions match host names which start with a series of digits followed by a period and then not followed by the string "local". If this is a "blacklist", then that hardly seems like what you want.
If you're trying to match all hostnames which end in .local, you'd want something like the following for the hostname portion:
[^/]*\.local(?:/|$)
with appropriate escapes inserted depending on regex context.
If your original question was incorrect and you really need a whitelist, then you'd want something like:
^(?:(?!\.local)[^\/])*(?:\/|$)
as illustrated in http://regex101.com/r/yB0uY4
Thank you everyone to help. Indeed, it turns out that for this program, enlisting "not .local" as blacklist, it's not the same as "all .local" as whitelist.
I also had a rookie mistake on my pattern. I meant "\w" instead of "\d". Thank you Peter Alfvin for catching that.
So my final working solution is what Bart suggested:
^(?:https?://)?[\w.-]+\.(?!local)\w+/.*$ as a whitelist.

Regex help: Identifying websites in text

I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites<- function(text){
text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
return(text)
}
This handles a large set of the problem, but not a popular one, i.e something of the form xyz.com
I do not wish to add .com at the end of the above regex, as it limits the scope of that regex. However I tried writing some more regexex like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also modified email ids of the form abc#xyz.com to abc#. I don't want this, so I modified it to
gsub("*((^#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com
I understand that I need some sort of a set difference here, of the form of what was explained here but I was not able to implement it (mainly because I was not able to completely understand it). Any idea on how I go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!#)[^(?!.*#)]*.com",'',testset[10])
I got a 'invalid regex' error. I believe a little help in correcting may get this to work...
I can't believe it. There actually is a simple solution to it.
gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((.com)|(.net)|(.org)|(.info))",' ',text)
This works by:
Start with a space.
Put all sorts of things, except an '#' in.
end with a .com/net/org/info/
Please do look into breaking it! I'm sure there will be cases that will break this as well.
your lookarounds look a bit funny to me: you cant look behind inside a character class and why are you looking ahead? A look behind is imho more appropriate.
I think the following expression should work, although i didn't test it:
gsub("*((?<!#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
also note that lookbehinds must have a fixed length, so no multipliers are allowed

RegEx all URLs that do NOT contain a string

I seem to be having a bit of a brain fart atm. I've got Google counting my transitions correctly but I'm getting false positives.
This is the current goal RegEx which works great.
^/click/[0-9]+\.html\?.*
But I also want it the RegEx to NOT county anything that has &confirm=1 I'm quite stuck as to how to do that in the RegEx, I thought I might be able to use [^(?:&confirm=1)] but I don't think that's valid.
Use "exclude", not "include" filter option
Try this:
^/click/[0-9]+\.html\?(?!.*\bconfirm=1).*
I changed it slightly so it will still exclude if confirm=1 is the first param (preceded by the ? rather than &)
I'm afraid you can't... I've tried doing this before, what I found was that you used to be able to do this with negative lookahead (see Rubens), but Google Analytics stopped supporting this at some point (source: http://productforums.google.com/forum/#!topic/analytics/3YnwXM0WYxE).
Maybe I'm a little late.
What about just writing :
[^(&confirm=1)]
?

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs assuming the html isn't ladened with quotes and such. Note that the + will mean that empty <b> tags are ignored. Also, html is not truly passable via regex, so this will only work for basic tags. It should work even if the tag has an ID or a class property, but there are absolutely manners to break this regex.
/<b[^>]*>([^<]+)<\/b>/
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and have G flag checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.