RegEx: replace all URLs that are not anchored

I'm trying to replace URLs contained in a block of HTML that users post into an old web app with proper anchors (<A>) for those URLs.
The problem is that a URL may already be 'anchored', that is, contained in an <A> element. Those URLs should not be replaced.
Example:
<a href="http://noreplace.com">http://noreplace.com</a> <- do not replace
<a href="http://noreplace.com"><u>http://noreplace.com</u></a> <- do not replace
...http://replace.com <- replace
What would the regex to match only 'not anchored Urls' look like?
I use the following function to replace with RegEx:
Function ReplaceRegExp(strString, strPattern, strReplace)
    Dim RE: Set RE = New RegExp
    With RE
        .Pattern = strPattern
        .IgnoreCase = True
        .Global = True
        ReplaceRegExp = .Replace(strString, strReplace)
    End With
End Function
The following non greedy regex is used to format UBB URLs. Can this regex be adapted to match only the ones I need?
' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "$6")
If this really cannot be done with a regex, what would be the solution in Classic ASP? Some code or pseudocode would be appreciated. However, I would much rather keep the code simple by adding one more regex line than add extra functions to this old code.
Thanks for your effort!

Regular expressions seemed too complex for this kind of job, so I fell back on my rusty VBScript skills and wrote a function that first takes the existing anchors out (swapping them for placeholders), then replaces the URLs, and finally puts the anchors back.
Here it is, in case somebody needs it:
Function Linkify(Text)
    Dim regEx, Match, Matches, patternURLs, patternAnchors
    Dim lCount, anchorCount, replacements, key
    patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
    patternAnchors = "<a[^>]*?>.*?</a>"
    Set replacements = Server.CreateObject("Scripting.Dictionary")
    ' Create the regular expression.
    Set regEx = New RegExp
    regEx.Pattern = patternAnchors
    regEx.IgnoreCase = True
    regEx.Global = True
    ' Search for the existing anchors.
    Set Matches = regEx.Execute(Text)
    lCount = 0
    ' Replace each existing anchor with a numbered placeholder.
    For Each Match In Matches
        key = "<#" & lCount & "#>"
        replacements.Add key, Match.Value
        Text = Replace(Text, CStr(Match.Value), key)
        lCount = lCount + 1
    Next
    anchorCount = lCount
    ' Now search for the remaining (bare) URLs.
    regEx.Pattern = patternURLs
    ' Create anchors from those URLs ($1 is the whole URL).
    Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")
    ' Put back the originally existing anchors.
    For lCount = 0 To anchorCount - 1
        key = "<#" & lCount & "#>"
        Text = Replace(Text, key, replacements.Item(key))
    Next
    Linkify = Text
End Function
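For comparison, the same protect / linkify / restore idea can be sketched in JavaScript. This is only an illustration under assumptions: the function name linkify and the simplified URL pattern are made up for this sketch, the <#n#> placeholder format mirrors the VBScript above, and the input is assumed not to contain text that already looks like a placeholder.

```javascript
// Sketch of the "protect anchors, linkify, restore" approach.
// The URL pattern is deliberately simple and is an assumption, not the original one.
function linkify(text) {
  const anchors = [];
  // 1. Swap every existing <a>...</a> for a numbered placeholder.
  const protectedText = text.replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, (anchor) => {
    anchors.push(anchor);
    return `<#${anchors.length - 1}#>`;
  });
  // 2. Wrap the remaining bare URLs in anchors.
  const linked = protectedText.replace(
    /\b(?:https?|ftp):\/\/[^\s<>"]+/gi,
    (url) => `<a href="${url}">${url}</a>`
  );
  // 3. Put the original anchors back, looking each one up by its number.
  return linked.replace(/<#(\d+)#>/g, (_, i) => anchors[Number(i)]);
}

console.log(linkify('see http://replace.com and <a href="http://noreplace.com">http://noreplace.com</a>'));
```

Using a callback for the restore step avoids the problem of a replacement string being interpreted for $ sequences, and looking anchors up by index means identical anchors are handled one by one.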

The answer you're looking for lies in negative and positive lookaheads and lookbehinds.
This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html
Here's the Regular Expression I've formulated for your case:
(?<!"|>)(ht|f)tps?://.*?(?=\s|$)
Here's some sample data I matched against:
#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com
#No Matches
<u>http://www.website.com</u>
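<a href="http://website.com">http://website.com</a>
<a href="http://website.com">http://website.com</a>
<u>http://www.website.com</u>
<a href="ftp://www.website.com">ftp://www.website.com</a>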
Here's a breakdown of what the regular expression is doing:
(?<!"|>)
A negative look behind, making sure what matches next isn't preceded by a " or >
(ht|f)tps?://.*?
This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use (https?|ftp)://.*? instead
(?=\s|$)
This is a positive lookahead, which requires the match to be followed by whitespace or by the end of the string (it does not consume that character).
EXTRA CREDIT
(ht)?(?(1)tps?|ftp)://
This matches http, https, and ftp but not ftps. It may be a bit overkill when you can use (https?|ftp)://, but it's an awesome example of if/else in a regex.
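One caveat for the original Classic ASP setting: the VBScript RegExp object supports lookahead but not lookbehind, so the lookbehind pattern above needs an engine that has it, such as .NET or a modern JavaScript runtime. The (?(1)...) conditional is likewise not available in JavaScript. The sketch below runs the lookbehind pattern against two of the sample lines; the names UNANCHORED_URL and findUnanchoredUrls are made up for this example.

```javascript
// Pattern from the answer above, written as a JavaScript regex literal.
// Requires an engine with lookbehind support (ES2018 or later).
const UNANCHORED_URL = /(?<!"|>)(ht|f)tps?:\/\/.*?(?=\s|$)/g;

function findUnanchoredUrls(text) {
  // With the g flag, String.prototype.match returns all full matches, or null.
  return text.match(UNANCHORED_URL) || [];
}

console.log(findUnanchoredUrls("This is a link http://www.website.com that is not linked"));
console.log(findUnanchoredUrls('<a href="http://website.com">http://website.com</a>'));
```

The first call should report the bare URL and the second should report nothing, because one occurrence is preceded by a double quote and the other by >.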

Some design issues you're going to have to work around:
Embedded URLs could be absolute or relative and may not include the protocol.
Your HTML may not have quotes around attribute values.
The character right after a URL may also be a valid URL character.
There are lots of valid URL characters these days.
If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.
Here's an overly-simple example to start with (untested):
(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$) replaced with <a href="$1">$1</a>
The [^\s<>] part above is ridiculously greedy, so all of the fun will be in tweaking that to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.
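A minimal sketch of that replacement in JavaScript, assuming the character class is meant to repeat (a + quantifier after [^\s<>]) and that the replacement wraps the captured URL in an anchor. The function name linkifyBareUrls is hypothetical, and the pattern needs an engine with lookbehind support.

```javascript
function linkifyBareUrls(html) {
  // Group 1 is the whole URL; the lookbehind skips URLs that directly follow a double quote,
  // and the lookahead requires whitespace or the end of the input after the URL.
  const bareUrl = /(?<!")((?:https?|ftp):\/\/[^\s<>]+)(?=\s|$)/g;
  return html.replace(bareUrl, '<a href="$1">$1</a>');
}

console.log(linkifyBareUrls("go to http://replace.com now"));
console.log(linkifyBareUrls('<a href="http://x.com">link</a>'));
```

Under the four assumptions listed above this leaves quoted href values alone. It is still fragile: an existing anchor whose visible text is a URL followed by a space would be wrapped a second time, which is exactly the kind of case the character class and lookarounds need tuning for.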

Related

RegEx specific numeric pattern in Excel VBS

I do not have much RegEx experience and need advice to create a specific Pattern in Excel VBA.
The Pattern I want to match on to validate a Userform field is: nnnnnn.nnn.nn where n is a 0-9 digit.
My code looks like this, but RegEx.Test always returns False.
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
With RegEx
.Pattern = "/d/d/d/d/d/d\./d/d/d\./d/d"
End With
If RegEx.Test(txtProjectNumber.Value) = False Then
txtProjectNumber.SetFocus
bolAllDataOK = False
End If
Try this. You need to match the whole contents of the textbox (I assume) so use anchors (^ and $).
Your slashes were the wrong way round. Also you can use quantifiers to simplify the pattern.
Private Sub CommandButton1_Click()
    Dim RegEx As Object, bolAllDataOK As Boolean
    Set RegEx = CreateObject("VBScript.RegExp")
    With RegEx
        ' ^ and $ force the whole textbox value to match.
        .Pattern = "^\d{6}\.\d{3}\.\d{2}$"
    End With
    If Not RegEx.Test(txtProjectNumber.Value) Then
        txtProjectNumber.SetFocus
        bolAllDataOK = False
    End If
End Sub
VBA has its own built-in alternative, the Like operator. So besides the fact that you used forward slashes instead of backslashes (as @SJR rightly mentioned), you could have a look at something like:
If txtProjectNumber.Value Like "######.###.##" Then
Here # stands for any single digit (0–9). Though not as versatile as regular expressions, it seems to do the trick for you, and you won't need an external reference or an extra object.
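To see what the anchors and quantifiers buy you, here is the same regex pattern in a small JavaScript sketch. JavaScript is used only because it is easy to run, and the helper name isProjectNumber is made up; the pattern itself is the one from the answer above.

```javascript
function isProjectNumber(value) {
  // Six digits, a dot, three digits, a dot, two digits, and nothing else.
  return /^\d{6}\.\d{3}\.\d{2}$/.test(value);
}

console.log(isProjectNumber("123456.789.01"));
console.log(isProjectNumber("12345.789.01"));
console.log(isProjectNumber("x123456.789.012"));
```

Without ^ and $ the last value would pass, because it merely contains a valid number somewhere inside it.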

Regex to extract second word from URL

I want to extract the second word from my URL.
Examples:
/search/acid/all - extract acid
/filter/ion/all/sss - extract ion
I tried a few patterns, such as
/.*/(.*?)/
but no luck.
A couple of things:
The forward slashes / have to be escaped like this: \/
The (.*?) is lazy: it matches as few characters as possible, including zero characters.
The .* is greedy: it takes as many characters as it can, including forward slashes, so on a longer path such as /filter/ion/all/sss the group captures all instead of ion.
A simple solution will be:
/.+?\/(.*?)\//
Update:
Since you are using JavaScript, try the following code:
var url = "/search/acid/all";
var regex = /.+?\/(.*?)\//g;
var match = regex.exec(url);
console.log(match[1]);
The variable match is an array. Its first element is the full match (everything that was matched); you can ignore that, since you are interested in the specific group we wanted to capture (the part we put in parentheses in the regex).
This regex will do the trick:
(?:[^\/]*.)\/([^\/]*)\/
I had difficulties with the above answers for URLs without a trailing forward slash:
/search/acid/all/ /* works */
/search/acid /* doesn't work */
To extract the second word from both urls, what worked for me is
var url = "/search/acid";
var regex = /(?:[^\/]*.)\/([^\/]*)/g;
var match = regex.exec(url);
console.log(match[1]);
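Both cases can also be handled by anchoring the pattern at the start of the path instead of relying on a trailing slash. A sketch, assuming the path always begins with /; the helper name secondSegment is hypothetical and the pattern is a variant, not one taken from the answers above.

```javascript
function secondSegment(path) {
  // ^\/[^\/]+ skips the first segment; the group captures the second one.
  const match = /^\/[^\/]+\/([^\/]+)/.exec(path);
  // exec returns null when the path has fewer than two segments.
  return match ? match[1] : null;
}

console.log(secondSegment("/search/acid/all"));
console.log(secondSegment("/search/acid"));
```

Checking for null before reading match[1] avoids a TypeError on inputs such as /search, which have no second segment.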

ColdFusion - How to get only the URL's in this block of text?

How can I extract only the URL's from the given block of text?
background(http://w1.sndcdn.com/f15ikDS9X_m.png)
background-image(http://w1.sndcdn.com/5ikDIlS9X_m.png)
background('http://w1.sndcdn.com/m1kDIl9X_m.png')
background-image('http://w1.sndcdn.com/fm15iIlS9X_m.png')
background("http://w1.sndcdn.com/fm15iklS9X_m.png")
background-image("http://w1.sndcdn.com/m5iIlS9X_m.png")
Perhaps Regex would work, but I'm not advanced enough to work it out!
Many thanks!
Mikey
You're over-thinking the problem - all you need to do is match the URLs, which is a simple match:
rematch('\bhttps?:[^)''"]+',input)
That'll work based on the input provided - might need tweaking if different input used.
(e.g. You can optionally add a \s into the char class if that might be a factor.)
The regex itself is simple:
\bhttps?: ## look for http: or https: with no alphanumeric chars beforehand.
[^)'"]+ ## match characters that are NOT ) or ' or "
## match as many as possible, at least one required.
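For anyone who wants to try the pattern outside ColdFusion, here is a small JavaScript sketch of the same regex. The helper name extractUrls is an assumption, and only three of the sample lines are used.

```javascript
function extractUrls(css) {
  // Same pattern as above: http: or https: at a word boundary,
  // then everything up to the next ), ' or ".
  return css.match(/\bhttps?:[^)'"]+/g) || [];
}

const sample = [
  "background(http://w1.sndcdn.com/f15ikDS9X_m.png)",
  "background-image('http://w1.sndcdn.com/fm15iIlS9X_m.png')",
  'background("http://w1.sndcdn.com/fm15iklS9X_m.png")',
].join("\n");

console.log(extractUrls(sample));
```

Note that the negated character class also matches line breaks, so the pattern depends on every URL being closed by a parenthesis or quote.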
If this is matching false positives, you can of course look for a more refined URL regex.
background(?:-image)?\((["']?)(?<url>http.*)\1\)
Explanation:
background(?:-image)? -> matches background or background-image (non-capturing group)
\( -> matches a literal opening parenthesis
(["']?) -> captures an optional ' or " before the URL (group 1)
(?<url>http.*) -> matches the URL and captures it in the named group url
\1\) -> matches whatever group 1 captured (the same quote, or nothing), then a literal closing parenthesis
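A sketch of this pattern in JavaScript, which supports both named groups and numbered backreferences. Whether a given ColdFusion regex function accepts the (?<url>...) syntax is something to check before relying on it. The helper name extractBackgroundUrls is made up, and the sketch assumes one declaration per line, because http.* is greedy.

```javascript
function extractBackgroundUrls(css) {
  // Group 1 is the optional quote; \1 requires the same quote (or nothing) before the ).
  const pattern = /background(?:-image)?\((["']?)(?<url>http.*)\1\)/g;
  // matchAll needs the g flag; each match exposes named groups under .groups.
  return Array.from(css.matchAll(pattern), (m) => m.groups.url);
}

const declarations = [
  "background(http://w1.sndcdn.com/f15ikDS9X_m.png)",
  "background('http://w1.sndcdn.com/m1kDIl9X_m.png')",
  'background-image("http://w1.sndcdn.com/m5iIlS9X_m.png")',
].join("\n");

console.log(extractBackgroundUrls(declarations));
```

The backreference is what keeps the closing quote out of the captured URL: the greedy http.* first runs to the end of the line and then backs off until the quote and the parenthesis can match.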
If you want an answer without regular expressions, something like this will work.
YourString = "background(http://w1.sndcdn.com/f15ikDS9X_m.png)";
YourString = ListLast(YourString, "("); // yields http://w1.sndcdn.com/f15ikDS9X_m.png)
YourString = replace(YourString, ")", ""); // http://w1.sndcdn.com/f15ikDS9X_m.png
Since you are doing it more than once, you can make it a function. Also, you might need some other replace commands to handle the quotes that are in some of your strings.
Having said all that, getting a regex to work would be better.

Extract pattern from string, with special characters, using Regular Expressions

I am trying to use a regex in VB.NET - the language probably shouldn't matter though - I am trying to extract something reasonable out of a very large file name, "\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html"
I would like to extract, from that whole mess, the "item_123_456"
It seems to make sense that I can get everything before a pattern like ".html" , and from it, everything after the last dot ?
I have tried to get at least the last part (the entire string before .html) and I still get no matches:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\\.html")
matches = regexStuff.Matches(strINeed)
Dim successfulMatch As Match
For Each successfulMatch In matches
strFound = successfulMatch.Value
Next
The match I experimented with, hoping I might even get everything between a dot and an .html: Regex("\\..*\\.html") returned Nothing as well.
I just can't get regular expressions to work...
.*\.(.*?)\.html
This finds as many characters as possible (.*) until it comes to a dot, followed by as few characters as possible, followed by .html: \.(.*?)\.html
It places the text between that dot and the .html into a capturing group, available as $1. Your loop looked okay, but the pattern did not: VB.NET string literals do not treat the backslash as an escape character, so "\\." means a literal backslash followed by any character rather than a literal dot. If you need the VB.NET code for that I can likely get that as well.
Your vb code should look something like this:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\.(.*?)\.html")
matches = regexStuff.Matches(strINeed)
strFound = matches.Item(0).Groups(1).Value.ToString
It could probably be generalized into this
[^.\\]+\.html
Edit: or, initial dot required
\.[^.\\]+\.html
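The patterns above are easy to check outside VB.NET. Here is a JavaScript sketch using the file name from the question; the helper name extractItemName is hypothetical, and String.raw is used so that the backslashes in the path are taken literally.

```javascript
function extractItemName(fileName) {
  // Greedy .* runs to the last dot before ".html"; the lazy group captures what lies between.
  const match = /.*\.(.*?)\.html/.exec(fileName);
  return match ? match[1] : null;
}

const fileName = String.raw`\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html`;

console.log(extractItemName(fileName));
// The generalized pattern returns the name together with its extension.
console.log(/[^.\\]+\.html/.exec(fileName)[0]);
```

The first pattern returns only the captured group, while the generalized one matches the name including .html, so it needs an extra step if the extension is not wanted.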

RegEx pattern to extract URLs

I have to extract everything between these characters:
<a href="/url?q=(text to extract whatever it is)&amp
I tried this pattern, but it's not working for me:
/(?<=url\?q=).*?(?=&amp)/
I'm programming in Vb.net, this is the code, but I think that the problem is that the pattern is wrong:
Dim matches As MatchCollection
matches = regex.Matches(TextBox1.Text)
For Each Match As Match In matches
listbox1.items.add(Match.Value)
Next
Could you help me please?
Your regex seems to be correct except for the slashes (/) at the beginning and end of the expression. Remove them:
Dim regex = New Regex("(?<=url\?q=).*?(?=&amp)")
and it should work.
Some utilities and languages use / (forward slash) to delimit the search expression; others may use single quotes. With System.Text.RegularExpressions.Regex you don't need delimiters.
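The lookaround pattern itself can be checked in any engine that supports lookbehind. A JavaScript sketch, where the slashes are only the literal delimiters and not part of the pattern; the helper name extractQueryTargets and the example.com and example.org URLs are made up for illustration.

```javascript
function extractQueryTargets(html) {
  // (?<=url\?q=) requires "url?q=" just before the match;
  // .*? then stops at the first position followed by "&amp".
  return html.match(/(?<=url\?q=).*?(?=&amp)/g) || [];
}

const html =
  '<a href="/url?q=http://example.com/page&amp;sa=U">one</a> ' +
  '<a href="/url?q=http://example.org/&amp;sa=U">two</a>';

console.log(extractQueryTargets(html));
```

Because both lookarounds are zero-width, each match contains only the text between the two markers, which is what Match.Value returns in the VB.NET loop.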
The regex below extracts all URLs from your text (or any other):
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?