RegEx pattern to extract URLs

I need to extract everything between these characters:
<a href="/url?q=(text to extract whatever it is)&amp
I tried this pattern, but it's not working for me:
/(?<=url\?q=).*?(?=&amp)/
I'm programming in VB.NET. This is the code, but I think the problem is that the pattern is wrong:
Dim matches As MatchCollection
matches = regex.Matches(TextBox1.Text)
For Each Match As Match In matches
listbox1.items.add(Match.Value)
Next
Could you help me please?

Your regex looks correct except for the slashes (/) at the beginning and end of the expression. Remove them:
Dim regex = New Regex("(?<=url\?q=).*?(?=&amp)")
and it should work.
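Some tools and languages (Perl, JavaScript and sed, for example) use / (forward slash) to delimit the search expression; others use quotes. With System.Text.RegularExpressions.Regex the pattern is an ordinary string, so you don't need delimiters, and any slashes you add are treated as literal characters to match.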
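For comparison, the same lookbehind/lookahead pattern can be tried outside .NET. The following is a minimal Python sketch, not the asker's code: the function name extract_targets and the sample HTML are made up for illustration, and the pattern is passed as a plain string without / delimiters, just as in the corrected VB.NET line.

```python
import re

# Same pattern as in the answer: everything after "url?q=" up to the next "&amp".
# The lookbehind and lookahead are not part of the returned text.
TARGET = re.compile(r"(?<=url\?q=).*?(?=&amp)")

def extract_targets(html):
    """Return every substring found between 'url?q=' and '&amp'."""
    return TARGET.findall(html)

# Hypothetical sample input resembling the markup in the question
sample = '<a href="/url?q=http://example.com/page&amp;sa=U">Example</a>'
targets = extract_targets(sample)
```

If the pattern were written with the surrounding slashes, the regex would look for a literal "/" before the lookbehind and find nothing.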

This regex code below will extract all urls from your text (or any other):
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
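As a rough illustration of how this general URL pattern behaves, here is a small Python sketch. The function name find_urls and the sample sentence are assumptions for the example. Because the pattern contains capturing groups, the sketch reads the whole match from each match object rather than using findall, which would return only the groups.

```python
import re

# The pattern from the answer, copied unchanged and split over two lines.
URL = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?"
)

def find_urls(text):
    """Return the full text of every URL match in `text`."""
    return [m.group(0) for m in URL.finditer(text)]

# Hypothetical sample text; trailing punctuation is left out of the matches
# because the last character class excludes '.' and ','.
found = find_urls("see http://example.com/a?b=1, and ftp://files.example.org.")
```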

Related

Regular expression formatting issue

I'm VERY new to using regular expressions, and I'm trying to figure something simple out.
I have a simple string, and I'm trying to pull out the 590111 and place it into another string.
HMax_590111-1_v8980.bin
So the new string would simply be...
590111
The part number will ALWAYS have 6 digits, and there will ALWAYS be a version and such. The part number might change location inside the string, so it needs to work if it's like this:
590111-1_v8980_HMXAX.bin
What regular expression will do this? Currently, I'm using ^[0-9]* to find it if it's at the front of the file name.
Try the following Regex:
Dim text As String = "590111-1_v8980_HMXAX.bin"
Dim pattern As String = "\d{6}"
'Instantiate the regular expression object.
Dim r As Regex = new Regex(pattern, RegexOptions.IgnoreCase)
'Match the regular expression pattern against a text string.
Dim m As Match = r.Match(text)
In a regex, \d matches a single digit, so you start with \d.
Since you know the number has a fixed length, you can specify the repeat count with "{}". \d{6} means exactly 6 consecutive digits.
I would recommend using this site to try your own expressions. It also shows a little information about each part of the expression you are building when you hover over it.
Regex Tester
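The same idea can be sketched in Python. The function names below are hypothetical. The second function is an assumption beyond the original answer: it adds lookarounds so that a run of more than six digits is not mistaken for a part number, which plain \d{6} would allow.

```python
import re

def find_part_number(name):
    """Return the first run of six digits in `name`, or None."""
    m = re.search(r"\d{6}", name)
    return m.group(0) if m else None

def find_part_number_strict(name):
    """Like find_part_number, but the six digits must not touch other digits."""
    m = re.search(r"(?<!\d)\d{6}(?!\d)", name)
    return m.group(0) if m else None

first = find_part_number("HMax_590111-1_v8980.bin")
second = find_part_number("590111-1_v8980_HMXAX.bin")
```

Neither version depends on where the part number sits in the file name, which is what the ^ anchor in the asker's attempt prevented.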

Regular expression to find specific text within a string enclosed in two strings, but not the entire string

I have this type of text:
string1_dog_bit_johny_bit_string2
string1_cat_bit_johny_bit_string2
string1_crocodile_bit_johny_bit_string2
string3_crocodile_bit_johny_bit_string4
string4_crocodile_bit_johny_bit_string5
I want to find all occurrences of “bit” that occur only between string1 and string2. How do I do this with regex?
I found the question Regex Match all characters between two strings, but the regex there matches the entire string between string1 and string2, whereas I want to match just parts of that string.
I am doing a global replacement in Notepad++. I just need regex, code will not work.
Thank you in advance.
Roman
If I understand correctly, here is some code that does what you want:
var intput = new List<string>
{
"string1_dog_bit_johny_bit_string2",
"string1_cat_bit_johny_bit_string2",
"string1_crocodile_bit_johny_bit_string2",
"string3_crocodile_bit_johny_bit_string4",
"string4_crocodile_bit_johny_bit_string5"
};
Regex regex = new Regex(@"(?<bitGroup>bit)");
var allMatches = new List<string>();
foreach (var str in intput)
{
if (str.StartsWith("string1") && str.EndsWith("string2"))
{
var matchCollection = regex.Matches(str);
allMatches.AddRange(matchCollection.Cast<Match>().Select(match => match.Groups["bitGroup"].Value));
}
}
Console.WriteLine("All matches {0}", allMatches.Count);
This regex will do the job:
^string1_(?:.*(bit))+.*_string2$
^ means the start of the text (or line if you use the m option like so: /<regex>/m )
$ means the end of the text
. means any character
* means the previous character/expression is repeated 0 or more times
(?:<stuff>) means a non-capturing group (<stuff> won't be captured as a result of the matching)
You could use ^string1_(.*(bit).*)*_string2$ if you don't care about performance or don't have large/many strings to check. The outer parentheses allow multiple occurrences of "bit".
If you provide us with the language you want to use, we could give more specific solutions.
edit: As you added that you're trying a replacement in Notepad++ I propose the following:
Use (?<=string1_)(.*)bit(.*)(?=_string2) as the regex and $1xyz$2 as the replacement pattern (replace xyz with your string). Then perform a "replace all" operation repeatedly until N++ doesn't find any more matches. The problem here is that this regex matches only one "bit" per line per pass, so it has to be applied repeatedly.
Btw. even if a regexp matches the whole line, you can still only replace parts of it using capturing groups.
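The repeated "replace all" procedure can be sketched in Python to show why several passes are needed. This is only an illustration of the loop under stated assumptions, not something Notepad++ runs: the function name replace_bits is made up, and the guard against a replacement containing "bit" is an addition of this sketch.

```python
import re

# Pattern from the answer: one "bit" somewhere between string1_ and _string2.
PATTERN = re.compile(r"(?<=string1_)(.*)bit(.*)(?=_string2)")

def replace_bits(text, replacement):
    """Replace every 'bit' that sits between string1_ and _string2 on a line."""
    if "bit" in replacement:
        # A replacement containing "bit" would be matched again on the next pass.
        raise ValueError("replacement must not contain 'bit'")
    while True:
        # Each pass rewrites at most one "bit" per line (the last one,
        # because the first group is greedy).
        text, count = PATTERN.subn(
            lambda m: m.group(1) + replacement + m.group(2), text
        )
        if count == 0:
            return text

lines = (
    "string1_dog_bit_johny_bit_string2\n"
    "string3_crocodile_bit_johny_bit_string4"
)
result = replace_bits(lines, "xyz")
```

On the first line the loop needs two passes, one per "bit"; the second line is never touched because it does not contain string1_.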
You can use the regex:
(?:string1|\G)(?:(?!string2).)*?\Kbit
regex101 demo. Tried it on notepad++ as well and it's working.
There is a description on the demo site, but if you want more explanation, let me know and I'll elaborate!

Extract pattern from string, with special characters, using Regular Expressions

I am trying to use a regex in VB.NET - the language probably shouldn't matter though - I am trying to extract something reasonable out of a very large file name, "\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html"
I would like to extract, from that whole mess, the "item_123_456"
It seems to make sense that I can get everything before a pattern like ".html" , and from it, everything after the last dot ?
I have tried to get at least the last part (the entire string before .html) and I still get no matches:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\\.html")
matches = regexStuff.Matches(strINeed)
Dim successfulMatch As Match
For Each successfulMatch In matches
strFound = successfulMatch.Value
Next
The match I experimented with, hoping I might even get everything between a dot and an .html: Regex("\\..*\\.html") returned Nothing as well.
I just can't get regular expressions to work...
.*\.(.*?)\.html
This matches as many characters as possible (.*) until it reaches a dot, followed by as few characters as possible, followed by .html: \.(.*?)\.html
It places the text between that dot and the final .html into a capturing group, which should be in $1. If you need the VB.NET code for that I can likely get that as well, but your code looked okay apart from the doubled backslashes: VB.NET string literals do not treat \ as an escape character, so "\\." asks for a literal backslash followed by any character.
Your vb code should look something like this:
Dim matches As MatchCollection
Dim regexStuff As New Regex(".*\.(.*?)\.html")
matches = regexStuff.Matches(strINeed)
strFound = matches.Item(0).Groups(1).Value.ToString
It could probably be generalized into this
[^.\\]+\.html
Edit: or, initial dot required
\.[^.\\]+\.html
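To see the capturing group at work, here is a minimal Python sketch using the first pattern from this answer. The function name extract_item is hypothetical. The raw string (r"...") plays the same role as a VB.NET string literal here: backslashes reach the regex engine unchanged.

```python
import re

def extract_item(path):
    """Return the text between the last two dots of a name ending in .html."""
    m = re.search(r".*\.(.*?)\.html", path)
    # group(0) would be the whole match; group(1) is only the captured part.
    return m.group(1) if m else None

# The file name from the question
path = r"\\path\path\path.path.path\path\some_more_stuff_from a name.item_123_456.html"
item = extract_item(path)
```

The greedy .* first runs to the end of the string and then backs off until a dot is found that leaves a dot-free chunk directly before .html.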

Need to extract text from within first curly brackets

I have strings that look like this
{/CSDC} CHOC SHELL DIP COLOR {17}
I need to extract the value in the first pair of curly brackets. In the above example it would be
/CSDC
So far I have this code, which is not working:
Dim matchCode = Regex.Matches(txtItems.Text, "/\{(.+?)\}/")
Dim itemCode As String
If matchCode.Count > 0 Then
itemCode = matchCode(0).Value
End If
I think the main issue here is that you are confusing your regular expression syntax between different languages.
In languages like Javascript, Perl, Ruby and others, you create a regular expression object by using the /regex/ notation.
In .NET, when you instantiate a Regex object, you pass it a string of the regular expression, which is delimited by quotes, not slashes. So it is of the form "regex".
So try removing the leading and trailing / from your string and see how you go.
This may not be the whole problem, but it is at least part of it.
Are you getting the whole string instead of just the 1st value? Regular expressions are greedy by default so .Net is trying to grab the largest matching string.
Try this:
Dim matchCode = Regex.Matches(txtItems.Text, "\{([^}]*)\}")
Dim itemCode As String
If matchCode.Count > 0 Then
itemCode = matchCode(0).Groups(1).Value
End If
Edited: I've tried this in Linqpad and it worked.
It appears you are using a capture group, so try matchCode(0).Groups(1).Value (Groups(0) is the entire match, braces included).
Also, remove the leading / and the trailing / from the pattern.
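The difference between the whole match and the capture group can be shown with a short Python sketch. The function name first_braced is made up for the example, and the pattern with [^}]* inside a group is one reasonable choice rather than the only one.

```python
import re

def first_braced(text):
    """Return the content of the first {...} pair, without the braces."""
    m = re.search(r"\{([^}]*)\}", text)
    return m.group(1) if m else None

sample = "{/CSDC} CHOC SHELL DIP COLOR {17}"
whole = re.search(r"\{([^}]*)\}", sample).group(0)  # entire match, braces included
code = first_braced(sample)                          # capture group only
```

Using [^}]* instead of .+ means the match cannot run past the first closing brace, so no lazy quantifier is needed.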

RegEx : replace all Url-s that are not anchored

I'm trying to replace URLs contained inside an HTML code block that users post into an old web app with proper anchors (<A>) for those URLs.
The problem is that URLs may already be 'anchored', that is, contained in <A> elements. Those URLs should not be replaced.
Example:
<a href="http://noreplace.com">http://noreplace.com</a> <- do not replace
<a href="http://noreplace.com"><u>http://noreplace.com</u></a> <- do not replace
...http://replace.com <- replace
What would the regex to match only 'not anchored Urls' look like?
I use the following function to replace with RegEx:
Function ReplaceRegExp(strString, strPattern, strReplace)
Dim RE: Set RE = New RegExp
With RE
.Pattern = strPattern
.IgnoreCase = True
.Global = True
ReplaceRegExp = .Replace(strString, strReplace)
End With
End Function
The following non greedy regex is used to format UBB URLs. Can this regex be adapted to match only the ones I need?
' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "$6")
If this really cannot be done with RegEx, what would be the solution in ASP Classic, with some code or pseudocode please? However, I would rather keep the code simple by adding one more regex line than add more functions to this old code.
Thanks for your effort!
It seems regular expressions alone are too complex for this kind of job, so I fell back on my rusty VBScript skills and coded a function that first swaps the existing anchors for placeholders, then replaces the URLs, and finally puts the anchors back.
Here it is in case somebody needs it:
Function Linkify(Text)
Dim regEx, Match, Matches, patternURLs, patternAnchors, lCount, anchorCount, replacements, key
patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
patternAnchors = "<a[^>]*?>.*?</a>"
Set replacements=Server.CreateObject("Scripting.Dictionary")
' Create the regular expression.
Set regEx = New RegExp
regEx.Pattern = patternAnchors
regEx.IgnoreCase = True
regEx.Global = True
' Do the search for anchors.
Set Matches = regEx.Execute(Text)
lCount = 0
' Iterate through the existing anchors and replace with a placeholder
For Each Match in Matches
key = "<#" & lCount & "#>"
replacements.Add key, Match.Value
Text = Replace(Text,Cstr(Match.Value),key)
lCount = lCount+1
Next
anchorCount = lCount
' we now search for URLs
regEx.Pattern = patternURLs
' create anchors from URLs
Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")
' put back the originally existing anchors
For lCount = 0 To anchorCount-1
key = "<#" & lCount & "#>"
Text = Replace(Text,key, replacements.Item(key))
Next
Linkify = Text
End Function
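For readers outside ASP Classic, a variant of the same idea can be sketched in Python. Instead of placeholders it splits the text on existing anchors and linkifies only the pieces in between, which avoids any clash between placeholder strings and user text. The function name linkify, the simplified URL pattern and the sample text are all assumptions of this sketch, not part of the original function.

```python
import re

# Existing anchors; the capturing group makes re.split keep them in the result.
ANCHOR = re.compile(r"(<a\b[^>]*>.*?</a>)", re.IGNORECASE | re.DOTALL)
# Simplified URL pattern (an assumption, much looser than the original one).
URL = re.compile(r"(?:https?|ftp)://[^\s<>\"]+", re.IGNORECASE)

def linkify(text):
    """Wrap bare URLs in <a> tags, leaving existing anchors untouched."""
    parts = ANCHOR.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indexes hold the anchors captured by the split pattern.
            out.append(part)
        else:
            out.append(
                URL.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), part)
            )
    return "".join(out)

sample = 'see <a href="http://noreplace.com">http://noreplace.com</a> and http://replace.com now'
linked = linkify(sample)
```

Like the VBScript function, this sketch would still rewrite URLs inside other attributes (an img src, for example), so it only fits input where anchors are the sole place URLs appear in markup.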
The answer you're looking for lies in negative and positive lookaheads and lookbehinds.
This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html
Here's the Regular Expression I've formulated for your case:
(?<!"|>)(ht|f)tps?://.*?(?=\s|$)
Here's some sample data I matched against:
#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com
#No Matches
<u>http://www.website.com</u>
<a href="http://website.com">http://website.com</a>
<a href="ftp://www.website.com">ftp://www.website.com</a>
Here's a breakdown of what the regular expression is doing:
(?<!"|>)
A negative look behind, making sure what matches next isn't preceded by a " or >
(ht|f)tps?://.*?
This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use (https?|ftp)://.*? instead
(?=\s|$)
This is a positive lookahead, which requires whitespace or the end of the line to follow, without including it in the match.
EXTRA CREDIT
(ht)?(?(1)tps?|ftp)://
This will match http/https/ftp but not ftps. It may be overkill when you can use (https?|ftp):// instead, but it's a nice example of if/else in a regex.
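The lookbehind/lookahead combination can be checked with a small Python sketch. It uses the (https?|ftp) variant mentioned above (written as a non-capturing group) and a character class in the lookbehind, which is equivalent to "|> here. The function name find_bare_urls is hypothetical, and the inputs are taken from or modelled on the sample data.

```python
import re

# Not preceded by " or >, then a scheme, then as little as possible
# up to whitespace or the end of the input.
BARE_URL = re.compile(r'(?<![">])(?:https?|ftp)://.*?(?=\s|$)')

def find_bare_urls(text):
    """Return URLs that are not directly preceded by a quote or a '>'."""
    return BARE_URL.findall(text)

hits = find_bare_urls("This is a long link http://www.website.com/index.htm?foo=bar")
skipped = find_bare_urls('<a href="http://website.com">http://website.com</a>')
```

The check is purely positional: a URL that follows any other tag's closing > is skipped as well, which is why the <u> example lands in the "No Matches" list.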
Some design issues you're going to have to work around:
Embedded URLs could be absolute or relative and may not include the protocol.
Your HTML may not have quotes around attribute values.
The character right after a URL may also be a valid URL character.
There are lots of valid URL characters these days.
If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.
Here's an overly-simple example to start with (untested):
(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$) replaced with <a href="$1">$1</a>
The [^\s<>]+ part above is ridiculously greedy, so all of the fun will be in tweaking that to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.