Match shortest option - regex

I'm trying to use Outlook 2013 VBA to modify an email body by pulling out and replacing a <span> section. However, when there are multiple spans, I can't get the regex to pick up only one span.
Based on some other searches, I'm trying to use a negative lookahead, but failing at it.
Result from below is: <span><span style = blah blah>Tags: test, test2</span>
Desired result is: <span style = blah blah>Tags: test, test2</span>
Code for test module:
Sub regextest()
    Dim regex As New RegExp
    Dim testStr As String
    testStr = "a<span><span style=blah blah>Tags: test, test2</span></span>"
    regex.pattern = "<span.*?(?:(span)).*?Tags:.*?</span>"
    Set matches = regex.Execute(testStr)
    For Each x In matches
        Debug.Print x 'Result: <span><span style = blah blah>Tags: test, test2</span>
    Next
End Sub
Thank you!

Wiktor's answer in comments above works for my purposes:
<span\b[^<]*>[^<]*Tags:[^<]*</span>
This works as long as there are no '<' between the two span ends. Not really a lookahead, but it's good enough for what I'm doing and very simple.
Thanks Wiktor!
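A minimal sketch of how that pattern could be wrapped in VBA, assuming late binding via CreateObject("VBScript.RegExp") so that no library reference is needed. The function name InnermostTagsSpan is hypothetical and not part of the original code.

```vba
' Hypothetical helper: returns the first <span>...</span> that contains
' "Tags:" and has no "<" between its opening and closing tags,
' or an empty string when nothing matches.
Function InnermostTagsSpan(ByVal html As String) As String
    Dim re As Object
    Dim matches As Object

    Set re = CreateObject("VBScript.RegExp")
    ' [^<]* cannot cross another tag, so an outer <span> can never be
    ' part of the match.
    re.Pattern = "<span\b[^<]*>[^<]*Tags:[^<]*</span>"
    re.IgnoreCase = True

    Set matches = re.Execute(html)
    If matches.Count > 0 Then InnermostTagsSpan = matches(0).Value
End Function
```

Called with the test string from the question, this should return only the inner span, because the outer <span> is immediately followed by another "<".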

Related

vba regex: how to extract exact recurring match at start, between, at the end of a string

I'm trying with no luck to extract a recurring word inside a string using RegEx in Excel VBA.
Following an example:
Sub RegExTest()
 Dim re As Object
 Dim el As Object
 Const strText As String = "Fld,Fld,Fld,Fld,Fld,aFld1,bFld,cFld,Fld"
 Debug.Print strText
 With CreateObject("VBScript.RegExp")
  .Global = True
  .MultiLine = False
  .IgnoreCase = False
  .pattern = "(^Fld\,|\,Fld\,|\,Fld$)"
  If .Test(strText) Then
   Set re = .Execute(strText)
  End If
 End With
 For Each el In re
  Debug.Print el
 Next
End Sub
Result:
Fld,Fld,Fld,Fld,Fld,aFld1,bFld,cFld,Fld
Fld,
,Fld,
,Fld,
,Fld
The result that I get (4 elements) is not what I expect (6 elements).
I'm sure it is about a wrong pattern definition.
Can someone please help with the correct pattern?
Thanks in advance
The problem here is that your matches overlap. Each match consumes its trailing comma (the \, after Fld), so that comma is no longer available as the leading comma of the next match, and every second Fld fails to match \,Fld\,.
If you double up the commas in the input, you will see the expected number of matches.
The solution is to use a lookahead, which checks for the trailing comma without consuming it. If you absolutely need the trailing commas for some reason, just append them to the relevant matches.
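The lookahead idea can be sketched as follows. This is one possible pattern, not necessarily the one the answerer had in mind; the function name CountExactFields is hypothetical, and the word argument is assumed to contain no regex metacharacters.

```vba
' Hypothetical helper: counts comma-separated fields that are exactly
' equal to word. The trailing delimiter is only checked by the lookahead
' (?=,|$), so it stays available as the leading comma of the next match.
Function CountExactFields(ByVal text As String, ByVal word As String) As Long
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.MultiLine = False
    re.IgnoreCase = False
    ' Assumption: word has no characters that are special in a regex.
    re.Pattern = "(?:^|,)" & word & "(?=,|$)"

    CountExactFields = re.Execute(text).Count
End Function
```

For the string in the question, CountExactFields(strText, "Fld") should give the 6 elements the asker expected; aFld1, bFld and cFld are skipped because the character before Fld is not a comma or the start of the string.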

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as String
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. The idea was to use regex to pull each paragraph out, place it in a string, and then run other regexes against it to get the information I need. Some paragraphs won't contain the data I need, so the plan was to loop through each individual paragraph and handle errors more gracefully when the data isn't there (i.e. get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off (the "\s*" outside the group).
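A minimal sketch of using that pattern from VBA. The function name FirstParagraph is hypothetical. It also assumes that Windows line endings should be normalised first, since the pattern expects a bare line feed right after each dash; that step is an addition, not part of the answer above.

```vba
' Hypothetical helper: returns the text of the first numbered paragraph
' that sits between two "-" lines, or an empty string if none is found.
Function FirstParagraph(ByVal document As String) As String
    Dim re As Object
    Dim matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' MultiLine is left at its default (False); line ends are matched
    ' explicitly with \n inside the pattern.
    re.Pattern = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"

    ' Assumption: convert CR+LF to LF so that "-\n" can match files
    ' written with Windows line endings.
    Set matches = re.Execute(Replace(document, vbCrLf, vbLf))
    If matches.Count > 0 Then FirstParagraph = matches(0).SubMatches(0)
End Function
```

Note that each match consumes its closing dash, so if consecutive paragraphs share a single "-" line, a global search with this pattern may skip every second paragraph; checking the closing dash with a lookahead would be one way around that.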

basic4android - regex match with url

I have the following regex pattern for testing a url and it works fine with all online regex testers including b4a's original regex tester(https://b4x.com:51041/regex_ws/index.html) but not in code!!
Sub Validate(Url As String)As String
Dim Pattern As String=$"^(https:\/\/t\.me\/|https:\/\/telegram\.me\/)[a-z0-9_]{3,15}[a-z0-9]$"$
Dim matcher1 As Matcher
matcher1 = Regex.Matcher(Url,Pattern)
Return matcher1.Find
End Sub
And my Url is
Https:// telegram . me (plus something like 'myChannel', with no spaces of course; the editor here won't allow Telegram links, so remove the spaces if you want to test it)
It always returns false, whatever form of the URL I try.
Thanks to @bulbus, the solution for anyone who may face this problem is:
Sub Validate(Url As String)As String
Dim Pattern As String=$"^(https:\/\/t\.me\/|https:\/\/telegram\.me\/)[a-zA-Z0-9_]{3,15}[a-zA-Z0-9]$"$
Dim matcher1 As Matcher
matcher1= Regex.Matcher2(Pattern,Regex.MULTILINE,Url)
Return matcher1.Find
End Sub
Option 1
Use
matcher1 = Regex.Matcher2(Pattern, Regex.CASE_INSENSITIVE, Url)
OR
Option 2
Use
Dim Pattern As String=$"^(https:\/\/t\.me\/|https:\/\/telegram\.me\/)[a-zA-Z0-9_]{3,15}[a-zA-Z0-9]$"$
Instead of
Dim Pattern As String=$"^(https:\/\/t\.me\/|https:\/\/telegram\.me\/)[a-z0-9_]{3,15}[a-z0-9]$"$
I hope both solutions are self explanatory!
EDIT
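After the OP accepted the answer, a little bit of explanation. In MULTILINE mode the line-begin ^ and line-end $ anchors match at the start and end of every line; without it they match only at the start and end of the whole input. Note also that the working call passes the pattern first and the text last, whereas the original call passed Url before Pattern.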

Regular Expressions Vbscript

I am fiddling with regular expressions to shorten a string splitting routine I have been using.
I have a string for my cart that is submitted to an asp script as follows:
addnothing|-1, addRST115400112*2xl|0, addnothing|-1, addnothing|-1, addRST115400115*xs|0, addnothing|-1
I want to be able to extract the two entries that represent two stock items:
addRST115400112*2xl|0
addRST115400115*xs|0
I have managed to get this bit of code to work but I am unsure about the pattern I am using:
add[^n](.*)\*(.*)\|[0-9],
This returns this:
addRST115400112*2xl|0, addnothing|-1, addnothing|-1, addRST115400115*xs|0,
but I only want it to return :
addRST115400112*2xl|0
addRST115400115*xs|0
Can anybody point me in the right direction please?
You were matching greedily: .* eats as much as it can, so in your case it runs all the way to the last \|[0-9], i.e. the final |0.
You should match lazily by using .*? instead of .*
So your regex should be
add(?!nothing)(.*?)\*(.*?)\|\d
\d is equivalent to [0-9].
(?!nothing) is just a check; it doesn't match or consume anything. It's better than [^n] because it is more reliable and more expressive, and it doesn't eat a character.
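Here is a small sketch of that lazy pattern applied to the cart string. It is written as a VBA function with late binding so it can be run outside ASP; the name StockItems and the choice to join the results with line feeds are assumptions made for illustration.

```vba
' Hypothetical helper: returns every "add...*...|digit" entry that does
' not start with "addnothing", one per line.
Function StockItems(ByVal cart As String) As String
    Dim re As Object
    Dim m As Object
    Dim result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' (?!nothing) rejects "addnothing" without consuming anything;
    ' .*? keeps each group as short as possible.
    re.Pattern = "add(?!nothing)(.*?)\*(.*?)\|\d"

    For Each m In re.Execute(cart)
        If Len(result) > 0 Then result = result & vbLf
        result = result & m.Value
    Next

    StockItems = result
End Function
```

For the sample string in the question this should yield the two stock entries, each on its own line. SubMatches(0) and SubMatches(1) of each match hold the part before and after the asterisk, if those are needed separately.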
Trying to keep the .Pattern simple (this is VBScript!) and to make tinkering with it easier (what really singles out stock items is by no means clear):
Dim sInp : sInp = "addnothing|-1, addRST115400112*2xl|0, addnothing|-1, addnothing|-1, addRST115400115*xs|0, addnothing|-1"
Dim reCut : Set reCut = New RegExp
reCut.Global = True
reCut.Pattern = "addR[^|]+\|\d"
Dim oMTS : Set oMTS = reCut.Execute(sInp)
If 2 = oMTS.Count Then
WScript.Echo "Success:", Join(Array(oMTS(0).Value, oMTS(1).Value))
Else
WScript.Echo "Bingo:", reCut.Pattern
End If
output:
Success: addRST115400112*2xl|0 addRST115400115*xs|0

using classic asp for regular expression

We have some Classic ASP sites that I'm doing a little work on, and I was wondering how I can write a regular expression check and extract the matched expression:
the expression I have is in the script's name
so Let's say this
Response.Write Request.ServerVariables("SCRIPT_NAME")
Prints out:
review_blabla.asp
review_foo.asp
review_bar.asp
How can I get the blabla, foo and bar from there?
Thanks.
Whilst Yots' answer is almost certainly correct, you can achieve the result you are looking for with a lot less code and somewhat more clearly:
'A handy function i keep lying around for RegEx matches'
Function RegExResults(strTarget, strPattern)
Set regEx = New RegExp
regEx.Pattern = strPattern
regEx.Global = true
Set RegExResults = regEx.Execute(strTarget)
Set regEx = Nothing
End Function
'Pass the original string and pattern into the function and get a collection object back'
Set arrResults = RegExResults(Request.ServerVariables("SCRIPT_NAME"), "review_(.*?)\.asp")
'In your pattern the answer is the first group, so all you need is'
For each result in arrResults
Response.Write(result.Submatches(0))
Next
Set arrResults = Nothing
Additionally, I have yet to find a better RegEx playground than Regexr, it's brilliant for trying out your regex patterns before diving into code.
You have to use the Submatches Collection from the Match Object to get your data out of the review_(.*?)\.asp Pattern
Function getScriptNamePart(scriptname)
dim RegEx : Set RegEx = New RegExp
dim result : result = ""
With RegEx
.Pattern = "review_(.*?)\.asp"
.IgnoreCase = True
.Global = True
End With
Dim Match, Submatch
dim Matches : Set Matches = RegEx.Execute(scriptname)
dim SubMatches
For Each Match in Matches
For Each Submatch in Match.SubMatches
result = Submatch
Exit For
Next
Exit For
Next
Set Matches = Nothing
Set SubMatches = Nothing
Set Match = Nothing
Set RegEx = Nothing
getScriptNamePart = result
End Function
You can do
review_(.*?)\.asp
See it here on Regexr
You will then find your result in capture group 1.
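A compact sketch of reading capture group 1, written as a VBA function so it can be tried without an ASP page. The name ScriptNamePart is hypothetical, and the script name is passed in as a plain string instead of being read from Request.ServerVariables.

```vba
' Hypothetical helper: returns the part between "review_" and ".asp",
' or an empty string when the name does not fit the pattern.
Function ScriptNamePart(ByVal scriptName As String) As String
    Dim re As Object
    Dim matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Pattern = "review_(.*?)\.asp"

    Set matches = re.Execute(scriptName)
    ' SubMatches is zero-based, so capture group 1 is SubMatches(0).
    If matches.Count > 0 Then ScriptNamePart = matches(0).SubMatches(0)
End Function
```

With the three script names from the question this should give blabla, foo and bar respectively.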
You can use the RegExp object to do so.
Your code will look something like this:
Set RegularExpressionObject = New RegExp
RegularExpressionObject.Pattern = "review_(.*)\.asp"
Set matches = RegularExpressionObject.Execute("review_blabla.asp")
Sorry, I can't test the code above right now.
Check out usage at MSDN http://msdn.microsoft.com/en-us/library/ms974570.aspx