Excluding line breaks from regex capture - regex

I realise that a similar question has been asked before and answered, but the problem persists after I've tried the solution proposed in that answer.
I want to write an Excel macro to separate a multi-line string into multiple single lines, trimmed of whitespace including line breaks. This is my code:
Sub testRegexMatch()
Dim r As New VBScript_RegExp_55.regexp
Dim str As String
Dim mc As MatchCollection
r.Pattern = "[\r\n\s]*([^\r\n]+?)[\s\r\n]*$"
r.Global = True
r.MultiLine = True
str = "This is a haiku" & vbCrLf _
& "You may read it if you wish " & vbCrLf _
& " but you don't have to"
Set mc = r.Execute(str)
For Each Line In mc
Debug.Print "^" & Line & "$"
Next Line
End Sub
Expected output:
^This is a haiku$
^You may read it if you wish$
^but you don't have to$
Actual output:
^This is a haiku
$
^
You may read it if you wish
$
^
but you don't have to$
I've tried the same thing on Regex101, but this appears to show the correct captures, so it must be a quirk of VBA's regex engine.
Any ideas?

You just need to access the captured values via SubMatches():
When a regular expression is executed, zero or more submatches can result when subexpressions are enclosed in capturing parentheses. Each item in the SubMatches collection is the string found and captured by the regular expression.
Here is my demo:
Sub DemoFn()
    Dim re, targetString, colMatch, objMatch
    Set re = New RegExp
    With re
        .Pattern = "\s*([^\r\n]+?)\s*$"
        .Global = True ' Same as /g at the online tester
        .MultiLine = True ' Same as /m at regex101.com
    End With
    targetString = "This is a haiku " & vbLf & " You may read it if you wish " & vbLf & " but you don't have to"
    Set colMatch = re.Execute(targetString)
    For Each objMatch In colMatch
        Debug.Print objMatch.SubMatches.Item(0) ' <== SEE HERE
    Next
End Sub
It prints:
This is a haiku
You may read it if you wish
but you don't have to

Related

RegEx array / list / collection of all matches in VBA

I'm trying to use RegEx to get all instances of varying strings that occur between a particular pair of strings. E.g. in the following string:
"The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"
I want to get a collection / array consisting of "Jamie" and "Sarah" by checking in between "Hello. " and ". Bye. "
My RegEx object is working fine and I feel I'm nearly successful:
Sub Reggie()
Dim x As String: x = "The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"
Dim regEx As RegExp
Set regEx = New RegExp
Dim rPat1 As String: rPat1 = "Hello. "
Dim rPat2 As String: rPat2 = " Bye."
Dim rPat3 As String: rPat3 = ".*"
With regEx
.Global = True
.ignorecase = True
.Pattern = "(^.*" & rPat1 & ")(" & rPat3 & ")(" & rPat2 & ".*)"
.MultiLine = True
' COMMAND HERE
End With
End Sub
For the last bit, COMMAND HERE, I've tried .Replace(x, "$2"), which gives me a string containing only the last match, i.e. Sarah.
I've also tried .Execute(x), which gives me a MatchCollection object, but when I inspect it in the Immediate window it only holds the last match.
Is what I need possible, and if so, how?
That is because .* matches as many characters as possible, so you should not match the whole string by adding .* at both ends of your regular expression.
Besides, you need to escape special characters in the regex pattern. Here, . is special because it matches any character other than a line break.
You need to fix your regex declaration like this:
rPat1 = "Hello\. "
rPat2 = " Bye\."
rPat3 = ".*?"
.Pattern = rPat1 & "(" & rPat3 & ")" & rPat2
Or, to further enhance the regex, you may:
Replace literal spaces with \s* (zero or more whitespace characters) or \s+ (one or more whitespace characters) to support any whitespace.
Match any non-word characters after the captured string with \W+ or \W*.
rPat1 = "Hello\.\s*"
rPat2 = "\W+Bye\."
rPat3 = ".*?"
.Pattern = rPat1 & "(" & rPat3 & ")" & rPat2
See the regex demo. Details:
Hello\. - Hello. string
\s* - zero or more whitespaces
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\W+ - one or more chars other than ASCII letters/digits/_
Bye\. - Bye. string.
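Putting the pieces together, a minimal sketch of collecting every captured name could look like the following. It assumes late binding through CreateObject (so no reference to the regex library is needed); the procedure name and the found collection are made up for illustration and are not part of the original answer.

```vba
Sub ListNamesBetweenMarkers()
    ' Late binding: no reference to "Microsoft VBScript Regular Expressions 5.5" required
    Dim re As Object
    Dim m As Object
    Dim found As New Collection   ' hypothetical name, holds the captured strings
    Dim item As Variant
    Dim src As String

    src = "The Start. Hello. Jamie. Bye. The Middle. Hello. Sarah. Bye. The End"

    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = True            ' return every match, not just the first
        .IgnoreCase = True
        .Pattern = "Hello\.\s*(.*?)\W+Bye\."
    End With

    ' Each Match carries its capture groups in SubMatches (0-based)
    For Each m In re.Execute(src)
        found.Add m.SubMatches(0)
    Next m

    For Each item In found
        Debug.Print item
    Next item
End Sub
```

With the sample string from the question, the collection should end up holding Jamie and Sarah, because the lazy (.*?) stops at the first Bye. after each Hello.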

Regex to replace word except in comments

How can I modify my regex so that it will ignore the comments in the pattern in a language that doesn't support lookbehind?
My regex pattern is:
\b{Word}\b(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
\b{Word}\b : Whole word, {word} is replaced iteratively for the vocab list
(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) : Don't replace anything inside of quotes
My goal is to lint variables and words so that they always have the same case. However I do not want to lint any words in a comment. (The IDE sucks and there is no other option)
Comments in this language are prefixed by an apostrophe. Sample code follows:
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.value = 1234 ' Set value
value = 123
Basically I want the linter to take the above code and say for the word "value" update it to:
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.Value = 1234 ' Set value
Value = 123
In other words, every occurrence of "Value" in actual code is updated, while anything inside double quotes, inside a comment, or forming part of another word such as valueadded is left untouched.
I've tried several solutions but haven't been able to get it to work.
['.*] : Not preceded by an apostrophe
(?<!\s*') : Lookbehind: not preceded by an apostrophe and any number of spaces
(?<!\s*') : The second example also seemed incorrect, and in any case it won't work because the language doesn't support lookbehind
Does anybody have any ideas how I can alter my pattern so that commented variables are not edited?
VBA
Sub TestSO()
Dim Code As String
Dim Expected As String
Dim Actual As String
Dim Words As Variant
Code = "item = object.value ' Put item in value" & vbNewLine & _
"some.item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.value = 1234 ' Set value" & vbNewLine & _
"value = 123" & vbNewLine
Expected = "Item = object.Value ' Put item in value" & vbNewLine & _
"some.Item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.Value = 1234 ' Set value" & vbNewLine & _
"Value = 123" & vbNewLine
Words = Array("Item", "Value")
Actual = SOLint(Words, Code)
Debug.Print Actual = Expected
Debug.Print "CODE: " & vbNewLine & Code
Debug.Print "Actual: " & vbNewLine & Actual
Debug.Print "Expected: " & vbNewLine & Expected
End Sub
Public Function SOLint(ByVal Words As Variant, ByVal FileContents As String) As String
Const NotInQuotes As String = "(?=([^""\\]*(\\.|""([^""\\]*\\.)*[^""\\]*""))*[^""]*$)"
Dim RegExp As Object
Dim Regex As String
Dim Index As Variant
Set RegExp = CreateObject("VBScript.RegExp")
With RegExp
.Global = True
.IgnoreCase = True
End With
For Each Index In Words
Regex = "[('*)]\b" & Index & "\b" & NotInQuotes
RegExp.Pattern = Regex
FileContents = RegExp.Replace(FileContents, Index)
Next Index
SOLint = FileContents
End Function
As discussed in the comments above:
((?:\".*\")|(?:'.*))|\b(v)(alue)\b
The regex has three parts, combined with alternation:
A non-capturing group for text within double quotes, as we don't need that.
A non-capturing group for text starting with a single quote.
Finally, the string "value" is split into two parts, (v) and (alue), because when replacing we can use \U$2 to convert v to V and \E$3 to keep the rest as is, where \U converts to upper case and \E turns the case conversion off.
\b \b - word boundaries ensure that only the standalone word is matched, not a longer word that merely contains it (such as valueadded).
https://regex101.com/r/mD9JeR/8
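As far as I'm aware, the Replace method of VBScript.RegExp has no \U or \E case-conversion tokens (those come from the regex flavour used on regex101), so the same alternation idea needs a small loop in VBA. The sketch below is one way to do it, not the answerer's code: the function name LintWord is hypothetical, and it assumes each vocabulary word is a plain identifier with no regex metacharacters. Group 1 swallows string literals and comments; a match where group 1 is empty is a real code occurrence, which is overwritten in place with the Mid statement (safe because only the letter case changes, so the length stays the same).

```vba
' Hypothetical helper: normalise the case of one word outside strings and comments
Public Function LintWord(ByVal code As String, ByVal word As String) As String
    Dim re As Object
    Dim m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Alternative 1 (captured): a "..." literal or an apostrophe comment up to end of line
    ' Alternative 2 (not captured): the whole word itself
    re.Pattern = "(""[^""]*""|'.*)|\b" & word & "\b"

    For Each m In re.Execute(code)
        ' Group 1 did not take part, so this match is the word in real code
        If Len(m.SubMatches(0)) = 0 Then
            ' FirstIndex is 0-based, Mid is 1-based; same length, so offsets stay valid
            Mid(code, m.FirstIndex + 1, m.Length) = word
        End If
    Next m

    LintWord = code
End Function
```

In the question's SOLint function this could be called once per entry in Words, for example FileContents = LintWord(FileContents, Index).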

How to capture several portions of a string at once with regex?

I need to capture several strings within a longer string strText and process them. I use VBA.
strText:
Salta pax {wenn([gender]|1|orum|2|argentum)} {[firstname]} {[lastname]},
ginhox seperatum de gloria desde quativo,
dolus {[start]} tofi {[end]}, ([{n_night]}
{wenn([n_night]|1|dignus|*|digni)}), cum {[n_person]}
{wenn([n_person]|1|felix|*|semporum)}.
Quod similis beruntur: {[number]}
I'm trying to capture different portions of strText, all within the curly braces:
If there's only a string within square brackets, I'd like to capture the string:
{[firstname]} --> firstname
If there's a conditional operation (starting with wenn()), I'd like to capture the string within the square brackets plus the number-value-pairs after:
{[gender]|1|orum|2|argentum} --> gender / 1=orum / 2=argentum
I managed to define a pattern to get any one of the tasks above,
e.g. \{\[(.+?)\]\} capturing the strings within square brackets,
see this regex101
but I figure there must be a way to have a pattern that does all of the above?
I'm not sure if the following code is helpful to you. It uses the | symbol to capture both conditions.
Function extractStrings(strText As String) As MatchCollection
    Dim regEx As New RegExp
    Dim SubStrings As MatchCollection
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = "(\{\[)(.+?)(\]\})|(wenn\(\[)(.+?)(\])(\|)(.+?)(\|)(.+?)(\|)(.+?)(\|)(.+?)(\)\})"
    End With
    On Error Resume Next
    Set extractStrings = regEx.Execute(strText)
    If Err = 0 Then Exit Function
    Set extractStrings = Nothing
End Function
Sub test()
    Dim strText As String
    strText = "Salta pax {wenn([gender]|1|orum|2|argentum)} {[firstname]} {[lastname]},ginhox seperatum de gloria desde quativo,dolus {[start]} tofi {[end]}, ([{n_night]} " & _
        "{wenn([n_night]|1|dignus|*|digni)}), cum {[n_person]}{wenn([n_person]|1|felix|*|semporum)}.Quod similis beruntur: {[number]}"
    Dim SubStrings As MatchCollection
    Dim SubString As Match
    Set SubStrings = extractStrings(strText)
    For Each SubString In SubStrings
        On Error Resume Next
        If SubString.SubMatches(1) <> "" Then
            Debug.Print SubString.SubMatches(1)
        Else
            Debug.Print "wenn(" & SubString.SubMatches(4) & "|" & SubString.SubMatches(7) & "=" & SubString.SubMatches(9) & "|" & SubString.SubMatches(11) & "=" & SubString.SubMatches(13) & ")"
        End If
    Next SubString
End Sub
You can iterate through all substrings with the For Each loop. I am well aware that the regex pattern is not optimal, but at least it does the trick.
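If the many capture groups are hard to keep track of, the same two-alternative idea can be sketched with groups only around the parts that are actually needed. This is an illustrative variant under my own assumptions (late binding, a shortened sample string, a made-up procedure name), not a replacement for the code above: the plain placeholder lands in SubMatches(0), and the wenn form lands in SubMatches(1) to SubMatches(5).

```vba
Sub PrintPlaceholders()
    Dim re As Object
    Dim m As Object
    Dim s As String

    ' Shortened sample taken from the question's text
    s = "Salta pax {wenn([gender]|1|orum|2|argentum)} {[firstname]} {[lastname]}"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' Alternative 1: {[name]}                   -> SubMatches(0)
    ' Alternative 2: {wenn([name]|k1|v1|k2|v2)} -> SubMatches(1) .. SubMatches(5)
    re.Pattern = "\{\[(.+?)\]\}|\{wenn\(\[(.+?)\]\|(.+?)\|(.+?)\|(.+?)\|(.+?)\)\}"

    For Each m In re.Execute(s)
        If Len(m.SubMatches(0)) > 0 Then
            ' Simple placeholder
            Debug.Print m.SubMatches(0)
        Else
            ' Conditional placeholder: name / key=value / key=value
            Debug.Print m.SubMatches(1) & " / " & _
                        m.SubMatches(2) & "=" & m.SubMatches(3) & " / " & _
                        m.SubMatches(4) & "=" & m.SubMatches(5)
        End If
    Next m
End Sub
```

Like the original pattern, this sketch only handles wenn(...) blocks with exactly two key/value pairs.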

Marking word as "find and replace" in Microsoft Word with RegEx

I'm trying to highlight each word found by a RegEx and, if the user confirms, replace it with its corresponding substitute.
The code works correctly only when nothing is substituted.
Do I perhaps need to redo the search after every replacement?
Sub Replace()
Dim regExp As Object
Set regExp = CreateObject("vbscript.regexp")
Dim arr As Variant
Dim arrzam As Variant
Dim i As Long
Dim choice As Integer
Dim Document As Word.Range
Set Document = ActiveDocument.Content
On Error Resume Next
'EGN
'IBAN
arr = VBA.Array("((EGN(:{0,1})){0,1})[0-9]{10}", _
"[a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16}")
arrzam = VBA.Array("[****]", _
"[IBAN]")
With regExp
For i = 0 To UBound(arr)
.Pattern = arr(i)
.Global = True
For Each Match In regExp.Execute(Document)
ActiveDocument.Range(Match.FirstIndex, Match.FirstIndex + Match.Length).Duplicate.Select
choice = MsgBox("Replace " & Chr(34) & Match.Value & Chr(34) & " with " & Chr(34) & arrzam(i) & Chr(34) & "?", _
vbYesNoCancel + vbDefaultButton1, "Replace")
If choice = vbYes Then
Document = .Replace(Document, arrzam(i))
ElseIf choice = vbCancel Then
Next
End If
Next
Next
End With
End Sub
Actually, there are several things wrong with this.
First, the collection that For Each Match iterates over is static: it is determined when the loop starts. You're changing the document in the meantime, so each successive Match points at an old position.
Second, you're replacing all the occurrences at one time, so there is no need to loop through them. It seems a one line, one time Replace could do the same thing.
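One common way around both problems is to walk the matches from last to first and replace only the range of the current match, so that earlier positions are never shifted. The sketch below illustrates that idea under some assumptions: it uses a simplified, hypothetical pattern and substitute, and it assumes the character offsets of ActiveDocument.Content.Text line up with document positions, which may not hold in documents containing fields, tables or other special content.

```vba
Sub ReplaceWithConfirmation()
    Dim re As Object
    Dim matches As Object
    Dim m As Object
    Dim rng As Object              ' a Word Range
    Dim i As Long
    Dim choice As VbMsgBoxResult
    Const SUBSTITUTE As String = "[****]"   ' hypothetical replacement text

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "[0-9]{10}"       ' simplified stand-in for the patterns in the question

    ' The collection is computed once, against the text as it is right now
    Set matches = re.Execute(ActiveDocument.Content.Text)

    ' Go backwards: replacing a later match never moves an earlier one
    For i = matches.Count - 1 To 0 Step -1
        Set m = matches.Item(i)
        Set rng = ActiveDocument.Range(m.FirstIndex, m.FirstIndex + m.Length)
        rng.Select                 ' highlight the candidate for the user

        choice = MsgBox("Replace """ & m.Value & """ with """ & SUBSTITUTE & """?", _
                        vbYesNoCancel + vbDefaultButton1, "Replace")
        If choice = vbCancel Then Exit Sub
        If choice = vbYes Then rng.Text = SUBSTITUTE   ' replaces this occurrence only
    Next i
End Sub
```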

VBS script to report AD groups - Regex pattern not working with multiple matches

Having an issue with getting a regex statement to accept two expressions.
The "re.pattern" code here works:
If UserChoice = "" Then WScript.Quit 'Detect Cancel
re.Pattern = "[^(a-z)^(0,4,5,6,7,8,9)]"
re.Global = True
re.IgnoreCase = True
if re.test( UserChoice ) then
Exit Do
End if
MsgBox "Please choose either 1, 2 or 3 ", 48, "Invalid Entry"
The regEx.Pattern code below, however, does not work. I want to use it to format the results of a DSQUERY command where groups are collected, but I don't want any of the info after the ",", nor do I want the leading CN= that is normally collected when the following dsquery is run:
"dsquery.exe user forestroot -samid "& strInput &" | dsget user -memberof")
The string I want to format would look something like this before formatting:
CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz
This is the result I want:
APP_GROUP_123
Set regEx = New RegExp
regEx.Pattern = "[,.*]["CN=]"
Result = regEx.Replace(StrLine, "")
I'm only able to get the regex to work when used individually, either
regEx.Pattern = ",."
or
regEx.Pattern = "CN="
code is nested here:
Set InputFile = FSO.OpenTextFile("Temp.txt", 1)
set OutPutFile = FSO.OpenTextFile(StrInput & "-Results.txt", 8, True)
do While InputFile.AtEndOfStream = False
StrLine = InputFile.ReadLine
If inStr(strLine, TaskChoice) then
Set regEx = New RegExp
regEx.Pattern = "[A-Za-z]{2}=(.+?),.*"
Result = regEx.Replace(StrLine, "")
OutputFile.write(Replace(Result,"""","")) & vbCrLf
End if
This should get you started:
str = "CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz"
Set re = New RegExp
re.Pattern = "[A-Za-z]{2}=(.+?),.*"
If re.Test(str) Then
    Set matches = re.Execute(str)
    matched_str = "Matched: " & matches(0).SubMatches(0)
    WScript.Echo matched_str
Else
    WScript.Echo "Not a match"
End If
Output: Matched: APP_GROUP_123
The regex you need is [A-Za-z]{2}=(.+?),.*
If the match is successful, it captures everything in the parentheses. .+? matches any characters non-greedily, up to the first comma; it is the ? in .+? that makes the expression non-greedy. If you were to omit it, you would capture everything up to the final comma at ,DC=biz
Your regular expression "[,.*]["CN=]" doesn't work for two reasons:
It contains an unescaped double quote. Double quotes inside VBScript strings must be escaped by doubling them; otherwise the interpreter reads your expression as a string "[,.*][", followed by an (invalid) variable name CN=] (without an operator, too) and the beginning of the next string (the third double quote).
You misunderstand regular expression syntax. Square brackets indicate a character class. The expression [,.*] matches any single comma, period or asterisk, not a comma followed by any number of characters.
What you meant to use was an alternation, which is expressed by a pipe symbol (|), and the beginning of a string is matched by a caret (^):
regEx.Pattern = ",.*|^CN="
With that said, in your case a better approach would be using a group and replacing the whole string with just the group match:
regEx.Pattern = "^cn=(.*?),.*"
regEx.IgnoreCase = True
Result = regEx.Replace(strLine, "$1")
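Wrapped up as a small helper, the group-based replacement might look like the sketch below. The function name is made up for illustration, and stripping the double quotes first is an assumption based on the question's own code, which removes them from each line (with a leading quote in place, the ^cn= anchor would not match). The code is written as VBA; in a plain .vbs script the As String and As Object clauses would have to be dropped.

```vba
' Hypothetical helper: extract the group name from one distinguished name
Function GroupNameFromDN(ByVal dn As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")

    ' Assumption: the line may be wrapped in double quotes, so remove them first
    dn = Replace(dn, """", "")

    re.IgnoreCase = True
    ' Capture everything between the leading CN= and the first comma,
    ' then let the trailing .* consume the rest of the line
    re.Pattern = "^cn=(.*?),.*"

    ' Replace the whole line with just the captured group
    GroupNameFromDN = re.Replace(dn, "$1")
End Function
```

For the sample line CN=APP_GROUP_123,OU=Global Groups,OU=Accounts,DC=corp,DC=contoso,DC=biz this returns APP_GROUP_123. A line that does not start with CN= is returned unchanged (apart from the removed quotes), since Replace leaves non-matching input as it is.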