Specific VBA / VBScript regex related issue (MS Access)

UPDATE: March 12, 2018 1:44pm CST
After creating a website called http://vbfiddle.net to implement and test #ctwheels' VBScript solution in a browser (MS IE 10+ with Security set to "Medium"; the website has setup instructions should you want to play with it, and the code to copy and paste into vbfiddle.net is at https://jsfiddle.net/hcwjhmg9/, since vbfiddle.net does not currently have a "save" feature), I found that #ctwheels' VBScript RegEx ran successfully, even for the 3rd example line I gave. However, when the same RegEx is used in VBA for Microsoft Access 2016 against a record read from a database with the "same" value, the third subgroup returns only "Ray," for the 3rd example line, when it should return "Ray, CFP" as it does in vbfiddle.net.
It finally occurred to me to iterate through every character of the string value returned by the database (in VBA in Microsoft Access), and compare it to an iteration of every character of the visually-equivalent string value I type directly into the code (in VBA in Microsoft Access). I get the following results:
First Name and Last Name: "G.L. (Pete) Ray, CFP"
--- 1st Text chars: "71 46 76 46 32 40 80 101 116 101 41 32 82 97 121 44"
(Read value from database, appears same as below when Debug.Print is called on it)
--- 2nd Text chars: "71 46 76 46 32 40 80 101 116 101 41 32 82 97 121 44 32 67 70 80" (Typed by keyboard into a string within the code)
'G.L. (Pete) Ray,'
strProperName>objSubMatch: "G.L."
strProperName>objSubMatch: "Pete"
strProperName>objSubMatch: "Ray,"
Matching record(s): 3 of 1132 record(s).
The RegEx is being run against the "1st Text chars" example and returns "Ray," for the 3rd subgroup of the previously given 3rd example line: "G.L. (Pete) Ray, CFP". However, if I run the RegEx against the "2nd Text chars" example (the string typed directly into code), the 3rd subgroup returns "Ray, CFP" as expected in VBA for Microsoft Access 2016.
I'm now using the RegEx that #ctwheels provided:
^([^(]+?)\s*\(\s*([^)]*?)\s*\)\s*(.*)
Can someone explain what's going on here? 1) Why are the characters returned from the database different from the characters returned from typing the string using a keyboard by reading and copying it visually? 2) How do I make a RegEx that works on the "1st Text Chars" sequence of characters / string return the correct 3rd subgroup: "Ray, CFP" when the value is read directly from the database?
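For what it's worth, decoding the two character-code lists quoted above makes the difference visible. The following is a minimal Python sketch (Python is used here only as a convenient way to check the numbers, and the names `decode`, `first_text` and `second_text` are made up for illustration). It suggests that the database-derived string simply ends after the comma, so it is a strict prefix of the typed string and no pattern could return " CFP" from it.

```python
# Character codes copied from the "1st Text chars" and "2nd Text chars" lines above.
first_codes = "71 46 76 46 32 40 80 101 116 101 41 32 82 97 121 44"
second_codes = "71 46 76 46 32 40 80 101 116 101 41 32 82 97 121 44 32 67 70 80"


def decode(codes):
    """Turn a space-separated list of decimal character codes back into text."""
    return "".join(chr(int(code)) for code in codes.split())


first_text = decode(first_codes)    # value read from the database
second_text = decode(second_codes)  # value typed into the code

print(repr(first_text))
print(repr(second_text))
# Whatever the typed string has beyond the database string:
print(repr(second_text[len(first_text):]))
```

If that reading is right, the thing to investigate is where the value loses its tail before it reaches the regex (for example a trimmed variable or a different field), rather than the pattern itself.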
ORIGINAL QUESTION (updated question above):
I'm having problems in VBA using Microsoft Access 2016 with the VBScript regex engine (version 5.5, I believe).
This is the regex expression I'm currently using:
"(.*)\((.*)(\))(.*)"
I'm trying to parse the strings (respectively on each new line):
Lawrence N. (Larry) Solomon
James ( Jim ) Alleman
G.L. (Pete) Ray, CFP
Into:
"Lawrence N.", "Larry", ")", "Solomon"
"James", "Jim", ")", "Alleman"
"G.L.", "Pete", ")", "Ray, CFP"
Or alternatively (and preferably) into:
"Lawrence N.", "Larry", "Solomon"
"James", "Jim", "Alleman"
"G.L.", "Pete", "Ray, CFP"
where the parts within the quotes, separated by commas, are those returned in the submatches (without quotes)
I am using the following code:
' For Proper Name (strProperName):
With objRegex
.Global = False
.MultiLine = False
.IgnoreCase = True
.Pattern = "(.*)\((.*)(\))(.*)"
'([\s|\S]*) work around to match every character?
'".*\(([^\s]*)[\s]*\(" '_
''& "[\"
'[\(][\s]*([.|^\w]*)[\s]*\)"
' "[\s]*(.*)[\s]*\("
' does same as below except matches any or no whitespace preceding any characters,
' and returns the following characters up to an opening parenthesis ("(") but excluding it,
' as the first subgroup
' "(.*)[\s]*\("
' does same as below except matches any whitespace or no whitespace at all followed by an opening parenthesis ("(")
' and returns the preceding characters as the first subgroup
' "(.*)\("
' matches all characters in a row that end with an open parenthesis, and returns all of these characters in a row
' excluding the following open parenthesis as the first subgroup
' "(.*?\(\s)"
' "[^\(]*"
' this pattern returns every character that isn't an opening parenthesis ("("), and when
' it matches an open parenthesis, it does not return it or any characters after it
' "[\(\s](.*)\)"
' this pattern extracts everything between parenthesis in a line as its first submatch
' "(?<=\().*"
' "[^[^\(]*][.*]"
' "(\(.*?\))"
' "(\(.*?\))*([^\(].*[^\)])"
End With
If objRegex.Test(strFirstNameTrimmed) Then
'Set strsMatches = objRegex.Execute(rs.Fields("First Name"))
Set strsMatches = objRegex.Execute(strFirstNameTrimmed)
Debug.Print "2:'" & strsMatches(0).Value & "'"
If strsMatches(0).SubMatches.Count > 0 Then
For Each objSubMatch In strsMatches(0).SubMatches
Debug.Print " strProperName>objSubMatch: """ & objSubMatch & """" 'Result: 000, 643, 888"
strProperName = objSubMatch
Next objSubMatch
End If
Else
strProperName = "*Not Matched*"
End If
Produces the following output in the debug window / "Immediate Window" as it's known in VBA, brought up by (Ctrl+G):
------------------------
First Name and Last Name: "Lawrence N. (Larry) Solomon"
2:'Lawrence N. (Larry)'
strProperName>objSubMatch: "Lawrence N. "
strProperName>objSubMatch: "Larry"
strProperName>objSubMatch: ")"
strProperName>objSubMatch: ""
Extracted Nick Name: "Larry"
Extracted Proper Name: ""
First Name and Last Name: "James ( Jim ) Alleman"
2:'James ( Jim )'
strProperName>objSubMatch: "James "
strProperName>objSubMatch: " Jim "
strProperName>objSubMatch: ")"
strProperName>objSubMatch: ""
Extracted Nick Name: "Jim"
Extracted Proper Name: ""
First Name and Last Name: "G.L. (Pete) Ray, CFP"
2:'G.L. (Pete) Ray,'
strProperName>objSubMatch: "G.L. "
strProperName>objSubMatch: "Pete"
strProperName>objSubMatch: ")"
strProperName>objSubMatch: " Ray,"
Extracted Nick Name: "Pete"
Extracted Proper Name: " Ray,"
Matching record(s): 3 of 1132 record(s).

See regex in use here
^([^(]+?)\s*\(\s*([^)]*?)\s*\)\s*(.*)
^ Assert position at the start of the line
([^(]+?) Capture any character except ( one or more times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
\( Match ( literally
\s* Match any number of whitespace characters
([^)]*?) Capture any character except ) zero or more times, but as few as possible, into capture group 2
\s* Match any number of whitespace characters
\) Match ) literally
\s* Match any number of whitespace characters
(.*) Capture the rest of the line into capture group 3
Results in:
["Lawrence N.", "Larry", "Solomon"]
["James", "Jim", "Alleman"]
["G.L.", "Pete", "Ray, CFP"]
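A quick way to sanity-check the breakdown above outside Access is a short Python sketch. Python's `re` module is an assumption here, standing in for VBScript.RegExp; this particular pattern only uses syntax that both engines share. The variable names are illustrative.

```python
import re

# Same pattern as above: lazy groups so surrounding whitespace is left outside the captures.
PATTERN = re.compile(r"^([^(]+?)\s*\(\s*([^)]*?)\s*\)\s*(.*)")

names = [
    "Lawrence N. (Larry) Solomon",
    "James ( Jim ) Alleman",
    "G.L. (Pete) Ray, CFP",
]

results = [list(PATTERN.match(name).groups()) for name in names]
for row in results:
    print(row)
```

Each printed row holds the three submatches in order: proper first name, nickname, and the rest of the line.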

You should be able to avoid using Regex, if that's your thing.
I assumed from the test data that the nickname is always contained within "()". Other than that the code should be straightforward, I hope. If not, feel free to ask a question. A test routine called Test is included too.
Public Function ParseString(InputString As String) As String
On Error GoTo ErrorHandler:
Dim OutputArray As Variant
Const DoubleQuote As String = """"
'Quick exit, if () aren't found, then just return original text
If InStr(1, InputString, "(") = 0 Or InStr(1, InputString, ")") = 0 Then
ParseString = InputString
Exit Function
End If
'Replace the ) with (, then do a split
OutputArray = Split(Replace(InputString, ")", "("), "(")
'Check the array bounds and output accordingly
'If there can only ever be 3 (0 - 2) elements, then you can change this if statement
If UBound(OutputArray) = 2 Then
ParseString = DoubleQuote & Trim$(OutputArray(0)) & DoubleQuote & ", " & _
DoubleQuote & Trim$(OutputArray(1)) & DoubleQuote & ", " & _
DoubleQuote & Trim$(OutputArray(2)) & DoubleQuote
ElseIf UBound(OutputArray) = 1 Then
ParseString = DoubleQuote & Trim$(OutputArray(0)) & DoubleQuote & ", " & _
DoubleQuote & Trim$(OutputArray(1)) & DoubleQuote
Else
ParseString = DoubleQuote & Trim$(OutputArray(LBound(OutputArray))) & DoubleQuote
End If
CleanExit:
Exit Function
ErrorHandler:
ParseString = InputString
Resume CleanExit
End Function
Sub Test()
Dim Arr() As Variant: Arr = Array("Lawrence N. (Larry) Solomon", "James ( Jim ) Alleman", "G.L. (Pete) Ray, CFP")
For i = LBound(Arr) To UBound(Arr)
Debug.Print ParseString(CStr(Arr(i)))
Next
End Sub
Results
"Lawrence N.", "Larry", "Solomon"
"James", "Jim", "Alleman"
"G.L.", "Pete", "Ray, CFP"

Regex: \s*[()]\s*
Details:
\s* matches any whitespace character between zero and unlimited times
[()] Match a single character present in the list ( or )
VBA code:
Dim str As String
str = "Lawrence N. (Larry) Solomon"
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.Pattern = "\s*[()]\s*"
re.MultiLine = True
Dim arr As Variant
arr = Strings.Split(re.Replace(str, vbNullChar), vbNullChar)
For Each Match In arr
Debug.Print (Match)
Next
Output:
Lawrence N.
Larry
Solomon
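The same "split on a parenthesis plus surrounding whitespace" idea can be sketched in Python, where `re.split` does the replace-then-split step in one call. This is only an illustration with made-up variable names, not a translation of the VBA above.

```python
import re

# A parenthesis together with any whitespace around it acts as the separator.
SEPARATOR = re.compile(r"\s*[()]\s*")

parts = SEPARATOR.split("Lawrence N. (Larry) Solomon")
print(parts)

# When nothing follows the closing parenthesis, the last element is an empty string.
parts_without_tail = SEPARATOR.split("Lawrence N. (Larry)")
print(parts_without_tail)
```

The trailing empty string in the second case is worth checking for before using the third element.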

Related

How to match escaped group signs {&date:dd.\{mm\}.yyyy} but not {&date:dd.{mm}.yyyy} with vba and regex

I'm trying to create a pattern for finding placeholders within a string to be able to replace them with variables later. I'm stuck on a problem to find all these placeholders within a string according to my requirement.
I already found this post, but it only helped a little:
Regex match ; but not \;
Placeholders will look like this
{&var} --> Variable stored in a dictionary --> dict("var")
{$prop} --> Property of a class cls.prop read by CallByName and PropGet
{#const} --> Some constant values by name from a function
Generally I have this pattern and it works well
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = "\{([#\$&])([\w\.]+)\}"
For example I have this string:
"Value of foo is '{&var}' and bar is '{$prop}'"
I get 2 matches as expected
(&)(var)
($)(prop)
I also want to add a formatting part like in .Net to this expression.
String.Format("This is a date: {0:dd.mm.yyyy}", DateTime.Now());
// This is a date: 05.07.2019
String.Format("This is a date, too: {0:dd.(mm).yyyy}", DateTime.Now());
// This is a date, too: 05.(07).2019
I extended the RegEx to get that optional formatting string
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = "\{([#\$&])([\w\.]+):{0,1}([^\}]*)\}"
RegEx.Execute("Value of foo is '{&var:DD.MM.YYYY}' and bar is '{$prop}'")
I get 2 matches as expected
(&)(var)(DD.MM.YYYY)
($)(prop)()
At this point I noticed I have to take care of escaped "{" and "}", because I may want to have some brackets within the formatted result.
This does not work properly, because my pattern stops after "...{MM"
RegEx.Execute("Value of foo is '{&var:DD.{MM}.YYYY}' and bar is '{$prop}'")
It would be okay to add escape signs to the text before checking the regex:
RegEx.Execute("Value of foo is '{&var:DD.\{MM\}.YYYY}' and bar is '{$prop}'")
But how can I correctly add the negative lookbehind?
And second: how can this also work for variables that should not be resolved, even though they have the correct syntax, because the outer bracket is escaped?
RegEx.Execute("This should not match '\{&var:DD.\{MM\}.YYYY\}' but this one '{&var:DD.\{MM\}.YYYY}'")
I hope my question is not confusing and someone can help me
Update 05.07.19 at 12:50
After the great help of #wiktor-stribiżew the result is complete.
As requested, I provide some example code:
Sub testRegEx()
Debug.Print FillVariablesInText(Nothing, "Date\\\\{$var01:DD.\{MM\}.YYYY}\\\\ Var:\{$nomatch\}{$var02} Double: {#const}{$var01} rest of string")
End Sub
Function FillVariablesInText(ByRef dict As Dictionary, ByVal txt As String) As String
Const c_varPattern As String = "(?:(?:^|[^\\\n])(?:\\{2})*)\{([#&\$])([\w.]+)(?:\:([^}\\]*(?:\\.[^\}\\]*)*))?(?=\})"
Dim part As String
Dim snippets As New Collection
Dim allMatches, m
Dim i As Long, j As Long, x As Long, n As Long
' Create a RegEx object and execute pattern
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = c_varPattern
RegEx.MultiLine = True
RegEx.Global = True
Set allMatches = RegEx.Execute(txt)
' Start at position 1 of txt
j = 1
n = 0
For Each m In allMatches
n = n + 1
Debug.Print "(" & n & "):" & m.value
Debug.Print " [0] = " & m.SubMatches(0) ' Type [&$#]
Debug.Print " [1] = " & m.SubMatches(1) ' Name
Debug.Print " [2] = " & m.SubMatches(2) ' Format
part = "{" & m.SubMatches(0)
' Get offset for pre-match-string
x = 1 ' Index to position is at least +1
Do While Mid(m.value, x, 2) <> part
x = x + 1
Loop
' Position in txt
i = m.FirstIndex + x
' Anything to add to result?
If i <> j Then
snippets.Add Mid(txt, j, i - j)
End If
' Next start position (not index!), + 1 for the positive lookahead "}"
j = m.FirstIndex + m.Length + 2
' Here comes a function to get the actual value
' e.g.: snippets.Add dict(m.SubMatches(1))
' or : snippets.Add Format(dict(m.SubMatches(1)), m.SubMatches(2))
snippets.Add "<<" & m.SubMatches(0) & m.SubMatches(1) & ">>"
Next m
' Any text at the end?
If j < Len(txt) Then
snippets.Add Mid(txt, j)
End If
' Join snippets
For i = 1 To snippets.Count
FillVariablesInText = FillVariablesInText & snippets(i)
Next
End Function
The function testRegEx gives me this result and debug print:
(1):e\\\\{$var01:DD.\{MM\}.YYYY
[0] = $
[1] = var01
[2] = DD.\{MM\}.YYYY
(2):}{$var02
[0] = $
[1] = var02
[2] =
(3): {#const
[0] = #
[1] = const
[2] =
(4):}{$var01
[0] = $
[1] = var01
[2] =
Date\\\\<<$var01>>\\\\ Var:\{$nomatch\}<<$var02>> Double: <<#const>><<$var01>> rest of string
You may use
((?:^|[^\\])(?:\\{2})*)\{([#$&])([\w.]+)(?::([^}\\]*(?:\\.[^}\\]*)*))?}
To make sure the consecutive matches are found, too, turn the last } into a lookahead, and when extracting matches just append it to the result, or if you need the indices increment the match length by 1:
((?:^|[^\\])(?:\\{2})*)\{([#$&])([\w.]+)(?::([^}\\]*(?:\\.[^}\\]*)*))?(?=})
^^^^^
See the regex demo and regex demo #2.
Details
((?:^|[^\\])(?:\\{2})*) - Group 1 (makes sure the { that comes next is not escaped): start of string or any char but \ followed with 0 or more double backslashes
\{ - a { char
([#$&]) - Group 2: any of the three chars
([\w.]+) - Group 3: 1 or more word or dot chars
(?::([^}\\]*(?:\\.[^}\\]*)*))? - an optional sequence of : and then Group 4:
[^}\\]* - 0 or more chars other than } and \
(?:\\.[^}\\]*)* - zero or more repetitions of a \-escaped char and then 0 or more chars other than } and \
} - a } char
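As a rough check of the details above, here is a small Python sketch using the lookahead variant of the pattern. Python's `re` is an assumption standing in for VBScript.RegExp, and the names `PLACEHOLDER`, `text` and `found` are illustrative. The test string is the one from the question where the first placeholder is escaped and the second is not.

```python
import re

PLACEHOLDER = re.compile(
    r"((?:^|[^\\])(?:\\{2})*)"        # group 1: start or a non-backslash char, then backslash pairs
    r"\{([#$&])([\w.]+)"              # opening brace, group 2: type char, group 3: name
    r"(?::([^}\\]*(?:\\.[^}\\]*)*))?"  # optional ":" and group 4: format, escapes allowed
    r"(?=})"                          # closing brace checked but not consumed
)

text = r"This should not match '\{&var:DD.\{MM\}.YYYY\}' but this one '{&var:DD.\{MM\}.YYYY}'"

# Keep only type, name and format for each match.
found = [(m.group(2), m.group(3), m.group(4)) for m in PLACEHOLDER.finditer(text)]
print(found)
```

Only the second placeholder is reported, and the escaped braces stay inside the format group, so they still need to be unescaped when the format string is applied.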
Welcome to the site! If you need to only match balanced escapes, you will need something more powerful. If not --- I haven't tested this, but you could try replacing [^\}]* with [^\{\}]|\\\{|\\\}. That is, match non-braces and escaped brace sequences separately. You may need to change this depending on how you want to handle backslashes in your formatting string.

Replace '-' with space if the next character is a letter not a digit and remove when it is at the start

I have a list of strings, e.g.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' from a string when it is the first character and is followed by letters rather than digits. If there is a digit or letter before the '-' and letters after it, the '-' should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
import re
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and i get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
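Put together as a runnable sketch, the two one-liners above look like this. The extra non-ASCII sample ("20-árgs") is not from the question; it is added only to show where the ASCII and the any-letter variants differ.

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

# Hyphen followed by an ASCII letter becomes a space; a leading space is then stripped.
nlist = [re.sub(r"-(?=[a-zA-Z])", " ", s).lstrip() for s in slist]
print(nlist)

# [^\W\d_] means "a word character that is neither a digit nor an underscore", i.e. any letter.
ascii_only = re.sub(r"-(?=[a-zA-Z])", " ", "20-árgs").lstrip()
any_letter = re.sub(r"-(?=[^\W\d_])", " ", "20-árgs").lstrip()
print(ascii_only, "|", any_letter)
```

Because the lookahead does not consume the letter, only the hyphen itself is replaced.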
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option could be to do the replacement in two passes. First match the hyphen at the start when only letters follow:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match -, and capture one or more letters in group 2.
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo

RegExp other patterns not working

I continue trying to perform string format matching using RegExp in VBScript & VB6. I am now trying to match a short, single-line string formatted as:
1. Seven characters:
   a. Six alphanumeric plus one "-" OR
   b. Five alphanumeric plus two "-"
2. Three numbers
3. Two letters
4. Literal "65"
5. A two-digit hex number.
Examples include 123456-789LM65F2, 4EF789-012XY65A5, A2345--789AB65D0 & 23456--890JK65D0.
The RegExp pattern ([A-Z0-9\-]{12})([65][A-F0-9]{2}) lumps (1) - (3) together and finds these OK.
However, if I try to:
c) Break (3) out w/ pattern ([A-Z0-9\-]{10})([A-Z]{2})([65][A-F0-9]{2}),
d) Break out both (2) & (3) w/ pattern ([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}), or
e) Tighten up (1) with alternation pattern ([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
it refuses to find any of them.
What am I doing wrong? Following is a VBScript that runs and checks these.
' VB Script
Main()
Function Main() ' RegEx_Format_sample.vbs
'Uses two patterns, TestPttn for full format accuracy check & SplitPttn
'to separate the two desired pieces
Dim reSet, EtchTemp, arrSplit, sTemp
Dim sBoule, sSlice, idx, TestPttn, SplitPttn, arrMatch
Dim arrPttn(3), arrItems(3), idxItem, idxPttn, Msgtemp
Set reSet = New RegExp
' reSet.IgnoreCase = True ' Not using
' reSet.Global = True ' Not using
' load test case formats to check & split
arrItems(0) = "0,6 nums + 1 '-',123456-789LM65F2"
arrItems(1) = "1,6 chars + 1 '-',4EF789-012XY65A5"
arrItems(2) = "2,5 chars + 2 '-',A2345--789AB65D0"
arrItems(3) = "3,5 nums + 2 '-',23456--890JK65D0"
SplitPttn = "([A-Z0-9]{5,6})[-]{1,2}([A-Z0-9]{9})" ' split pattern has never failed to work
' load the patterns to try
arrPttn(0) = "([A-Z0-9\-]{12})([65][A-F0-9]{2})"
arrPttn(1) = "([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2})"
arrPttn(2) = "([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
arrPttn(3) = "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
For idxPttn = 0 To 3 ' select Test pattern
TestPttn = arrPttn(idxPttn)
TestPttn = TestPttn & "[%]" ' append % "ender" char
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
For idxItem = 0 To 3
reSet.Pattern = TestPttn ' set to Test pattern
sTemp = arrItems(idxItem )
arrSplit = Split(sTemp, ",") ' arrSplit is Split array
EtchTemp = arrSplit(2) & "%" ' append % "ender" char to Item sub (2) as the "phrase" under test
If reSet.Test(EtchTemp) = False Then
MsgBox("RegEx " & TestPttn & " false for " & EtchTemp & " as " & arrSplit(1) )
Else ' test OK; now switch to SplitPttn
reSet.Pattern = SplitPttn
Set arrMatch = reSet.Execute(EtchTemp) ' run Pttn as Exec this time
If arrMatch.Count > 0 then ' If test OK then Count s/b > 0
Msgtemp = ""
Msgtemp = "RegEx " & TestPttn & " TRUE for " & EtchTemp & " as " & arrSplit(1)
For idx = 0 To arrMatch.Item(0).Submatches.Count - 1
Msgtemp = Msgtemp & Chr(13) & Chr(10) & "Split segment " & idx & " as " & arrMatch.Item(0).submatches.Item(idx)
Next
MsgBox(Msgtemp)
End If ' Count OK
End If ' test OK
Next ' idxItem
Next ' idxPttn
End Function
Try this Regex:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Click for Demo
Explanation:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--) - matches either 6 Alphanumeric characters followed by a - or 5 Alphanumeric characters followed by a --
[0-9]{3} - matches 3 Digits
[A-Z]{2} - matches 2 Letters
65 - matches 65 literally
[0-9A-F]{2} - matches 2 HEX symbols
You can get some idea from the following code:
VBScript Code:
Option Explicit
Dim objReg, strTest
strTest = "123456-789LM65F2" 'Change the value as per your requirements. You can also store a list of values in an array and run the code in loop
set objReg = new RegExp
objReg.Global = True
objReg.IgnoreCase = True
objReg.Pattern = "(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}"
if objReg.test(strTest) then
msgbox strTest&" matches with the Pattern"
else
msgbox strTest&" does not match with the Pattern"
end if
set objReg = Nothing
Your patterns do not work because:
([A-Z0-9\-]{12})([65][A-F0-9]{2}) - matches 12 occurrences of either an AlphaNumeric character or - followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2}) - matches 10 occurrences of either an AlphaNumeric character or - followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches 7 occurrences of either an AlphaNumeric character or - followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches either 5 occurrences of an AlphaNumeric character followed by -- or 6 occurrences of an Alphanumeric followed by a -. This is then followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
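The key point in the descriptions above is that [65] is a character class matching a single character (6 or 5), whereas 65 matches the two characters in sequence. A small Python sketch can show the effect; Python's `re` is an assumption standing in for VBScript.RegExp, the sample string and the "%" ender come from the question's script, and the variable names are illustrative.

```python
import re

sample = "123456-789LM65F2%"  # test string with the "%" ender the script appends

# Pattern (d) from the question, with the character class [65].
original = re.compile(r"([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})%")
# Same pattern with the literal sequence 65 instead of the class.
literal = re.compile(r"([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})%")

original_match = original.search(sample)
literal_match = literal.search(sample)

print(original_match)          # the class version finds nothing
print(literal_match.groups())  # the literal version splits the string into its parts

# On its own, [65][A-F0-9]{2} covers three characters, so it cannot span "65F2".
print(re.fullmatch(r"[65][A-F0-9]{2}", "65F2"))
print(re.fullmatch(r"[65][A-F0-9]{2}", "6F2"))
```

This also hints at why the first pattern appeared to work: it is unanchored, so the 12-character group can start one character later and absorb the 6, leaving 5F2 for the class version to match.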
Try this pattern :
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Or, if the last part doesn't like the [A-F]
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9ABCDEF]{2}
All, tanx again for your help!!
trincot, everything in each arrItems() between the commas, incl the "plus", is merely part of a shorthand description of each item's characteristics, such as "5 characters plus 2 dashes".
Gurman, your pttn breakdowns were helpful, but, if I read it right, the addition of the ? prefix is a "Match zero or one occurrences" and this must match exactly one occurrence. Also, my 1st pattern (matches 12) actually DID work for all my test cases.
jNevill, & JMichelB your suggestions are very close to what I ended up with.
I was "over-classing". After some tinkering, I was able to get the Test Pttn to successfully recognize these test cases by taking the [65] out of the [] in my original Alternation pattern. That is I went from ([65]) to (65) and Zammo! it worked.
Orig pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
Wkg pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})
Oh, and I moved the
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
stmt up out of the For...Next loop. That helped w/ the splitting.
T-Bone

Why is this regexp slow when the input line is long and has many spaces?

VBScript's Trim function only trims spaces. Sometimes I want to trim TABs as well. For this I've been using this custom trimSpTab function that is based on a regular expression.
Today I ran into a performance problem. The input consisted of rather long lines (several 1000 chars).
As it turns out
- the function is slow only if the string is long AND contains many spaces
- the right-hand part of the regular expression is responsible for the poor performance
- the run time seems quadratic in the line length (O(n^2))
So why is this line trimmed fast
" aaa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx bbb " '10000 x's
and this one trimmed slowly
" aaa bbb " '10000 spaces
Both contain only 6 characters to be trimmed.
Can you propose a modification to my trimSpTab function?
Dim regex
Set regex = new regexp
' TEST 1 - executes in no time
' " aaa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX bbb "
t1 = Timer
character = "X"
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
' TEST 2 - executes in 1 second on my machine
' " aaa bbb "
t1 = Timer
character = " "
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
Sub trimTest (character)
sInput = " aaa " & String (10000, character) & " bbb "
trimmed = trimSpTab (sInput)
End Sub
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Global = True
regex.Pattern = "^[ \t]+|[ \t]+$" 'trim left+right
trimSpTab = regex.Replace (s, "")
End Function
I have tried this (with regex.Global = false) but to no avail
regex.Pattern = "^[ \t]+" 'trim left
s = regex.Replace (s, "")
regex.Pattern = "[ \t]+$" 'trim right
trimSpTab = regex.Replace (s, "")
UPDATE
I've come up with this alternative in the meantime. It processes a 100 million character string in less than a second.
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Pattern = "^[ \t]+"
s = strReverse (s)
s = regex.Replace (s, "")
s = strReverse (s)
s = regex.Replace (s, "")
trimSpTab = s
End Function
Solution
As mentioned in the question, your current solution is to reverse the string. However, this is not necessary, since .NET regex supports RightToLeft matching option. For the same regex, the engine will start matching from right to left instead of default behavior of matching from left to right.
Below is sample code in C#, which I hope you can adapt to VB solution (I don't know VB enough to write sample code):
input = new Regex("^[ \t]+").Replace(input, "", 1);
input = new Regex("[ \t]+$", RegexOptions.RightToLeft).Replace(input, "", 1);
Explanation
The long run time is due to the engine trying to match [ \t]+ indiscriminately in the middle of the string and failing whenever the sequence is not a trailing blank sequence.
The observation that the complexity is quadratic is correct.
We know that the regex engine starts matching from index 0. If there is a match, then the next attempt starts at the end of the last match. Otherwise, the next attempt starts at the (current index + 1). (Well, to simplify things, I don't mention the case where a zero-length match is found).
The following illustrates the farthest point reached by each attempt (some are matches, some are not) as the engine matches the regex ^[ \t]+|[ \t]+$. _ is used to denote a space (or tab character) for clarity.
_____ab_______________g________
^----
     ^
      ^
       ^--------------
        ^-------------
         ^------------
...
                     ^
                      ^
                       ^-------
When there is a long sequence of spaces & tabs in the middle of the string (which will not produce a match), the engine attempts matching at every index in the long sequence of spaces & tabs. As a result, the engine ends up going through O(k^2) characters on a non-matching sequence of spaces & tabs of length k.
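To make the quadratic count concrete, here is a small Python model of the behaviour described above. It is only a sketch of the scanning pattern, not the VBScript engine itself, and `scanned_characters` is a made-up helper: for every start index it counts how many blanks a naive matcher for [ \t]+$ walks over before either failing or reaching the end of the string.

```python
def scanned_characters(text, blanks=" \t"):
    r"""Model a naive left-to-right search for [ \t]+$ and count the blanks it inspects."""
    total = 0
    for start in range(len(text)):
        pos = start
        # Greedily consume blanks from this start index.
        while pos < len(text) and text[pos] in blanks:
            pos += 1
            total += 1
        # The attempt succeeds only if it consumed something and reached the end.
        if pos > start and pos == len(text):
            break
    return total


k = 100
blank_middle = "a" + " " * k + "b"   # long run of blanks that is not trailing
letter_middle = "a" + "x" * k + "b"  # same length, but no blanks to scan

print(scanned_characters(blank_middle))   # k + (k - 1) + ... + 1 = k * (k + 1) / 2
print(scanned_characters(letter_middle))  # nothing to scan
```

In this model a run of k blanks costs k(k+1)/2 steps, which is where the O(k^2) comes from, while the same number of non-blank characters costs nothing.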
Your evidence proves that VBScript's RegExp implementation does not optimize for the $ anchor: It spends time (backtracking?) for each of the spaces in the middle of your test string. Without doubt, that's a fact good to know.
If this causes you real world problems, you'll have to find/write a better (R)Trim function. I came up with:
Function trimString(s, p)
Dim l : l = Len(s)
If 0 = l Then
trimString = s
Exit Function
End If
Dim ps, pe
For ps = 1 To l
If 0 = Instr(p, Mid(s, ps, 1)) Then
Exit For
End If
Next
For pe = l To ps Step -1
If 0 = Instr(p, Mid(s, pe, 1)) Then
Exit For
End If
Next
trimString = Mid(s, ps, pe - ps + 1)
End Function
It surely needs testing and benchmarks for long heads or tails of white space, but I hope it gets you started.

Regex with Non-capturing Group

I am trying to understand Non-capturing groups in Regex.
If I have the following input:
He hit the ball. Then he ran. The crowd was cheering! How did he feel? I felt so energized!
If I want to extract the first word in each sentence, I was trying to use the match pattern:
^(\w+\b.*?)|[\.!\?]\s+(\w+)
That puts the desired output in the submatch.
Match $1
He He
. Then Then
. The The
! How How
? I I
But I was thinking that using non-capturing groups, I should be able to get them back in the match.
I tried:
^(?:\w+\b.*?)|(?:[\.!\?]\s+)(\w+)
and that yielded:
Match $1
He
. Then Then
. The The
! How How
? I I
and
^(?:\w+\b.*?)|(?:[.!\?]\s+)\w+
yielded:
Match
He
. Then
. The
! How
? I
What am I missing?
(I am testing my regex using RegExLib.com, but will then transfer it to VBA).
A simple example against string "foo":
(f)(o+)
Will yield $1 = 'f' and $2 = 'oo';
(?:f)(o+)
Here, $1 = 'oo' because you've explicitly said not to capture the first matching group. And there is no second matching group.
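The "foo" example can be checked with a couple of lines of Python (used here as a stand-in for the VBA engine; this group syntax is common to both). Note that the overall match is the same in both cases; only the numbering of the captured groups changes.

```python
import re

capturing = re.match(r"(f)(o+)", "foo")
non_capturing = re.match(r"(?:f)(o+)", "foo")

print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

So a non-capturing group still consumes text and still shows up in the full match; it just does not get a group number of its own.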
For your scenario, this feels about right:
(?:(\w+).*?[\.\?!] {2}?)
Note that the outermost group is a non-capturing group, while the inner group (the first word of the sentence) is capturing.
The following constructs a non-capturing group for the boundary condition, and captures the word after it with a capturing group.
(?:^|[.?!]\s*)(\w+)
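Applied to the sentence from the question, this pattern behaves as follows in a short Python sketch (Python's `re` standing in for the VBA engine, variable names illustrative). The full match still includes the punctuation and whitespace consumed by the non-capturing group, while the first submatch holds only the word.

```python
import re

text = ("He hit the ball. Then he ran. The crowd was cheering! "
        "How did he feel? I felt so energized!")

pattern = re.compile(r"(?:^|[.?!]\s*)(\w+)")

full_matches = [m.group(0) for m in pattern.finditer(text)]
first_words = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)
print(first_words)
```

In VBA terms, `first_words` corresponds to reading SubMatches(0) of each match rather than its Value.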
It's not clear from your question how you are applying the regex to the text, but your regular "pull out another until there are no more matches" loop should work.
This works and is simple:
([A-Z])\w*
VBA requires these flag settings:
Global = True 'Match all occurrences not just first
IgnoreCase = False 'First word of each sentence starts with a capital letter
Here's some additional hard-earned info: since your regex has at least one parenthesis set, you can use Submatches to pull out only the values in the parenthesis and ignore the rest - very useful. Here is the debug output of a function I use to get Submatches, run on your string:
theMatches.Count=5
Match='He'
Submatch Count=1
Submatch='H'
Match='Then'
Submatch Count=1
Submatch='T'
Match='The'
Submatch Count=1
Submatch='T'
Match='How'
Submatch Count=1
Submatch='H'
Match='I'
Submatch Count=1
Submatch='I'
T
Here's the call to my function that returned the above:
sText = "He hit the ball. Then he ran. The crowd was cheering! How did he feel? I felt so energized!"
sRegEx = "([A-Z])\w*"
Debug.Print ExecuteRegexCapture(sText, sRegEx, 2, 0) '3rd match, 1st Submatch
And here's the function:
'Returns Submatch specified by the passed zero-based indices:
'iMatch is which match you want,
'iSubmatch is the index within the match of the parenthesis
'containing the desired results.
Function ExecuteRegexCapture(sStringToSearch, sRegEx, iMatch, iSubmatch)
Dim oRegex As Object
Set oRegex = New RegExp
oRegex.Pattern = sRegEx
oRegex.Global = True 'True = find all matches, not just first
oRegex.IgnoreCase = False
oRegex.Multiline = True 'True = ^ and $ also match at line breaks, not just at the start/end of the whole string
bDebug = True
ExecuteRegexCapture = ""
Set theMatches = oRegex.Execute(sStringToSearch)
If bDebug Then Debug.Print "theMatches.Count=" & theMatches.Count
For i = 0 To theMatches.Count - 1
If bDebug Then Debug.Print "Match='" & theMatches(i) & "'"
If bDebug Then Debug.Print " Submatch Count=" & theMatches(i).SubMatches.Count
For j = 0 To theMatches(i).SubMatches.Count - 1
If bDebug Then Debug.Print " Submatch='" & theMatches(i).SubMatches(j) & "'"
Next j
Next i
If bDebug Then Debug.Print ""
If iMatch < theMatches.Count Then
If iSubmatch < theMatches(iMatch).SubMatches.Count Then
ExecuteRegexCapture = theMatches(iMatch).SubMatches(iSubmatch)
End If
End If
End Function