Tokenise mathematical (infix) expression in VBA - regex

I need to tokenize a mathematical expression using VBA. I have a working solution but am looking for a more efficient way of doing it (possibly RegExp).
My current solution:
Function TokeniseTheString(str As String) As String()
Dim Operators() As String
' Array of Operators:
Operators = Split("+,-,/,*,^,<=,>=,<,>,=", ",")
' add special characters around all "(", ")" and ","
str = Replace(str, "(", Chr(1) & "(" & Chr(1))
str = Replace(str, ")", Chr(1) & ")" & Chr(1))
str = Replace(str, ",", Chr(1) & "," & Chr(1))
Dim i As Long
' add special characters around all operators
For i = LBound(Operators) To UBound(Operators)
str = Replace(str, Operators(i), Chr(1) & Operators(i) & Chr(1))
Next i
' for <= and >=, there will now be two special characters between them instead of being one token
' to change < = back to <=, for example
For i = LBound(Operators) To UBound(Operators)
If Len(Operators(i)) = 2 Then
str = Replace(str, Left(Operators(i), 1) & Chr(1) & Chr(1) & Right(Operators(i), 1), Operators(i))
End If
Next i
' if there was a "(", ")", "," or operator next to each other, there will be two special characters next to each other
Do While InStr(str, Chr(1) & Chr(1)) > 0
str = Replace(str, Chr(1) & Chr(1), Chr(1))
Loop
' Remove special character at the end of the string:
If Right(str, 1) = Chr(1) Then str = Left(str, Len(str) - 1)
TokeniseTheString = Split(str, Chr(1))
End Function
Test using this string IF(TestValue>=0,TestValue,-TestValue) gives me the desired solution.
Sub test()
Dim TokenArray() As String
TokenArray = TokeniseTheString("IF(TestValue>=0,TestValue,-TestValue)")
End Sub
I have never seen regular expressions before and tried to implement this into VBA. The problem I am having is that the RegExp object in VBA doesn't allow positive lookbehind.
I will appreciate any more efficient solution than mine above.

As suggested by #Florent B, the following function gives the same results using RegExp:
Function TokenRegex(str As String) As String()
Dim objRegEx As New RegExp
Dim strPattern As String
strPattern = "(""(?:""""|[^""])*""|[^\s()+\-\/*^<>=,]+|<=|>=|\S)\s*"
With objRegEx
.Global = True
.MultiLine = False
.IgnoreCase = True
.Pattern = strPattern
End With
str = objRegEx.Replace(str, "$1" & ChrW(-1))
If Right(str, 1) = ChrW(-1) Then str = Left(str, Len(str) - 1)
TokenRegex = Split(str, ChrW(-1))
End Function

Related

Select text strings with multiple formatting tags within

Context:
VB.NET application using htmlagility pack to handle html document.
Issue:
In a html document, I'd like to prefixe all the strings starting with # and ending with space by an url whatever formatting tags are used within.
So #sth would became http://www.anything.tld/sth
For instance:
Before:
<p>#string1</p> blablabla
<p><strong>#stri</strong>ng2</p> bliblibli
After:
<p>#string1 blablabla</p>
<p><strong>#stri</strong>ng2 bliblibli</p>
I guess i can achieve this with html agility pack but how to select the entire text string without its formatting ?
Or should i use a simple regex replace routine?
Here's my solution. I'm sure it would make some experienced developpers bleed from every hole but it actually works.
The htmlcode is in strCorpusHtmlContent
Dim matchsHashtag As MatchCollection
Dim matchHashtag As Match
Dim captureHashtag As Capture
Dim strHashtagFormatted As String
Dim strRegexPatternHashtag As String = "#([\s]*)(\w*)"
matchsHashtag = Regex.Matches(strCorpusHtmlContent, strRegexPatternHashtag)
For Each matchHashtag In matchsHashtag
For Each captureHashtag In matchHashtag.Captures
Dim strHashtagToFormat As String
Dim strHashtagValueToFormat As String
' Test if the hashtag is followed by a tag
If Mid(strCorpusHtmlContent, captureHashtag.Index + captureHashtag.Length + 1, 1) = "<" Then
strHashtagValueToFormat = captureHashtag.Value
Dim intStartPosition As Integer = captureHashtag.Index + captureHashtag.Length + 1
Dim intSpaceCharPostion As Integer = intStartPosition
Dim nextChar As Char
Dim blnInATag As Boolean = True
Do Until (nextChar = " " Or nextChar = vbCr Or nextChar = vbLf Or nextChar = vbCrLf) And blnInATag = False
nextChar = CChar(Mid(strCorpusHtmlContent, intSpaceCharPostion + 1, 1))
If nextChar = "<" Then
blnInATag = True
ElseIf nextChar = ">" Then
blnInATag = False
End If
If blnInATag = False And nextChar <> ">" And nextChar <> " " Then
strHashtagValueToFormat &= nextChar
End If
intSpaceCharPostion += 1
Loop
strHashtagToFormat = Mid(strCorpusHtmlContent, captureHashtag.Index + 1, intSpaceCharPostion - captureHashtag.Length)
Else
strHashtagToFormat = captureHashtag.Value
End If
strHashtagFormatted = "" & strHashtagToFormat & ""
strCorpusHtmlContent = Regex.Replace(strCorpusHtmlContent, strHashtagToFormat, strHashtagFormatted)
Next
Next
Before:
<p>#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltifortmat</span> to convert</p>
After:
<p>#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span> to convert</p>

Break String into individual elements and test for type of Character - NUM - LETTER - SPECIAL - Excel VBA

I need to figure out how I can test each character in the string to see if it is a number/letter/special character.
My question is, how can I break a string and test each individual character to see if the character is a number/letter/special character
Eg:
var = 1S#
Result1 = Num
Result2 = Alpha
Result3 = Special
If you mean
escaping user input that is to be treated as a literal string within a
regular expression—that would otherwise be mistaken for a special
character.
Then you can replace it with given regular expression:
/[.*+?^${}()|[\]\\]/g
So I got it to work by combining a few different posts on SO. This code breaks the string in an array and then checks each one for num/alpha/special and has a special case for *.
Split string into array of characters?
Regex Expression to check if there are any special characters in string like(!,#<#,$,%<^< etc)
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
-
Sub test()
'''Special Character Section'''
Dim special_charArr() As String
Dim special_char As String
special_char = "!,#,#,$,%,^,&,*,+,/,\,;,:"
special_charArr() = Split(special_char, ",")
'''Special Character Section'''
'''Alpha Section'''
Dim regexp As Object
Set regexp = CreateObject("vbscript.regexp")
Dim strPattern As String
strPattern = "([a-z])"
With regexp
.ignoreCase = True
.Pattern = strPattern
End With
'''Alpha Section'''
Dim buff() As String
my_string = "t3s!*"
ReDim buff(Len(my_string) - 1)
For i = 1 To Len(my_string)
buff(i - 1) = Mid$(my_string, i, 1)
char = buff(i - 1)
If IsNumeric(char) = True Then
MsgBox char & " = Number"
End If
For Each Key In special_charArr
special = InStr(char, Key)
If special = 1 Then
If Key <> "*" Then
MsgBox char & " = Special NOT *"
Else
MsgBox char & " = *"
End If
End If
Next
If regexp.test(char) Then
MsgBox char & " = Alpha"
End If
Next
End Sub

Excel VBA RegEx that extracts numbers from price values in range (has commas, $ and -)

I have a field data extracted from a database which represents a range of values, but it's coming in Excel as a String format $86,000 - $162,000.
I need to extract the minimum value and the maximum value from each cell, so I need to extract the numeric portion of it, and ignore the $, - and the ,.
I've attached an image of the data I have, and the values I want to extract from it.
This is the closest pattern I got with RegEx, but I'ts not what I'm looking for.
Pattern = (\d+)(?:\.(\d{1,2}))?
Can anyone assist ?
Just wondering why Regex?
Function GetParts(priceRange As String) As Double()
Dim arr() As String
Dim parts() As Double
If InStr(1, priceRange, "-") > 0 Then
arr = Split(priceRange, "-")
ReDim parts(0 To UBound(arr))
Dim i As Long
For i = 0 To UBound(arr)
parts(i) = CDbl(Replace$(Replace$(Trim$(arr(i)), "$", ""), ",", ""))
Next i
End If
GetParts = parts
End Function
Sub test()
MsgBox GetParts("$14,000 - $1,234,567")(0) 'Minimum
End Sub
EDIT
Yet you could do this with regex to match the data string into the parts:
Function GetPartsRegEx(priceRange As String) As Variant
Dim arr() As Double
Dim pricePattern As String
pricePattern = "(\$?\d+[\,\.\d]*)"
'START EDIT
Static re As RegExp
If re Is Nothing Then
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = pricePattern & "\s*[\-]\s*" & pricePattern 'look for the pattern first
End If
Static nums As RegExp
If nums Is Nothing Then
Set nums = New RegExp
'to remove all non digits, except decimal point in case you have pennies
nums.Pattern = "[^0-9.]"
nums.Global = True
End If
'END EDIT
If re.test(priceRange) Then
ReDim arr(0 To 1) ' fill return array
arr(0) = CDbl(nums.Replace(re.Replace(priceRange, "$1"), ""))
arr(1) = CDbl(nums.Replace(re.Replace(priceRange, "$2"), ""))
Else
'do some error handling here
Exit Function
End If 'maybe throw error if no +ve test or
GetPartsRegEx = arr
End Function
Sub test()
MsgBox GetPartsRegEx("$1,005.45 - $1,234,567.88")(1)
End Sub
Here is quick Example Demo https://regex101.com/r/RTNlVF/1
Pattern "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
Option Explicit
Private Sub Example()
Dim RegExp As New RegExp
Dim Pattern As String
Dim CelValue As String
Dim rng As Range
Dim Cel As Range
Set rng = ActiveWorkbook.Sheets("Sheet1" _
).Range("A2", Range("A9999" _
).End(xlUp))
For Each Cel In rng
DoEvents
Pattern = "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
If Pattern <> "" Then
With RegExp
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = Pattern
End With
If RegExp.Test(Cel.Value) Then
' Debug.Print Cel.Value
Debug.Print RegExp.Replace(CStr(Cel), "$1")
Debug.Print RegExp.Replace(CStr(Cel), "$2")
End If
End If
Next
End Sub
Without a loop (but still no regex):
Sub Split()
With Columns("B:B")
.Replace What:="$", Replacement:=""
Application.CutCopyMode = False
.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, Other:=True, OtherChar:="-", FieldInfo:=Array(Array(1, 1), Array(2, 1))
End With
Columns("B:C").Insert Shift:=xlToRight
Columns("D:E").NumberFormat = "0"
Range("D1").FormulaR1C1 = "Min Value"
Range("E1").FormulaR1C1 = "Max Value"
With Range("D1:E1").Interior
.Pattern = xlSolid
.PatternColorIndex = xlAutomatic
.Color = 12611584
End With
With Range("D1:E1").Font
.ThemeColor = xlThemeColorDark1
.TintAndShade = 0
End With
End Sub
I made this function:
Hope it helps.
Code:
Function ExtractNumber(ByVal TextInput As String, _
Optional ByVal Position As Byte = 0, _
Optional ByVal Delimiter As String = "-") As Variant
' You can use this function in a subprocess that
' writes the values in the cells you want, or
' you can use it directly in the ouput cells
' Variables
Dim RemoveItems(2) As String
Dim Aux As Variant
' The variable RemoveItems is an array
' containing the characters you want to remove
RemoveItems(0) = "."
RemoveItems(1) = ","
RemoveItems(2) = " "
' STEP 1 - The variable Aux will store the text
' given as input
Aux = TextInput
' STEP 2 - Characters stored in the variable
' RemoveItems will be removed from Aux
For i = 0 To UBound(RemoveItems)
Aux = Replace(Aux, RemoveItems(i), "")
Next i
' STEP 3 - Once Aux is "clean", it will be
' transformed into an array containing the
' values separated by the delimiter
' As you can see at the function's header,
' Delimiter default value is "-". You can change
' it depending on the situation
Aux = Split(Aux, Delimiter)
' STEP 4 - The result of this function will be
' a numeric value. So, if the value of the
' selected position in Aux is not numeric it will
' remove the first character assuming it is a
' currency symbol.
' If something fails in the process the function
' will return "ERROR", so you can know you may
' verify the inputs or adjust this code for
' your needs.
On Error GoTo ErrHndl
If Not IsNumeric(Aux(Position)) Then
ExtractNumber = CLng(Mid(Aux(Position), 2))
Else
ExtractNumber = CLng(Aux(Position))
End If
Exit Function
ErrHndl:
ExtractNumber = "ERROR"
End Function
You can even do this with just worksheet formulas. Under certain circumstances, Excel will ignore the $ and ,. The double unary converts the returned string to a numeric value.
First Value: =--LEFT(A1,FIND("-",A1)-1)
Second Value: =--MID(A1,FIND("-",A1)+1,99)

VBA Find a string that has range of value in it with Regular Expression and replace with each value in that range

First of all, sorry for the long title. I just don't know how to put it succinctly. I am trying to do this in VBA as normal Excel will not cut it.
Basically, I have a column. Each cells may contain data in the format of something like
flat 10-14;Flat 18-19;unit 7-9;flat A-D;ABC;DEF;
What I need is to find the string that has "-" in it, and attempt to replace it with anything in between. so the above code will become
Flat 10, Flat 11; Flat 12, Flat 14;Flat 18, Flat 19;Unit 7, Unit 8, Unit 9;Flat A, Flat B, Flat C; ABC;DEF;
With the help of this article on RegExpression, I have managed to work out how to expand the bits of data with number, which I will post the code below. However, I don't know a good way to expand the data with the letter. i.e from Flat A-C to Flat A, Flat B, Flat C
My code is below, please feel free to give any pointers if you think it can be more efficient. I am very much an amateur at this. Thank you in advance.
Sub CallRegEx()
Dim r As Match
Dim mcolResults As MatchCollection
Dim strInput As String, strPattern As String
Dim test As String, StrOutput As String, prefix As String
Dim startno As Long, endno As Long
Dim myrange As Range
strPattern = "(Flat|Unit) [0-9]+-+[0-9]+"
With Worksheets("Sheet1")
lrow = .Cells(Rows.Count, 9).End(xlUp).Row
For Each x In .Range("A2:A" & lrow)
strInput = Range("A" & x.Row).Value
Set mcolResults = RegEx(strInput, strPattern, True, , True)
If Not mcolResults Is Nothing Then
StrOutput = strInput
For Each r In mcolResults
startno = Mid(r, (InStr(r, "-") - 2), 2)
endno = Mid(r, (InStr(r, "-") + 1))
prefix = Mid(r, 1, 4)
test = ""
For i = startno To endno - 1
test = test & prefix & " " & i & ","
Next i
test = test & prefix & " " & endno
'this is because I don't want the comma at the end of the last value
StrOutput = Replace(StrOutput, r, test)
Debug.Print r ' remove in production
Next r
End If
.Range("D" & x.Row).Value = StrOutput
Next x
End With
End Sub
This function below is to support the Sub above
Function RegEx(strInput As String, strPattern As String, _
Optional GlobalSearch As Boolean, Optional MultiLine As Boolean, _
Optional IgnoreCase As Boolean) As MatchCollection
Dim mcolResults As MatchCollection
Dim objRegEx As New RegExp
If strPattern <> vbNullString Then
With objRegEx
.Global = GlobalSearch
.MultiLine = MultiLine
.IgnoreCase = IgnoreCase
.Pattern = strPattern
End With
If objRegEx.test(strInput) Then
Set mcolResults = objRegEx.Execute(strInput)
Set RegEx = mcolResults
End If
End If
End Function
Letters have character codes that are ordinal (A < B < C ...) & these can be accessed via asc()/chr$() - here is one way to do it:
inputStr = "flat 10-14;Flat 18-19;unit 7-9;flat A-D;ABC;DEF;flat 6;flat T"
Dim re As RegExp: Set re = New RegExp
re.Pattern = "(flat|unit)\s+((\d+)-(\d+)|([A-Z])-([A-Z]))"
re.Global = True
re.IgnoreCase = True
Dim m As MatchCollection
Dim start As Variant, fin As Variant
Dim tokens() As String
Dim i As Long, j As Long
Dim isDigit As Boolean
tokens = Split(inputStr, ";")
For i = 0 To UBound(tokens) '// loop over tokens
Set m = re.Execute(tokens(i))
If (m.Count) Then
With m.Item(0)
start = .SubMatches(2) '// first match number/letter
isDigit = Not IsEmpty(start) '// is letter or number?
If (isDigit) Then '// number
fin = .SubMatches(3)
Else '// letter captured as char code
start = Asc(.SubMatches(4))
fin = Asc(.SubMatches(5))
End If
tokens(i) = ""
'// loop over items
For j = start To fin
tokens(i) = tokens(i) & .SubMatches(0) & " " & IIf(isDigit, j, Chr$(j)) & ";"
Next
End With
ElseIf i <> UBound(tokens) Then tokens(i) = tokens(i) & ";"
End If
Next
Debug.Print Join(tokens, "")
flat 10;flat 11;flat 12;flat 13;flat 14;Flat 18;Flat 19;unit 7;unit 8;unit 9;flat A;flat B;flat C;flat D;ABC;DEF;flat 6;flat T

Parse a String in Excel Vba

I have a macro that send an XMLHTTP request to a server and it gets as response a plain text string, not a JSON format string or other standard formats (at least for what I know).
I would like to parse the output string in order to access the data in an structured approach in the same fashion as the parseJson subroutine in this link
My problem is I am not good with regular expressions and I am not able to modify the routine for my needs.
The string that I need to parse has the following structure:
The string is a single line
Each single parameter is defined by its parameter name the equal simbol, its value and ending with; "NID=3;" or "SID=Test;"
Parameter can be collected in "structures" starts and end with the symbol | and they are identified with their name followed by ; such as |STEST;NID=3;SID=Test;|
A structure can contain also other structures
An example of a output string is the following
|KC;|AD;PE=5;PF=3;|CD;PE=5;HP=test;|CD;PE=3;HP=abc;|
In this case there is a macro structure KC which contains a structure AD. The structure AD is composed by the parameters PE, PF and 2 structures CD. And finaly the structures CD have the parameters PE and HP
So I would like to parse the string to obtain an Object/Dictionary that reflects this structure, can you help me?
Adds after the first answers
Hi all, thank you for your help, but I think I should make more clear the output that I would like to get.
For the example string that I have, I would like to have an object with the following structure:
<KC>
<AD>
<PE>5</PE>
<PF>3</PF>
<CD>
<PE>5</PE>
<HP>test</HP>
</CD>
<CD>
<PE>3</PE>
<HP>abc</HP>
</CD>
</AD>
</KC>
So I started to wrote a possible working code base on some hint from #Nvj answer and the answer in this link
Option Explicit
Option Base 1
Sub Test()
Dim strContent As String
Dim strState As String
Dim varOutput As Variant
strContent = "|KC;|AD;PE=5;PF=3;|CD;PE=5;HP=test;|CD;PE=3;HP=abc;|"
Call ParseString(strContent, varOutput, strState)
End Sub
Sub ParseString(ByVal strContent As String, varOutput As Variant, strState As String)
' strContent - source string
' varOutput - created object or array to be returned as result
' strState - Object|Array|Error depending on processing to be returned as state
Dim objTokens As Object
Dim lngTokenId As Long
Dim objRegEx As Object
Dim bMatched As Boolean
Set objTokens = CreateObject("Scripting.Dictionary")
lngTokenId = 0
Set objRegEx = CreateObject("VBScript.RegExp")
With objRegEx
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = "\|[A-Z]{2};" 'Pattern for the name of structures
Tokenize objTokens, objRegEx, strContent, lngTokenId, bMatched, "str"
.Pattern = "[A-Z]{2}=[^\|=;]+;" 'Pattern for parameters name and values
Tokenize objTokens, objRegEx, strContent, lngTokenId, bMatched, "par"
End With
End Sub
Sub Tokenize(objTokens, objRegEx, strContent, lngTokenId, bMatched, strType)
Dim strKey As String
Dim strKeyPar As String
Dim strKeyVal As String
Dim strWork As String
Dim strPar As String
Dim strVal As String
Dim strLevel As String
Dim strRes As String
Dim lngCopyIndex As Long
Dim objMatch As Object
strRes = ""
lngCopyIndex = 1
With objRegEx
For Each objMatch In .Execute(strContent)
If strType = "str" Then
bMatched = True
With objMatch
strWork = Replace(.Value, "|", "")
strWork = Replace(strWork, ";", "")
strLevel = get_Level(strWork)
strKey = "<" & lngTokenId & strLevel & strType & ">"
objTokens(strKey) = strWork
strRes = strRes & Mid(strContent, lngCopyIndex, .FirstIndex - lngCopyIndex + 1) & strKey
lngCopyIndex = .FirstIndex + .Length + 1
End With
lngTokenId = lngTokenId + 1
ElseIf strType = "par" Then
strKeyPar = "<" & lngTokenId & "par>"
strKeyVal = "<" & lngTokenId & "val>"
strKey = strKeyPar & strKeyVal
bMatched = True
With objMatch
strWork = Replace(.Value, ";", "")
strPar = Split(strWork, "=")(0)
strVal = Split(strWork, "=")(1)
objTokens(strKeyPar) = strPar
objTokens(strKeyVal) = strVal
strRes = strRes & Mid(strContent, lngCopyIndex, .FirstIndex - lngCopyIndex + 1) & strKey
lngCopyIndex = .FirstIndex + .Length + 1
End With
lngTokenId = lngTokenId + 2
End If
Next
strContent = strRes & Mid(strContent, lngCopyIndex, Len(strContent) - lngCopyIndex + 1)
End With
End Sub
Function get_Level(strInput As String) As String
Select Case strInput
Case "KC"
get_Level = "L1"
Case "AD"
get_Level = "L2"
Case "CD"
get_Level = "L3"
Case Else
MsgBox ("Error")
End
End Select
End Function
This function creates a dictionary with an item for each structure name, parameter name and parameter value as shown in the figure
Thanks to the function get_Level the items associated to structures have a level that should help to preserve the original hierarchy of the data.
So what I am missing is a function to create an object that has the original structure of the input string. This is what the Retrieve function do in this answer link, but I do not know how to adapt it to my case
This looks like a simple nested delimited string. A couple of Split() functions will do the trick:
Option Explicit
Function parseString(str As String) As Collection
Dim a1() As String, i1 As Long, c1 As Collection
Dim a2() As String, i2 As Long, c2 As Collection
Dim a3() As String
a1 = Split(str, "|")
Set c1 = New Collection
For i1 = LBound(a1) To UBound(a1)
If a1(i1) <> "" Then
Set c2 = New Collection
a2 = Split(a1(i1), ";")
For i2 = LBound(a2) To UBound(a2)
If a2(i2) <> "" Then
a3 = Split(a2(i2), "=")
If UBound(a3) > 0 Then
c2.Add a3(1), a3(0)
ElseIf UBound(a3) = 0 Then
c2.Add a3(0)
End If
End If
Next i2
c1.Add c2
End If
Next i1
Set parseString = c1
End Function
Sub testParseString()
Dim c As Collection
Set c = parseString("|KC;|AD;PE=5;PF=3;|CD;PE=5;HP=test;|CD;PE=3;HP=abc;|")
Debug.Assert c(1)(1) = "KC"
Debug.Assert c(2)("PE") = "5"
Debug.Assert c(3)(1) = "CD"
Debug.Assert c(4)("HP") = "abc"
Debug.Assert c(4)(3) = "abc"
End Sub
Note that you can address values by both, index and key (if key existed in the input). If key was not provided you can only access the value by its index. You can also iterate collection recursively to get all the values in a tree structure.
Food for thought: since your structures may have repeated names (in your case "CD" structure happens twice) Collections / Dictionaries would find it problematic to store this elegantly (due to key collisions). Another good way to approach this is to create an XML structure with DOMDocument and use XPath to access its elements. See Program with DOM in Visual Basic
UPDATE: I've added XML example below as well. Have a look.
Here is another take on your string parsing issue using DOMDocument XML parser. You need to include Microsoft XML, v.6.0 in your VBA references.
Function parseStringToDom(str As String) As DOMDocument60
Dim a1() As String, i1 As Long
Dim a2() As String, i2 As Long
Dim a3() As String
Dim dom As DOMDocument60
Dim rt As IXMLDOMNode
Dim nd As IXMLDOMNode
Set dom = New DOMDocument60
dom.async = False
dom.validateOnParse = False
dom.resolveExternals = False
dom.preserveWhiteSpace = True
Set rt = dom.createElement("root")
dom.appendChild rt
a1 = Split(str, "|")
For i1 = LBound(a1) To UBound(a1)
If a1(i1) <> "" Then
a2 = Split(a1(i1), ";")
Set nd = dom.createElement(a2(0))
For i2 = LBound(a2) To UBound(a2)
If a2(i2) <> "" Then
a3 = Split(a2(i2), "=")
If UBound(a3) > 0 Then
nd.appendChild dom.createElement(a3(0))
nd.LastChild.Text = a3(1)
End If
End If
Next i2
rt.appendChild nd
End If
Next i1
Set parseStringToDom = dom
End Function
Sub testParseStringToDom()
Dim dom As DOMDocument60
Set dom = parseStringToDom("|KC;|AD;PE=5;PF=3;|CD;PE=5;HP=test;|CD;PE=3;HP=abc;|")
Debug.Assert Not dom.SelectSingleNode("/root/KC") Is Nothing
Debug.Assert dom.SelectSingleNode("/root/AD/PE").Text = "5"
Debug.Assert dom.SelectSingleNode("/root/CD[1]/HP").Text = "test"
Debug.Assert dom.SelectSingleNode("/root/CD[2]/HP").Text = "abc"
Debug.Print dom.XML
End Sub
As you can see this converts your text into an XML DOM document preserving all the structures and allowing for duplicates in naming. You can then use XPath to access any node or value. This can also be extended to have more nesting levels and further structures.
This is the XML document it creates behind the scenes:
<root>
<KC/>
<AD>
<PE>5</PE>
<PF>3</PF>
</AD>
<CD>
<PE>5</PE>
<HP>test</HP>
</CD>
<CD>
<PE>3</PE>
<HP>abc</HP>
</CD>
</root>
I've started to write a parser in VBA for the string structure specified by you, and it's not complete, but I'll post it anyways. Maybe you can pick up some ideas from it.
Sub ParseString()
Dim str As String
str = "|KC;|AD;PE=5;PF=3;|CD;PE=5;HP=test;|CD;PE=3;HP=abc;|"
' Declare an object dictionary
' Make a reference to Microsoft Scripting Runtime in order for this to work
Dim dict As New Dictionary
' If the bars are present in the first and last character of the string, replace them
str = Replace(str, "|", "", 1, 1)
If (Mid(str, Len(str), 1) = "|") Then
str = Mid(str, 1, Len(str) - 1)
End If
' Split the string by bars
Dim substring_array() As String
substring_array = Split(str, "|")
' Declare a regex object
' Check the reference to Microsoft VBScript Regular Expressions 5.5 in order for this to work
Dim regex As New RegExp
With regex
.Global = True
.IgnoreCase = True
.MultiLine = True
End With
' Object to store the regex matches
Dim matches As MatchCollection
Dim param_name_matches As MatchCollection
Dim parameter_value_matches As MatchCollection
' Define some regex patterns
pattern_for_structure_name = "^[^=;]+;"
pattern_for_parameters = "[^=;]+=[^=;]+;"
pattern_for_parameter_name = "[^=;]="
pattern_for_parameter_val = "[^=;];"
' Loop through the elements of the array
Dim i As Integer
For i = 0 To UBound(substring_array) - LBound(substring_array)
' Get the array element in a string
str1 = substring_array(i)
' Check if it contains a structure name
regex.Pattern = pattern_for_structure_name
Set matches = regex.Execute(str1)
If matches.Count = 0 Then
' This substring does not contain a structure name
' Check if it contains parameters
regex.Pattern = pattern_for_parameter
Set matches = regex.Execute(matches(0).Value)
If matches.Count = 0 Then
' There are no parameters as well as no structure name
' This means the string had || - invalid string
MsgBox ("Invalid string")
Else
' The string contains parameter names
' Add each parameter name to the dictionary
Dim my_match As match
For Each my_match In matches
' Get the name of the parameter
regex.Pattern = pattern_for_parameter_name
Set parameter_name_matches = regex.Execute(my_match.Value)
' Check if the above returned any matches
If parameter_name_matches.Count = 1 Then
' Remove = sign from the parameter name
parameter_name = Replace(parameter_name_matches(0).Value, "=", "")
' Get the value of the parameter
regex.Pattern = pattern_for_parameter_value
Set parameter_value_matches = regex.Execute(my_match.Value)
' Check if the above returned any matches
If parameter_value_matches.Count = 1 Then
' Get the value
parameter_value = Replace(parameter_value_matches(0).Value, ";", "")
' Add the parameter name and value as a key pair to the Dictionary object
dict.Item(parameter_name) = parameter_value
Else
' Number of matches is either 0 or greater than 1 - in both cases the string is invalid
MsgBox ("Invalid string")
End If
Else
' Parameter name did not match - invalid string
MsgBox ("Invalid string")
End If
Next
End If
ElseIf matches.Count = 1 Then
' This substring contains a single structure name
' Check if it has parameter names
Else
' This substring contains more than one structure name - the original string is invalid
MsgBox ("Invalid string")
End If
Next i
End Sub