How to split multiple UPPERCASE/delimiter/text using regex? (VBA) - regex

I've got 2k+ records with string followyng rule (LOCATION I UPPERCASE - text) x several times, like this:
I- TRZON - Fragmenty błony śluzowej trzonu żołądka w stanie przewleklego
powierzchownego (++) aktywnego (++) zapalenia. W barwieniu Warthin-Starry
nie stwierdza się bakterii odpowiadających Helicobacter pylori. II-ANTRUM +
KĄT - Fragmenty błony śluzowej części odźwiernikowej żołądka w stanie
przewlekłego głębokiego zapalenia (+++). W barwieniu Warthin-Starry nie
stwierdza się bakterii odpowiadajacych Helicobacter pylori.
Which I'm trying to split as follows using regex:
Location - I- TRZON
Text Fragmenty błony śluzowej trzonu żołądka w stanie przewleklego powierzchownego (++) aktywnego (++) zapalenia. W barwieniu Warthin-Starry nie stwierdza się bakterii odpowiadających Helicobacter pylori.
Location II- ANTRUM + KĄT
Text Fragmenty błony śluzowej części odźwiernikowej żołądka w stanie przewlekłego głębokiego zapalenia (+++). W barwieniu Warthin-Starry nie stwierdza się bakterii odpowiadajacych Helicobacter pylori.
So far I managed to do this by creating something like this
([A-ZŻŹĆĄŚĘŁÓŃ\s,+\-0-9]*)[\s]?-+?(.*[^A-ZŻŹĆĄŚĘŁÓŃ\s,+\-0-9]) ([A-ZŻŹĆĄŚĘŁÓŃ\s,+\-0-9]+)*[\s]?-+?(.*)
But obviously it cannot manage those strings, where one or three pairs of location and text are possible. The main problems I encountered are hyphens used in text (see - Warthin-Starry).
If I try something more elegant, like
([A-ZŻŹĆŃĄŚŁĘÓ]+[\s-\+,]*?)-(.*)
It obviously matches only the word before the first hyphen into the first group, and everything else into next.
To sum up: how to translate into regex something like: match, splitting into two groups: 1) UPPERCASE text with any other signs (no lowercase), followed by 2) text, that is as long as you encounter another UPPERCASE text.
I must admit that I'm fairly new to regex, but I searched for a few days and nothing seems to work universally (and it's only the beginning of extracting data from this string...)

I`m not sure how you can do this with RegEx, i have a hard time myself to get my head around that syntax.
However, I would probably just use the DATA/Text To Columns, split with hyphen, and just concatenate back together the extra splits caused by hyphens in text.
If is not just an one off processing, you can always use VBA as well, something like:
Sub TextToColumns()
Dim ws As Worksheet: Set ws = ActiveSheet
Dim lRow As Long, sndHyphen As Long, R As Long
lRow = ws.Cells(1, 1).End(xlDown).Row
For R = 1 To lRow 'Iterate through all rows containing this data
sndHyphen = InStr(InStr(ws.Cells(R, 1), "-") + 1, ws.Cells(R, 1), "-") 'Get the hyphens positions
ws.Cells(R, 2) = Left(ws.Cells(R, 1), sndHyphen - 2) 'Get the data before the second hyphen
ws.Cells(R, 3) = Mid(ws.Cells(R, 1), sndHyphen + 2) 'Get the data after the second hyphen
Next R
End Sub

Thank you for your input. I finally managed to do this using two subs:
Sub locfinder()
Dim myregexp As RegExp
Set myregexp = New RegExp
Dim myMatches As Variant
Dim myMatch As Variant
Dim str As String
Dim i, j As Integer
Dim endrow As Integer
Sheets("dane").Activate
endrow = LastRow
Dim rozp1, rozp2 As String
For i = 1 To endrow
str = Sheets("Dane").Cells(i, 10).Value
myregexp.Global = True
myregexp.Pattern = "([A-ZŻŹĆĄŚĘŁÓŃ]+[\s,+\-0-9]*[A-ZŻŹĆĄŚĘŁÓŃ]*[\s,+\-0-9]*[A-ZŻŹĆĄŚĘŁÓŃ]*[\s,+\-0-9]*|Trzon|Antrum)\s?-"
If Not str = "" Then
Set myMatches = myregexp.Execute(str)
j = 1
For Each myMatch In myMatches
If myMatch.Value <> "" Then
Sheets("Dane").Cells(i, j + 10).Value = Trim(myMatch.SubMatches(0))
j = j + 1
End If
Next
End If
Next i
End Sub
Then extracted diagnoses using
Sub rozpfinder()
Dim myregexp As RegExp
Set myregexp = New RegExp
Dim myMatches As Variant
Dim myMatch As Variant
Dim str As String
Dim i, j As Integer
Dim endrow As Integer
Sheets("dane").Activate
endrow = LastRow
Dim rozp, loc As Collection
Dim splitted() As String
Dim rozpoznanie, lokalizacja
Dim wb As Workbook
Dim ws As Worksheet
Set wb = ThisWorkbook
Set ws = wb.Worksheets("Dane")
For i = 1 To endrow
str = ws.Cells(i, 10).Value
Set loc = New Collection
Set rozp = New Collection
For j = 1 To 2
If ws.Cells(i, 10 + j) <> "" Then
loc.Add ws.Cells(i, 10 + j).Value
End If
Next j
For Each lokalizacja In loc
If lokalizacja <> "I" Then
str = Replace(str, lokalizacja, "xxx")
Else
lokalizacja = "I-"
str = Replace(str, lokalizacja, "xxx-")
End If
Next lokalizacja
splitted = split(str, "xxx")
For j = 0 To UBound(splitted)
If splitted(j) <> "" Then
myregexp.Pattern = "-[^\w]"
myMatch = myregexp.Replace(splitted(j), "")
rozp.Add (Trim(myMatch))
End If
Next j
j = 1
For Each rozpoznanie In rozp
ws.Cells(i, 12 + j).Value = rozpoznanie
j = j + 1
Next rozpoznanie
Next i
End Sub
While it wasn't 100% accurate, the number of records I need to correct is about 1%, so I guess it works :)

Related

VBA Find a string that has range of value in it with Regular Expression and replace with each value in that range

First of all, sorry for the long title. I just don't know how to put it succinctly. I am trying to do this in VBA as normal Excel will not cut it.
Basically, I have a column. Each cells may contain data in the format of something like
flat 10-14;Flat 18-19;unit 7-9;flat A-D;ABC;DEF;
What I need is to find the string that has "-" in it, and attempt to replace it with anything in between. so the above code will become
Flat 10, Flat 11; Flat 12, Flat 14;Flat 18, Flat 19;Unit 7, Unit 8, Unit 9;Flat A, Flat B, Flat C; ABC;DEF;
With the help of this article on RegExpression, I have managed to work out how to expand the bits of data with number, which I will post the code below. However, I don't know a good way to expand the data with the letter. i.e from Flat A-C to Flat A, Flat B, Flat C
My code is below, please feel free to give any pointers if you think it can be more efficient. I am very much an amateur at this. Thank you in advance.
Sub CallRegEx()
Dim r As Match
Dim mcolResults As MatchCollection
Dim strInput As String, strPattern As String
Dim test As String, StrOutput As String, prefix As String
Dim startno As Long, endno As Long
Dim myrange As Range
strPattern = "(Flat|Unit) [0-9]+-+[0-9]+"
With Worksheets("Sheet1")
lrow = .Cells(Rows.Count, 9).End(xlUp).Row
For Each x In .Range("A2:A" & lrow)
strInput = Range("A" & x.Row).Value
Set mcolResults = RegEx(strInput, strPattern, True, , True)
If Not mcolResults Is Nothing Then
StrOutput = strInput
For Each r In mcolResults
startno = Mid(r, (InStr(r, "-") - 2), 2)
endno = Mid(r, (InStr(r, "-") + 1))
prefix = Mid(r, 1, 4)
test = ""
For i = startno To endno - 1
test = test & prefix & " " & i & ","
Next i
test = test & prefix & " " & endno
'this is because I don't want the comma at the end of the last value
StrOutput = Replace(StrOutput, r, test)
Debug.Print r ' remove in production
Next r
End If
.Range("D" & x.Row).Value = StrOutput
Next x
End With
End Sub
This function below is to support the Sub above
Function RegEx(strInput As String, strPattern As String, _
Optional GlobalSearch As Boolean, Optional MultiLine As Boolean, _
Optional IgnoreCase As Boolean) As MatchCollection
Dim mcolResults As MatchCollection
Dim objRegEx As New RegExp
If strPattern <> vbNullString Then
With objRegEx
.Global = GlobalSearch
.MultiLine = MultiLine
.IgnoreCase = IgnoreCase
.Pattern = strPattern
End With
If objRegEx.test(strInput) Then
Set mcolResults = objRegEx.Execute(strInput)
Set RegEx = mcolResults
End If
End If
End Function
Letters have character codes that are ordinal (A < B < C ...) & these can be accessed via asc()/chr$() - here is one way to do it:
inputStr = "flat 10-14;Flat 18-19;unit 7-9;flat A-D;ABC;DEF;flat 6;flat T"
Dim re As RegExp: Set re = New RegExp
re.Pattern = "(flat|unit)\s+((\d+)-(\d+)|([A-Z])-([A-Z]))"
re.Global = True
re.IgnoreCase = True
Dim m As MatchCollection
Dim start As Variant, fin As Variant
Dim tokens() As String
Dim i As Long, j As Long
Dim isDigit As Boolean
tokens = Split(inputStr, ";")
For i = 0 To UBound(tokens) '// loop over tokens
Set m = re.Execute(tokens(i))
If (m.Count) Then
With m.Item(0)
start = .SubMatches(2) '// first match number/letter
isDigit = Not IsEmpty(start) '// is letter or number?
If (isDigit) Then '// number
fin = .SubMatches(3)
Else '// letter captured as char code
start = Asc(.SubMatches(4))
fin = Asc(.SubMatches(5))
End If
tokens(i) = ""
'// loop over items
For j = start To fin
tokens(i) = tokens(i) & .SubMatches(0) & " " & IIf(isDigit, j, Chr$(j)) & ";"
Next
End With
ElseIf i <> UBound(tokens) Then tokens(i) = tokens(i) & ";"
End If
Next
Debug.Print Join(tokens, "")
flat 10;flat 11;flat 12;flat 13;flat 14;Flat 18;Flat 19;unit 7;unit 8;unit 9;flat A;flat B;flat C;flat D;ABC;DEF;flat 6;flat T

Extract text from 2 strings from selected Outlook email

I have code to import email body data from Outlook to Excel. I only need Name, ID, code from the email.
I have done everything except to extract the ID from a fixed sentence:
cn=SVCLMCH,OU=Users,OU=CX,DC=dm001,DC=corp,DC=dcsa,DC=com
The id is SVCLMCH in this case, that means I need to extract the text between "cn=" and ",OU=Users".
Sub import_code()
Dim O As Outlook.Application
Set O = New Outlook.Application
Dim ONS As Outlook.Namespace
Set ONS = O.GetNamespace("MAPI")
Dim OMAIL As Outlook.MailItem
Set OMAIL = Nothing
Dim ws As Object
Set ws = ThisWorkbook.Worksheets("Import code from Outlook")
Dim rcount As Long
Dim vText As Variant
Dim sText As String
Dim i As Long
If O.ActiveExplorer.Selection.Count = 0 Then
msgbox "No Items selected!", vbCritical, "Error"
End If
On Error Resume Next
'Process each selected record
rcount = ws.UsedRange.Rows.Count
For Each OMAIL In O.ActiveExplorer.Selection
sText = OMAIL.Body
vText = Split(sText, Chr(13))
'Find the next empty line of the worksheet
rcount = rcount + 1
'Check each line of text in the message body
For i = UBound(vText) To 0 Step -1
If InStr(1, vText(i), "Password Generated and set for:") > 0 Then
vItem = Split(vText(i), Chr(58))
ws.Range("A" & rcount) = Trim(vItem(1))
End If
If InStr(1, vText(i), "cn=") > 0 Then
vItem = Split(vText(i), Chr(58))
ws.Range("b" & rcount) = Trim(vItem(1))
End If
If InStr(1, vText(i), "Password:") > 0 Then
vItem = Split(vText(i), Chr(58))
ws.Range("c" & rcount) = Trim(vItem(1))
End If
Next i
Next OMAIL
End Sub
The trick here is to use the Split() function
Dim Arr() As String
Dim j As Integer
Dim k As Integer
Dim strvar As String
Dim strval As String
Dim strID As String
If InStr(1, vtext(i), "cn=") > 0 Then
' split the whole line in an array - "," beeing the value separator
Arr = Split(vtext(i), ",")
' loop through all array elements
For j = 0 To UBound(r) - 1
' find the position of =
k = InStr(Arr(j), "=")
strvar = Left(Arr(j), k - 1) ' this is the variable name, ie "cn"
strval = Mid(Arr(j), k + 1) ' this is the value, ie "SVCLMCH"
' now do what you want with a specific variable
Select Case strvar
Case "cn"
strID = strval
Case Else
' do nothing
End Select
Next j
End If
you can use a helper function like this:
Function GetID(strng As String)
Dim el As Variant
For Each el In Split(strng, ",")
If InStr(1, el, "cn=") > 0 Then
GetID = Mid(el, InStr(1, el, "cn=") + 3)
Exit Function
End If
Next
End Function
and your main code would exploit it as:
If InStr(1, vText(i), "cn=") > 0 Then ws.Range("b" & rcount) = GetID(CStr(vText(i)))
Use Regular Expression extract the ID from the sentence
Example Pattern = "(?<=cn=)([^.]+)(?=\,OU=Users)"
https://regex101.com/r/67u84s/2
Code Example
Option Explicit
Private Sub Examplea()
Dim Matches As Variant
Dim RegEx As Object
Set RegEx = CreateObject("VbScript.RegExp")
Dim olApp As Object
Set olApp = CreateObject("Outlook.Application")
Dim Item As Outlook.MailItem
Set Item = olApp.ActiveExplorer.Selection.Item(1)
Dim Pattern As String
Pattern = "(?<=cn=)([^.]+)(?=\,OU=Users)"
With RegEx
.Global = False
.Pattern = Pattern
.IgnoreCase = True
Set Matches = .Execute(Item.Body)
End With
If Matches.Count > 0 Then
Debug.Print Matches(0).SubMatches(0)
With ThisWorkbook.Sheets("Sheet1")
.Range("A1").Value = Trim(Matches(0).SubMatches(0))
End With
End If
End Sub

Replace all characters in a String, unless they are within double quotes

I am sadly unfamiliar with regular expressions since I'm not a programmer, but I would guess this problem is easily solvable using regex (I am definitely open to other suggestions, though)
I want to use the split function to split the value of a cell and spread it out over multiple cells. The delimiter is a comma. The problem though is that some users use commas in comments for example, which the Split function uses to split the string mid-comment.
for example a cell containing the value:
0001,"name","address","likes apples, oranges
and plums"
needs to be split into multiple cells saying 0001 "name" "address" and "likes apples, oranges and plums".
my code splits the comment as well, and I want it to ignore the comment or everything else withing double quotes. here is a sample:
Sub SplittingStrings()
Dim wb As Workbook
Dim ws As Worksheet
Dim strInput As String
Dim counter As Integer
Dim cell As Variant
Dim splitCount As Integer
Dim splitString() As String
Dim category As Variant
Set wb = ActiveWorkbook
Set ws = wb.ActiveSheet
counter = 1
For Each cell In Range("A1", "A2000")
If cell.Value <> "" Then
strInput = cell.Value
splitCount = 2
splitString = Split(strInput, ",")
For Each category In splitString
Cells(counter, splitCount).Value = category
splitCount = splitCount + 1
Next category
End If
counter = counter + 1
Next cell
End Sub
how do I exclude stuff withing the double quotes from being considered by the split function?
Please give this a try and see if you get the desired output.
Tweak the variables if required.
Sub SplittingStringsUsingRegEx()
Dim lr As Long, c As Long
Dim Rng As Range, cell As Range
Dim RE, Match, Matches
Application.ScreenUpdating = False
lr = Cells(Rows.Count, 1).End(xlUp).Row
Set Rng = Range("A1:A" & lr)
Set RE = CreateObject("VBScript.RegExp")
With RE
.Global = True
.Pattern = "\d+|"".+?"""
End With
c = 2
For Each cell In Rng
If RE.test(cell.Value) Then
Set Matches = RE.Execute(cell.Value)
For Each Match In Matches
Cells(cell.Row, c) = Replace(Match, """", "")
c = c + 1
Next Match
End If
c = 2
Next cell
Application.ScreenUpdating = True
End Sub
Without Regex:
We need to "protect" commas that are encapsulated with double quotes:
Sub ProtectStuff()
Dim i As Long, N As Long, v As String, v2 As String
Dim ProtectMode As Boolean, DQ As String, rep As String
Dim CH As String, arr
DQ = """"
rep = Chr(1)
N = Cells(Rows.Count, "A").End(xlUp).Row
For i = 1 To N
v = Cells(i, "A").Value
If v <> "" Then
ProtectMode = False
v2 = ""
For j = 1 To Len(v)
CH = Mid(v, j, 1)
If CH = DQ Then ProtectMode = Not ProtectMode
If CH = "," And ProtectMode Then CH = rep
v2 = v2 & CH
Next j
End If
arr = Split(v2, ",")
j = 2
For Each a In arr
Cells(i, j) = Replace(a, rep, ",")
j = j + 1
Next a
Next i
End Sub
Text to Columns will do what you want, differently than the split function.

How to find words on both sides of (period)

in the following examples I need to get the words on either side on the period
I am using this regex
Dim myRegex As New Regex("[^\w]+")
Dim mymatch As String() = myRegex.Split(currentField)
where as currentfield = one of the following 3 samples
Contacts.Address2 as `Contact Address2`
Contacts.ContactID
CONCAT(Contacts.FirstName;;' ';;Contacts.LastName) as `Contact`
returns are as follows.
1-- Contacts, Address2, as, Contact and Address2 do not want the word as.
2-- Contacts and ContactID this is ok.
3-- CONCAT,Contacts,FirstName,Contacts,LastName,as and Contact.
3rd one this is too much do not want CONCAT,as or Contact. I only want the four words (ones before and after the period) to be returned Contacts, Firstname, Contacts, and Lastname
how can I write the regex to only get words before and after the period
I would consider matching vs. splitting the input:
For Each m As Match In Regex.Matches(input, "(\w+)\.(\w+)")
Console.WriteLine(
String.Join(", ",
m.Groups(1).Value,
m.Groups(2).Value
))
Next
This is an example, It's not clear what you expect to do with the returned results.
Ideone Demo
I think you are looking to only split inside round brackets and you are not interested in the word as. Thus, I suggest a 2 step approach:
Get the substring(s) in round brackets (\([^()]+\) regex)
If there are such substrings, split them with the split regex, if not, split the original string with split regex (\W+|\s*\bas\b\s* regex).
Sample code:
'Dim currentField As String = "Contacts.Address2 as `Contact Address2`"
Dim currentField As String = "CONCAT(Contacts.FirstName;;' ';;Contacts.LastName) as `Contact`"
'Dim currentField As String = "Contacts.ContactID"
Dim myRegex As New Regex("\([^()]+\)")
Dim splitRegex As New Regex("\W+|\s*\bas\b\s*")
Dim mymatch As MatchCollection = myRegex.Matches(currentField)
If mymatch.Count > 0 Then
For Each match As Match In mymatch
Dim mysubstrs As String() = splitRegex.Split(match.Value)
For Each substr As String In mysubstrs
If String.IsNullOrEmpty(substr) = False Then
Console.WriteLine(substr)
End If
Next
Next
Else
Dim mysubstrs As String() = splitRegex.Split(currentField)
For Each substr As String In mysubstrs
If String.IsNullOrEmpty(substr) = False Then
Console.WriteLine(substr)
End If
Next
End If
here is the final working routine, based on the accepted answer above
Public Sub Load_Field_List(FieldSTR As String, FieldType As String)
Dim t As New FileIO.TextFieldParser(New System.IO.StringReader(FieldSTR))
t.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited
t.Delimiters = New String() {","}
Dim currentRow As String()
Dim dr As DataRow
Dim ColListSTR As String = loadeddataview.Tables(0).Rows(0).Item("ColumnList")
Dim ColListSTRArr As String() = ColListSTR.Split(",")
While Not t.EndOfData
Try
currentRow = t.ReadFields()
Dim currentField As String 'field string
For Each currentField In currentRow
Dim startName As Integer
Dim endName As Integer
Dim name As String
dr = fieldDT.NewRow
Dim isValid As Boolean = False
If currentField = "" Then 'make sure current field has data
isValid = False
ElseIf (Regex.IsMatch(currentField, "(\w+)\.(\w+)")) = True Then 'make sure current field has xxxx.yyyy pattern
Dim m As Match = Regex.Match(currentField, "(\w+)\.(\w+)") 'sets m to the first xxxx.yyyy pattern
dr("Table") = m.Groups(1).Value 'sets table column to table name xxxx
dr("Column Name") = "`" & m.Groups(2).Value & "`" 'sets column name to column yyyy enclosed in ` `
If ColListSTRArr.Contains(m.Groups(2).Value) Then 'checks columnlist str to see if column visible
dr("Show") = "True"
Else
dr("Show") = "False"
End If
' this section overrides column name if it was set using AS `zzzzz` statement
startName = currentField.IndexOf("`")
endName = currentField.IndexOf("`", If(startName > 0, startName + 1, 0))
If (endName > startName) Then
Dim mylength As Integer = currentField.Length
name = currentField.Substring(startName, endName - startName + 1)
dr("Column Name") = name 'set override columname
dr("Field") = currentField.Substring(0, startName - 4) 'sets field minus the " as 'ZZZZZ" above
If ColListSTRArr.Contains(currentField.Substring(startName + 1, endName - startName - 1)) Then 'dup may be able to remove
dr("Show") = "True"
Else
dr("Show") = "False"
End If
Else
dr("Field") = currentField 'sets field if there was no " as `ZZZZZZ`" in string
End If
If FieldType = "Field" Then 'sets the column linking field
dr("Linking") = "No Linking"
Else
dr("Linking") = FieldType
End If
End If
' commit changes
fieldDT.Rows.Add(dr)
fieldDT.AcceptChanges()
DataGridView3.DataSource = fieldDT
DataGridView3.ClearSelection()
Next
Catch ex As Microsoft.VisualBasic.
FileIO.MalformedLineException
MsgBox("Line " & ex.Message &
"is not valid and will be skipped.")
End Try
End While
End Sub

How can I sort a VBA MatchCollection by value of SubMatches(n)?

I am relatively new to programming, and I wrote a Microsoft Word VBA macro that extracts a "parts list" from a patent description (the text of the active document), where each part reference in the list is identified in a rudimentary way as anything that looks like a numeric or all-caps alpha identifier of a part or feature preceded by up to four words in the same sentence.
What I have succeeded in doing so far is automatically opening a new Word document and inserting all unique part references line by line, in a format like
"10: providing a sewing machine 10," or "Q: of a heat flux Q."
I repeat the identifier at the beginning of each line so that the identifiers appear aligned at the left margin.
I also would like them to be sorted by identifier, which is m.SubMatches(2) of my regular expression MatchCollection m. First the numbers in numerical order, then the alpha references in alphabetical order would be nice.
Any suggestions on how to go about this? Here is a code snippet that sorts by the entire m.Value using a simple bubble-sort algorithm, without bothering to convert numeric identifiers to Long values:
Sub ExtractPartsList()
Dim re As VBScript_RegExp_55.RegExp
Set re = New VBScript_RegExp_55.RegExp
re.pattern = "((?:[A-Z]*[a-z]+[\s\n]+){0,3})(?=[A-Z]*[a-z]+[\s\n]+(?:\d+\b|[A-Z]+\b))" + _
"(\b[A-Z]*[a-z]+[\s\n]+)(\b\d+\b'*|[A-Z]+\b'*)" + _
"((?:\,[\s\n]+(?:\d+|[A-Z]+\b))+(?:\,?[\s\n]+and[\s\n+](?:\d+|[A-Z]+\b))?)?(?:[\s\n]+and[\s\n]+(?:\d+|[A-Z]+\b))?"
' m.Value is the whole matched string
' m.SubMatches(1) is the word immediately preceding the part number / alpha reference
' m.SubMatches(2) is the part number / alpha reference
re.IgnoreCase = False
re.Global = True
Dim txt As String
Dim bigString As String
bigString = ""
Dim allLongMatches As MatchCollection, m As Match
Dim partNameLastWord As String
Dim partReference As String
Dim partNameAndReference As String
Dim partsColl As New Collection
Dim partsList() As String
Dim i As Long
txt = ActiveDocument.Range.text
If re.Test(txt) Then
Set allLongMatches = re.Execute(txt)
Documents.Add DocumentType:=wdNewBlankDocument
For Each m In allLongMatches
Debug.Print m.Value, "Sbm 1 = " + m.SubMatches(1), "Sbm 2 = " + m.SubMatches(2), "Sbm 3 = " + m.SubMatches(3)
If InStr(bigString, LCase(m.SubMatches(1) + m.SubMatches(2))) = 0 _
And InStr(LCase(m.Value), "of claim " + m.SubMatches(2)) = 0 _
And InStr(LCase(m.SubMatches(2)), "fig") = 0 Then
bigString = bigString + LCase(m.Value)
partsColl.Add m.SubMatches(2) + ": " + m.Value
End If
Next m
End If
ReDim partsList(1 To partsColl.Count)
For i = 1 To partsColl.Count
partsList(i) = partsColl(i)
Next i
' BubbleSort (partsList())
' Instead of calling BubbleSort (partsList())
' I apparently still have to learn how to properly call methods I
' have written - for now I am just embedding it here:
Dim strTemp As String
' Dim i As Long
Dim j As Long
Dim lngMin As Long
Dim lngMax As Long
lngMin = LBound(partsList())
lngMax = UBound(partsList())
For i = lngMin To lngMax - 1
For j = i + 1 To lngMax
If partsList(i) > partsList(j) Then
strTemp = partsList(i)
partsList(i) = partsList(j)
partsList(j) = strTemp
End If
Next j
Next i
For i = 1 To partsColl.Count
Selection.InsertAfter (partsList(i))
Selection.InsertParagraphAfter
Next i
End Sub
Sub BubbleSort(arr)
Dim strTemp As String
Dim i As Long
Dim j As Long
Dim lngMin As Long
Dim lngMax As Long
lngMin = LBound(arr)
lngMax = UBound(arr)
For i = lngMin To lngMax - 1
For j = i + 1 To lngMax
If arr(i) > arr(j) Then
strTemp = arr(i)
arr(i) = arr(j)
arr(j) = strTemp
End If
Next j
Next i
End Sub
Sample input from U.S. Pat. No. 6,293,874:
"The second post 44 is positioned a sufficient distance from the first post 24 to permit the user to require the user to bend forward at the waist in a stooped position between the posts 24, 44. The user is thus positioned to predominantly present his or her buttocks B toward the plurality of rotating arms 56 that are detachably mounted on the second post 44 at a height generally level with the user's buttocks. The second post 44 is mountable on the surface of the platform 12 by a detachable collar 46 and connector bolts or screws."
Output (only works nicely because the numbers are the same length - I imagine it is really sorting "alphabetically," where "2" would come after "19," for example):
' 12: surface of the platform 12
' 24: from the first post 24
' 24: position between the posts 24, 44
' 44: The second post 44
' 46: by a detachable collar 46
' 56: plurality of rotating arms 56
' B: his or her buttocks B
I made a klunky solution that works by creating a separate array of the part identifiers by themselves and sorting the partsList() array in parallel with the identifier array id() As Long. Setting alpha identifiers to zero for now and letting them percolate to the top unsorted; there are not usually enough of them to worry about sorting alphabetically. I hesitate to mark this as an answer, as I would still like to see if someone will chime in with a more elegant/direct solution.
Sub ExtractPartsList()
Dim re As VBScript_RegExp_55.RegExp
Set re = New VBScript_RegExp_55.RegExp
re.pattern = "((?:[A-Z]*[a-z]+[\s\n]+){0,3})(?=[A-Z]*[a-z]+[\s\n]+(?:\d+\b|[A-Z]+\b))" + _
"(\b[A-Z]*[a-z]+[\s\n]+)(\b\d+\b'*|[A-Z]+\b'*)" + _
"((?:\,[\s\n]+(?:\d+|[A-Z]+\b))+(?:\,?[\s\n]+and[\s\n+](?:\d+|[A-Z]+\b))?)?(?:[\s\n]+and[\s\n]+(?:\d+|[A-Z]+\b))?"
' m.Value is the whole matched string
' m.SubMatches(1) is the word immediately preceding the part number / alpha reference
' m.SubMatches(2) is the part number / alpha reference
re.IgnoreCase = False
re.Global = True
Dim txt As String
Dim bigString As String
bigString = ""
Dim allLongMatches As MatchCollection, m As Match
Dim partNameLastWord As String
Dim partReference As String
Dim partNameAndReference As String
Dim partsColl As New Collection
Dim idColl As New Collection
' for now not using this variable:
' Dim referenceTextColl As New Collection
Dim partsList() As String
Dim id() As Long
' Dim referenceText() As String
' Dim partsListSorted() As String
Dim i As Long
txt = ActiveDocument.Range.text
If re.Test(txt) Then
Set allLongMatches = re.Execute(txt)
Documents.Add DocumentType:=wdNewBlankDocument
For Each m In allLongMatches
Debug.Print m.Value, "Sbm 1 = " + m.SubMatches(1), "Sbm 2 = " + m.SubMatches(2), "Sbm 3 = " + m.SubMatches(3)
If InStr(bigString, LCase(m.SubMatches(1) + m.SubMatches(2))) = 0 _
And InStr(LCase(m.Value), "of claim " + m.SubMatches(2)) = 0 _
And InStr(LCase(m.SubMatches(2)), "fig") = 0 Then
bigString = bigString + LCase(m.Value)
partsColl.Add m.SubMatches(2) + ": " + m.Value
idColl.Add (m.SubMatches(2))
' referenceTextColl.Add (m.Value)
' Selection.InsertAfter (m.SubMatches(2) + ": ")
' Selection.InsertAfter (m.Value)
' Selection.InsertParagraphAfter
End If
Next m
End If
ReDim partsList(1 To partsColl.Count)
ReDim id(1 To partsColl.Count)
' ReDim referenceText(1 To partsColl.Count)
For i = 1 To partsColl.Count
partsList(i) = partsColl(i)
id(i) = 0
' Deal with "prime" symbols #' and convert numeric identifiers to Long:
If IsNumeric(Replace(idColl(i), "'", "")) Then id(i) = CLng(Replace(idColl(i), "'", ""))
referenceText(i) = referenceTextColl(i)
Next i
'
' I apparently still have to learn how to properly call methods I
' have written - I am just embedding a bubble sort algorithm here instead:
Dim idTemp As String
Dim referenceTemp As String
Dim partsListLineTemp As String
' Dim i As Long
Dim j As Long
Dim lngMin As Long
Dim lngMax As Long
lngMin = LBound(partsList())
lngMax = UBound(partsList())
For i = lngMin To lngMax - 1
For j = i + 1 To lngMax
If id(i) > id(j) Then
idTemp = id(i)
partsList(i) = partsList(j)
id(i) = id(j)
partsList(j) = partsListLineTemp
id(j) = idTemp
End If
Next j
Next i
For i = 1 To partsColl.Count
Selection.InsertAfter (partsList(i))
Selection.InsertParagraphAfter
Next i
partsListLineTemp = partsList(i)
End Sub