VBScript Regex Fill Submatches even when not Required for the Match

I'm trying to replicate Google calendar's method of creating an appointment from a narrative. I want to enter 5pm Happy Hour for 1 hour and parse it into, ultimately, an Outlook AppointmentItem.
My problem, I think, is I have a large chunk of optional text at the end. And because it's optional, the regex passes but the submatch doesn't get populated because it isn't required for the match. I want it to populate because I want to use the submatches as my parsing engine.
I have a bunch of test cases in column A (working in Excel, then will move to Outlook), and my code lists out the submatches to the right. This is a representative sample of potential input
1. 5pmCST Happy Hour for 1 hour
2. 5pm CST Happy Hour for 1 hour
3. 5pm Happy Hour for 1 hour
4. 5 pm Happy Hour for 1 hour
5. 5 pm CST Happy Hour for 1 hour
6. 5 Happy Hour for 1 hour
7. 5 Happy Hour
8. 5pmCST Happy Hour
9. 5pm CST Happy Hour
10. 5pm Happy Hour
11. 5:00CST Happy Hour for 1 hour
12. 5:00 CST Happy Hour for 1 hour
Here's the code that runs the tests
Sub testest()
Dim RegEx As VBScript_RegExp_55.RegExp
Dim Matches As VBScript_RegExp_55.MatchCollection
Dim Match As VBScript_RegExp_55.Match
Dim rCell As Range
Dim SubMatch As Variant
Dim lCnt As Long
Dim aPattern(1 To 8) As String
Set RegEx = New VBScript_RegExp_55.RegExp
aPattern(1) = "(1?[0-9](:[0-5][0-9])?)" 'time
aPattern(2) = "( ?)" 'optional space
aPattern(3) = "([ap]m)?" 'optional ampm
aPattern(4) = "( ?)" 'optional space
aPattern(5) = "([ECMP][DS]T)?" 'optional time zone
aPattern(6) = "( ?)" 'optional space
aPattern(7) = "(.+?)" 'event description
aPattern(8) = "(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?" 'optional duration
RegEx.Pattern = Join(aPattern, vbNullString)
Debug.Print RegEx.Pattern
Sheet1.Range("C1").Resize(1000, 100).ClearContents
For Each rCell In Sheet1.Range("A1").CurrentRegion.Columns(1).Cells
lCnt = 0
rCell.Offset(0, 2).Value = RegEx.test(rCell.Text)
If RegEx.test(rCell.Text) Then
Set Matches = RegEx.Execute(rCell.Text)
For Each Match In Matches
For Each SubMatch In Match.SubMatches
lCnt = lCnt + 1
rCell.Offset(0, 2 + lCnt).Value = SubMatch
Next SubMatch
Next Match
End If
Next rCell
End Sub
The pattern is
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?
The submatches for #1 are
1 2 3 4 5 6 7
5 pm CST H
It stops matching at the "H" in Happy Hour because everything starting with the " for " is optional. If I remove the optional part, my pattern becomes
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?)
But #7-#10 don't pass because they don't have a duration. The submatches for #1 give me what I want, though:
1 2 3 4 5 6 7 8 9 10 11
5 pm CST Happy Hour for 1 hour
I want every possible submatch to fill even if VBScript doesn't need it in order to make the regex pass. I fear this is just how it works and that I'm trying to get regex to do my parsing work for me. I considered running it through increasingly restrictive patterns until it doesn't pass, then using the last passing pattern, but that seems kludgy.
Is it possible to get regex to fill those submatches?
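The behaviour described above can be reproduced outside VBScript. The following is a minimal sketch in Python's re module (an assumption: VBScript's engine is expected to treat a lazy group followed by an optional group the same way, but that is not verified here). It joins the same eight pieces and compares the unanchored pattern with one that has a trailing $; the variable names are illustrative only.

```python
import re

# The eight pieces from the question, joined exactly as in the VBA code.
parts = [
    r"(1?[0-9](:[0-5][0-9])?)",                            # time
    r"( ?)",                                               # optional space
    r"([ap]m)?",                                           # optional am/pm
    r"( ?)",                                               # optional space
    r"([ECMP][DS]T)?",                                     # optional time zone
    r"( ?)",                                               # optional space
    r"(.+?)",                                              # event description (lazy)
    r"(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?",    # optional duration
]
pattern = "".join(parts)
text = "5pmCST Happy Hour for 1 hour"

# Unanchored: the lazy description stops after one character, because the
# optional duration group is allowed to match nothing.
loose = re.search(pattern, text)
print(loose.group(8))                           # H

# With a trailing $, the lazy group has to keep growing until the rest of
# the line is accounted for, so the duration group takes part in the match.
anchored = re.search(pattern + "$", text)
print(anchored.group(8), anchored.group(11))    # Happy Hour 1
```

Group 8 is the description and group 11 the duration number, counting every opening parenthesis in the joined pattern.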

I have assumed each line is all the contents in a single cell. So I am able to use anchors.
I also don't think you need as many capturing groups as you have. I set up the regex with:
Group 1 Time
Group 2 am/pm
Group 3 Time Zone
Group 4 Description
Group 5 Hours (and fractions of hours)
With your data in A2:An, the following routine parses the data into the adjacent columns. It doesn't matter if a Submatch is "not filled". You could also fill elements in an array, or whatever else you want to do. If you want more submatches, you can always either add capturing groups for the optional spaces, or change the relevant non-capturing groups to capturing groups.
Also, since the "for" is optional, I chose to use a lookahead to determine the end of "description". Description will end with either a \s+for\s+ sequence; or with the "end of line". Since I have assumed there is only one entry, and one line, per cell, the multiline and global properties are irrelevant.
One has to include spaces before and after "for" so as to avoid problems if that sequence is included in Description.
Option Explicit
'set Reference to Microsoft VBScript Regular Expressions 5.5
Sub ParseAppt()
Dim R As Range, C As Range
Dim RE As RegExp, MC As MatchCollection
Dim I As Long
Set R = Range("a2", Cells(Rows.Count, "A").End(xlUp))
Set RE = New RegExp
With RE
.Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"
.IgnoreCase = True
For Each C In R
If .Test(C.Text) = True Then
Set MC = .Execute(C.Text)
For I = 0 To 4
C.Offset(0, I + 1) = MC(0).SubMatches(I)
Next I
End If
Next C
End With
End Sub
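For readers who want to try the five-group layout outside Excel, here is a rough Python sketch of the same idea (an assumption: Python's re is used only as a stand-in for the VBScript engine, and parse_appt is a hypothetical helper name). The time-zone class is written as [ECMP], as in the question's pattern. Groups that do not take part in the match come back as None rather than as empty strings.

```python
import re

APPT = re.compile(
    r"((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*"    # 1: time
    r"([ap]m)?\s*"                              # 2: am/pm
    r"([ECMP][DS]T)?\s*"                        # 3: time zone
    r"(.*?(?=\s+for\s+|$))"                     # 4: description, ends at " for " or end of text
    r"(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?",    # 5: hours (and fractions of hours)
    re.IGNORECASE,
)

def parse_appt(text):
    """Return the five groups as a tuple, or None if the text does not match."""
    m = APPT.match(text)
    return m.groups() if m else None

print(parse_appt("5pmCST Happy Hour for 1 hour"))
# ('5', 'pm', 'CST', 'Happy Hour', '1')
print(parse_appt("5 Happy Hour"))
# ('5', None, None, 'Happy Hour', None)
```

The lookahead is what keeps the lazy description from stopping early: it can only end where " for " begins or where the text ends.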

Related

VBA RegEx: How to find the first instance of a number after a specific string and ignore all other characters?

I'm having trouble writing a pattern that picks up what I want. I want to grab the first number that comes after the words 5 Months in my .txt file. If there are any other characters in between (A-Z, parentheses, $, %, etc.), I want to ignore them. VBA keeps raising the error "Invalid procedure call or argument".
Currently, I have code that looks like this:
Dim reg4 As Object: Set reg4 = CreateObject("vbscript.regexp")
reg4.Pattern = "5 Months\s*([\d+]\.[\d+])\s*"
Dim MCS As Object
Set MCS = reg4.Execute(myText)
Dim Months5 As String: Months5 = MCS(0).submatches(0) ' the error stems from this line
where mytext is a string that consists of content from a text file. My main problem is that this text file is not always in a standardized format, so when I want to extract the first number after "5 Months" it gives me that error.
The text file could look like:
EXAMPLE 1
5 Months
($) (%) (Months) (%) (%) (%) ($) (Months)
0.00 0.0000 0.000
OR
EXAMPLE 2
5 Months
0.00
0.000
0.000
In both cases, I would ideally be able to extract that first number "0.00" in its entire form, while ignoring any other characters such as (%) or ($) as shown in example 1.
I would like to ask if anyone has any suggestions on how to rewrite the pattern statement so it will be able to pick up the first numeric instance along with the numbers after its decimal point?
Many thanks in advance!
Your regex does not match the strings you showed. You can use
\b5 Months[\s\S]*?(\d+(?:\.\d+)?)
See the regex demo. Details:
\b - a word boundary
5 Months - a literal text
[\s\S]*? - any 0 or more chars, as few as possible
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits followed by an optional sequence of a . and one or more digits.
Test run in VBA:
Sub TestFn()
Dim reg4 As Object: Set reg4 = CreateObject("vbscript.regexp")
reg4.Pattern = "\b5 Months[\s\S]*?(\d+(?:\.\d+)?)"
Dim myText As String
myText = "5 Months" & vbCrLf & vbCrLf & "0.00"
Dim MCS As Object
Set MCS = reg4.Execute(myText)
Dim Months5 As String: Months5 = MCS(0).SubMatches(0)
Debug.Print (Months5)
End Sub
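The same pattern can be checked against both example layouts from the question. This is a small Python sketch (an assumption: Python's re stands in for the VBScript engine, and first_number_after_5_months is a hypothetical helper name). It also guards against the no-match case, since indexing into an empty result is a likely source of the original error.

```python
import re

FIRST_NUMBER = re.compile(r"\b5 Months[\s\S]*?(\d+(?:\.\d+)?)")

def first_number_after_5_months(text):
    """Return the first number after '5 Months' as a string, or None if absent."""
    m = FIRST_NUMBER.search(text)
    return m.group(1) if m else None

example1 = "5 Months\n($) (%) (Months) (%) (%) (%) ($) (Months)\n0.00 0.0000 0.000"
example2 = "5 Months\n0.00\n0.000\n0.000"

print(first_number_after_5_months(example1))           # 0.00
print(first_number_after_5_months(example2))           # 0.00
print(first_number_after_5_months("no such heading"))  # None
```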

Cannot seem to find the index where regex gets a match

I'm using vb.net and something that seems obvious in PHP does not work (for me) in vb.net:
Extract = "100011100000"
Dim HandReg3 As New Regex("(?:[0]*)(1{3,})(?:[0]*)")
If HandReg3.IsMatch(Extract) Then
Dim m() As String = HandReg3.Split(Extract)
For Each item As String In m
Console.WriteLine(item)
Next
Dim m1 As Match = Regex.Match(Extract, "(?:[0]*)(1{3,})(?:[0]*)")
Console.WriteLine(m1.Index)
Console.WriteLine(m1.Length)
End If
Which writes:
1
111
1 <-Index. Shouldn't it be 4 (0 based) or 5????
111
I've tried several combinations of groups (capturing and not capturing).
Of course it has to be something obvious and stupid, but after almost 8 hours I just can't get it right!
Tried:
([0-1]*)(1{3,})([0-1]*)
[0-1]*1{3,}[0-1]*
and every other combination that I could think of in between.
Thanks for your help
Emiliano
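One possible explanation, sketched in Python rather than VB.NET (an assumption: both engines report positions the same way for this pattern): the index of the match as a whole is where the overall match begins, and that includes the zeros consumed by the leading non-capturing group. The run of ones is capture group 1, which has its own position. In .NET the corresponding value should be available as m1.Groups(1).Index.

```python
import re

extract = "100011100000"
m = re.search(r"(?:0*)(1{3,})(?:0*)", extract)

# The whole match starts at the first zero of "000", right after the lone "1".
print(m.start())     # 1
# The captured run of ones starts three characters later.
print(m.start(1))    # 4
print(m.group(1))    # 111
```

If only the position of the ones matters, dropping the two zero groups and searching for 1{3,} alone would make the whole-match index and the group index coincide.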

Can I use regular expressions, the Like operator, and/or Instr() to find the index of a pattern within a larger string?

I have a large list (a table with one field) of non-standardized strings, imported from a poorly managed legacy database. I need to extract the single-digit number (surrounded by spaces) that occurs exactly once in each of those strings (though the strings have other multi-digit numbers sometimes too). For example, from the following string:
"Quality Assurance File System And Records Retention Johnson, R.M. 004 4 2999 ss/ds/free ReviewMo = Aug Effective 1/31/2012 FileOpen-?"
I would want to pull the number 4 (or 4's position in the string, i.e. 71)
I can use
WHERE rsLegacyList.F1 LIKE "* # *"
inside a select statement to find if each string has a lone digit, and thereby filter my list. But it doesn't tell me where the digit is so I can extract the digit itself (with mid() function) and start sorting the list. The goal is to create a second field with that digit by itself as a method of sorting the larger strings in the first field.
Is there a way to use Instr() along with regular expressions to find where a regular expression occurs within a larger string? Something like
intMarkerLocation = instr(rsLegacyList.F1, Like "* # *")
but that actually works?
I appreciate any suggestions, or workarounds that avoid the problem entirely.
@Lee Mac, I made a function RegExFindStringIndex as shown here:
Public Function RegExFindStringIndex(strToSearch As String, strPatternToMatch As String) As Integer
Dim regex As RegExp
Dim Matching As Match
Set regex = New RegExp
With regex
.MultiLine = False
.Global = True
.IgnoreCase = False
.Pattern = strPatternToMatch
Matching = .Execute(strToSearch)
RegExFindStringIndex = Matching.FirstIndex
End With
Set regex = Nothing
Set Matching = Nothing
End Function
But it gives me the error "Invalid use of property" at the line Matching = .Execute(strToSearch)
Using Regular Expressions
If you were to use Regular Expressions, you would need to define a VBA function to instantiate a RegExp object, set the pattern property to something like \s\d\s (whitespace-digit-whitespace) and then invoke the Execute method to obtain a match (or matches), each of which will provide an index of the pattern within the string. If you want to pursue this route, here are some existing examples for Excel, but the RegExp manipulation will be identical in MS Access.
Here is an example function demonstrating how to use the first result returned by the Execute method:
Public Function RegexInStr(strStr As String, strPat As String) As Integer
With New RegExp
.Multiline = False
.Global = True
.IgnoreCase = False
.Pattern = strPat
With .Execute(strStr)
If .Count > 0 Then RegexInStr = .Item(0).FirstIndex + 1
End With
End With
End Function
Note that the above uses early binding and so you will need to add a reference to the Microsoft VBScript Regular Expressions 5.5 library to your project.
Example Immediate Window evaluation:
?InStr("abc 1 123", " 1 ")
4
?RegexInStr("abc 1 123", "\s\w\s")
4
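To make the index arithmetic concrete, here is a rough Python equivalent of the function above (an assumption: regex_instr is a hypothetical name, and Python's re stands in for the VBScript RegExp object). Like InStr, it returns a 1-based position and 0 when nothing is found.

```python
import re

def regex_instr(text, pattern):
    """Return the 1-based position of the first match of pattern, or 0 if none."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

pos = regex_instr("abc 1 123", r"\s\d\s")
print(pos)                  # 4

# The match begins at the leading space, so the lone digit sits one
# character further along: 1-based position pos + 1, 0-based index pos.
print("abc 1 123"[pos])     # 1
```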
Using InStr
An alternative using the in-built instr function within a query might be the following inelegant (and probably very slow) query:
select
switch
(
instr(rsLegacyList.F1," 0 ")>0,instr(rsLegacyList.F1," 0 ")+1,
instr(rsLegacyList.F1," 1 ")>0,instr(rsLegacyList.F1," 1 ")+1,
instr(rsLegacyList.F1," 2 ")>0,instr(rsLegacyList.F1," 2 ")+1,
instr(rsLegacyList.F1," 3 ")>0,instr(rsLegacyList.F1," 3 ")+1,
instr(rsLegacyList.F1," 4 ")>0,instr(rsLegacyList.F1," 4 ")+1,
instr(rsLegacyList.F1," 5 ")>0,instr(rsLegacyList.F1," 5 ")+1,
instr(rsLegacyList.F1," 6 ")>0,instr(rsLegacyList.F1," 6 ")+1,
instr(rsLegacyList.F1," 7 ")>0,instr(rsLegacyList.F1," 7 ")+1,
instr(rsLegacyList.F1," 8 ")>0,instr(rsLegacyList.F1," 8 ")+1,
instr(rsLegacyList.F1," 9 ")>0,instr(rsLegacyList.F1," 9 ")+1,
true, null
) as intMarkerLocation
from
rsLegacyList
where
rsLegacyList.F1 like "* # *"
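How about the following? Note that InStr treats its search string literally, so the # here is not a digit wildcard as it is with Like; this only finds a literal " # ".
select
instr(rsLegacyList.F1, " # ") + 1 as position
from rsLegacyList
where rsLegacyList.F1 LIKE "* # *"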

Find overlapping matches and submatches using regular expressions in Python

I have a string of characters (a DNA sequence) with a regular expression I designed to filter out possible matches, (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG). Later I apply two filter conditions:
The sequence must be divisible by 3, len(match) % 3 == 0, and
There must be no stop codon (['AGA', 'AGG', 'TAA', 'TAG']) before the end of the string, not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]).
However, when I apply these filters, I get no matches at all (before the filters I get around 200 matches). The way I'm searching for substrings in the full sequence is by running re.findall(regex, genome_fasta, overlapped=True), because matches could be submatches of other matches.
Is there something about regular expressions that I'm misunderstanding? To my knowledge, after the filters I should still have matches.
If there's anything else I need to add please let me know! (I'm using the regex package for Python 3.4, not the standard re package, because the latter has no overlap support.)
EDIT 1:
Per comment: I'm looking for ORFs in the mitochondrial genome, but only considering those at least 150 nucleotides (characters) in length. Considering overlap is important because a match could include the first start codon in the string and the last stop codon in the string, but there could be another start codon in the middle. For example:
Given "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG", matches should be "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG" but also "ATAAAGCCATTTACCGTACATAGCACATTATAA", since both "TAG" and "TAA" are stop codons.
EDIT 2:
I totally misunderstood the comment. The full code for the method is:
typical_regex = r"%s[ATCGN]{%s,%s}%s" % (proc_start_codons, str(minimum_orf_length - 6), str(maximum_orf_length - 6), proc_stop_codons)
typical_fwd_matches = []
if re.search(typical_regex, genome_fasta, overlapped=True):
    for match in re.findall(typical_regex, genome_fasta, overlapped=True):
        if len(match) % 3 == 0:
            if not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]):
                typical_fwd_matches.append(match)
print(typical_fwd_matches)
The typical_fwd_matches array is empty and the regex is rendered as (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG) when printed to console/file.
I think you can do it this way.
Each subset is a progressively shorter piece of the previous match, so the regex is fairly straightforward to design.
The regex only matches multiples of 3 chars.
The beginning and middle are captured in group 1.
Group 1 (the last match minus its final 3 chars) then becomes the new text to search.
Regex explained:
( # (1 start), Whole match minus last 3 chars
(?: ATA | ATT ) # Starts with one of these 3 char sequence
(?: # Cluster group
[ATCGN]{3} # Any 3 char sequence consisting of these chars
)+ # End cluster, do 1 to many times
) # (1 end)
(?: AGA | AGG | TAA | TAG ) # Last 3 char sequence, one of these
Python code sample:
import re
r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
m = r.search(text)
while m:
    print("Found: " + m.group(0))
    text = m.group(1)
    m = r.search(text)
Output:
Found: ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG
Using this method, the subsets being tested are these:
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA
We can benchmark the time the regex takes to match these.
Regex1: ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160
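As a rough sketch of how the question's second filter (no stop codon before the end) could be applied to the matches this loop produces, the following wraps the loop in a function and checks each match codon by codon. The names nested_matches and has_internal_stop are hypothetical, the standard re module is used instead of the regex package, and the minimum-length condition from the question is left out.

```python
import re

STOP_CODONS = {"AGA", "AGG", "TAA", "TAG"}
ORF = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")

def nested_matches(text):
    """Collect each match, then search again inside it minus its last codon."""
    found = []
    m = ORF.search(text)
    while m:
        found.append(m.group(0))
        m = ORF.search(m.group(1))
    return found

def has_internal_stop(seq):
    """True if any in-frame codon before the final one is a stop codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    return any(codon in STOP_CODONS for codon in codons)

genome = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
candidates = nested_matches(genome)
orfs = [seq for seq in candidates if not has_internal_stop(seq)]
print(orfs)
# ['ATAAAGCCATTTACCGTACATAGCACATTATAA', 'ATTTACCGTACATAG']
```

The full 60-character match is dropped here because it contains an in-frame TAA before its final TAG.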

RegEx Pattern in VB for Multiline Matches

I already managed to get the information from a single line. I have a list of information like:
1 1 838028476391 4 23 36 P 1/820-01 *
2 1 838028476490 4 23 36 P 1/820-17 *
3 1 838028474271 4 23 36 P 1/820-21 *
4 1 838028476292 4 23 36 P 1/820-21 *
5 1 838028474263 4 23 36 P 1/820-23 *
6 1 838028473802 4 23 36 P 1/820-21 *
And I need the 12 digits numbers from every line. I tried this code:
Dim re As String
Dim re18 As String
re18 = "(\d{12})"
Dim r3 As New RegExp
r3.Pattern = re18
r3.IgnoreCase = True
r3.MultiLine = True
If r3.Test(Body) Then
Dim m3 As MatchCollection
Set m3 = r3.Execute(Body)
If m3.Item(0).SubMatches.Count > 0 Then
Dim number
For j = 1 To m3.Count
Set number = m3.Item(j - 1)
MsgBox ("Number: " + number)
Next
End If
End If
I only get the first match. Even if I debug the macro and view m3 in the Watch window, there is only 1 match. I also tried to use the quantifiers * or + after \d{12}.
How do I get this RegEx working?
I also have another question about RegEx: if I want to match something AFTER a special word, I would put the word at the beginning of the pattern, followed by the numbers or whatever I want. If I execute this regex, do I get the match INCLUDING the word I put at the beginning of my pattern?
Like: "BUS \d{12}" and I only want the numbers as a result but know that BUS stands before the numbers...
You need to use the Global option, not Multiline. Multiline changes the behavior of the anchors (^ and $) so they match the beginning and end of each line, not just the beginning and end of the whole text. Global is the option that tells it to find all the matches, not just the first one.
You probably don't need to use the SubMatches property either. Your regex has only the one capturing group, which captures the whole match. That means each match's SubMatches collection will contain only one item, SubMatches(0), and it will be exactly the same as that match's Value, e.g. m3.Item(0).Value. (Notice that the index of the first group is 0, not 1 as you would expect from working with other regex tools.)
Your second question is where the SubMatches property comes in. If you wanted to find every 12-digit number that follows the word "BUS" you would use a regex like this:
BUS\s*(\d{12})
...and you would retrieve the number from each match like this:
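r3.Pattern = "BUS\s*(\d{12})"
r3.Global = True ' find every match, not just the first
Set m3 = r3.Execute(Body)
For Each myMatch In m3
MsgBox "Number: " & myMatch.SubMatches(0) ' the digits only, without "BUS"
Next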
See this page for more info.
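The two points above (finding every match, and capturing only the part after a fixed word) can be illustrated with a short Python sketch. This is an assumption-laden analogy: Python has no Global property, and re.findall is used here as the rough counterpart of Execute with Global set to True. The sample text for the BUS case is made up for illustration.

```python
import re

body = (
    "1 1 838028476391 4 23 36 P 1/820-01 *\n"
    "2 1 838028476490 4 23 36 P 1/820-17 *\n"
)

# Every 12-digit number in the text, one per line here.
numbers = re.findall(r"\b\d{12}\b", body)
print(numbers)        # ['838028476391', '838028476490']

# With one capturing group, findall returns only the captured digits,
# not the "BUS" prefix that anchors the match.
bus_text = "BUS 838028476391 and BUS838028476490"
bus_numbers = re.findall(r"BUS\s*(\d{12})", bus_text)
print(bus_numbers)    # ['838028476391', '838028476490']
```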