Regex to match data from a webpage

This is probably a simple question for someone experienced with regex, but I'm having a little trouble. I'm looking to match lines of data like the one shown below:
SomeAlpha Text CrLf CrLf 15 CrLf CrLf 123 132 143 CrLf CrLf 12313 CrLf CrLf 12/123
Where "SomeAlpha Text" is just some text with spaces and potentially punctuation. The first number is between 1 and 30,000. Each number in the second set (123 132 143) is between 1 and 500,000. The next number is between 1 and 500,000. The final set is (1–30,000)/(1–30,000). This is the code I've put together so far:
Dim Pattern As String = "[.*]{1,100}" & vbCrLf & "" & vbCrLf & "[0-9]{1,4}" & vbCrLf & "" & vbCrLf & "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]" & vbCrLf & "" & vbCrLf & "[0-9]{1,6}" & vbCrLf & "" & vbCrLf & "[0-9]{1,5}/[0-9]{1,5}"
For Each match As Match In Regex.Matches(WebBrowser1.DocumentText.ToString, Pattern, RegexOptions.IgnoreCase)
RichTextBox1.AppendText(match.ToString & Chr(13) & Chr(13))
Next
And I'm currently getting 0 matches, even though I know there should be at least 1 match. Any advice on where my pattern is wrong would be great! Thanks.

"[.*]{1,100}" & vbCrLf & "" & vbCrLf & "[0-9]{1,4}" & vbCrLf & "" & vbCrLf & "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]" & vbCrLf & "" & vbCrLf & "[0-9]{1,6}" & vbCrLf & "" & vbCrLf & "[0-9]{1,5}/[0-9]{1,5}"
has quite a few problems:
Inside square brackets, . and * lose their special meaning, so "[.*]{1,100}" only matches 1 to 100 literal dots or asterisks, not arbitrary text. Replace it with ".{1,100}" or ".*"
You say the first number is between 1 and 30,000. "[0-9]{1,4}" allows at most 4 digits (0 to 9999). Replace it with "[0-9]{1,5}", which allows any number from 0 to 99999.
You accidentally put ] instead of } at three places in this part: "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]". Replace it with "[0-9]{1,6} [0-9]{1,6} [0-9]{1,6}"
With these three changes the pattern should match.
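For illustration, here is a minimal sketch of the corrected pattern with the three fixes applied. It is written in Python's re module rather than VB.NET so it can be run stand-alone; that choice, the sample string, and the names BLANK, PATTERN and sample are my own assumptions, not part of the original code.

```python
import re

CRLF = "\r\n"          # the two characters VB calls vbCrLf
BLANK = CRLF + CRLF    # each field in the data is followed by an empty line

# Corrected pattern: plain "." instead of "[.*]", five digits for the
# first number, and "}" instead of "]" in the three-number group.
PATTERN = re.compile(
    r".{1,100}" + BLANK
    + r"[0-9]{1,5}" + BLANK
    + r"[0-9]{1,6} [0-9]{1,6} [0-9]{1,6}" + BLANK
    + r"[0-9]{1,6}" + BLANK
    + r"[0-9]{1,5}/[0-9]{1,5}"
)

# Hypothetical sample record, shaped like the one in the question.
sample = BLANK.join(["SomeAlpha Text", "15", "123 132 143", "12313", "12/123"])

matches = PATTERN.findall(sample)

# The original "[.*]" class only matches literal dots and asterisks.
old_class_on_text = re.search(r"[.*]{1,100}", "SomeAlpha Text")
old_class_on_symbols = re.search(r"[.*]{1,100}", "a.*b")
```

Note that the digit counts only limit the length of each number. If the exact ranges (for example 1 to 30,000) matter, they would have to be checked after the match.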

Related

How to only match if line begins with [duplicate]

How can I check if a string starts (or ends) with vbCrLf?
I've tried with substring but it doesn't seem to work:
Dim s As String = ""
s &= vbCrLf & "Test"
If s.Substring(0, 1) = vbCrLf Then
MsgBox("Yes")
End If
Try this
StartsWith - checks the first part of a String.
Dim s As String = vbCrLf & "bla bla bla"
If s.StartsWith(vbCrLf) Then
MsgBox("Yes")
End If
EndsWith - checks the last characters of a String.
Dim s As String = "bla bla bla" & vbCrLf
If s.EndsWith(vbCrLf) Then
MsgBox("Yes")
End If
Left (and Right) can be used in place of .StartsWith/.EndsWith if you're having issues with these.
Dim s As String
s = vbCrLf & "Test"
If Left(s, Len(vbCrLf)) = vbCrLf Then
MsgBox("Yes")
End If
The simplest solution I could come up with for determining whether a string contains another string anywhere is to use the Split function. If the delimiter occurs in the string, Split returns more than one element:
Dim s As String
Dim parts As Variant
s = vbCrLf & "Test"
' Split returns a zero-based array; UBound > 0 means at least two parts
parts = Split(s, vbCrLf)
If UBound(parts) > 0 Then
MsgBox("Yes")
End If
I believe you are asking whether there is a way to see if the string begins with a line break: a line feed (vbLf) or a carriage return (vbCr). I use the Chr value to do that. Chr(13) is vbCr and Chr(10) is vbLf. You can use Asc to find the character code of the first character of a string. Something like this:
Dim s As String = vbCr & "bla bla bla"
If Asc(s) = 13 Then
MsgBox("s starts with a Carriage Return")
ElseIf Asc(s) = 10 Then
MsgBox("s starts with a Line Feed")
End If
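The reason the Substring(0, 1) attempt in the question cannot succeed is that vbCrLf is two characters (carriage return plus line feed), so a one-character slice never equals it. Here is a small sketch of the checks discussed in these answers, written in Python only so that it can be run stand-alone; the function names are hypothetical.

```python
CRLF = "\r\n"  # carriage return + line feed, i.e. what VB calls vbCrLf


def starts_with_crlf(s: str) -> bool:
    """True if s begins with the full two-character CRLF sequence."""
    return s.startswith(CRLF)


def ends_with_crlf(s: str) -> bool:
    """True if s ends with the full two-character CRLF sequence."""
    return s.endswith(CRLF)


def leading_break(s: str) -> str:
    """Name the line-break character s starts with, if any."""
    first = s[:1]  # empty string when s is empty, so no index error
    if first == "\r":
        return "CR"
    if first == "\n":
        return "LF"
    return "none"


s = CRLF + "Test"
one_char_slice_equals_crlf = (s[:1] == CRLF)  # mirrors Substring(0, 1) = vbCrLf
```

Comparing a slice of length len(CRLF) instead, as the Left(s, Len(vbCrLf)) answer does, gives the same result as startswith.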

Regex to replace word except in comments

How can I modify my regex so that it will ignore the comments in the pattern in a language that doesn't support lookbehind?
My regex pattern is:
\b{Word}\b(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
\b{Word}\b : Whole word, {Word} is replaced iteratively for the vocab list
(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) : Don't replace anything inside of quotes
My goal is to lint variables and words so that they always have the same case. However I do not want to lint any words in a comment. (The IDE sucks and there is no other option)
Comments in this language are prefixed by an apostrophe. Sample code follows
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.value = 1234 ' Set value
value = 123
Basically, given the word "Value", I want the linter to take the above code and update it to:
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.Value = 1234 ' Set value
Value = 123
That is, every occurrence of "Value" in code is updated, but nothing inside double quotes, inside a comment, or inside another word such as valueadded is touched.
I've tried several solutions but haven't been able to get it to work.
['.*] : Not preceded by an apostrophe
(?<!\s*') : Lookbehind, not preceded by an apostrophe and any spaces
(?<!\s*') : The second example seemed incorrect, and it won't work anyway as the language doesn't support lookbehind
Does anybody have any ideas how I can alter my pattern so that I don't edit commented variables?
VBA
Sub TestSO()
Dim Code As String
Dim Expected As String
Dim Actual As String
Dim Words As Variant
Code = "item = object.value ' Put item in value" & vbNewLine & _
"some.item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.value = 1234 ' Set value" & vbNewLine & _
"value = 123" & vbNewLine
Expected = "Item = object.Value ' Put item in value" & vbNewLine & _
"some.Item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.Value = 1234 ' Set value" & vbNewLine & _
"Value = 123" & vbNewLine
Words = Array("Item", "Value")
Actual = SOLint(Words, Code)
Debug.Print Actual = Expected
Debug.Print "CODE: " & vbNewLine & Code
Debug.Print "Actual: " & vbNewLine & Actual
Debug.Print "Expected: " & vbNewLine & Expected
End Sub
Public Function SOLint(ByVal Words As Variant, ByVal FileContents As String) As String
Const NotInQuotes As String = "(?=([^""\\]*(\\.|""([^""\\]*\\.)*[^""\\]*""))*[^""]*$)"
Dim RegExp As Object
Dim Regex As String
Dim Index As Variant
Set RegExp = CreateObject("VBScript.RegExp")
With RegExp
.Global = True
.IgnoreCase = True
End With
For Each Index In Words
Regex = "[('*)]\b" & Index & "\b" & NotInQuotes
RegExp.Pattern = Regex
FileContents = RegExp.Replace(FileContents, Index)
Next Index
SOLint = FileContents
End Function
As discussed in the comments above:
((?:\".*\")|(?:'.*))|\b(v)(alue)\b
There are 3 parts to this regex, combined with alternation.
A non-capturing group for text within double quotes, as we don't need to change that.
A non-capturing group for text starting with a single quote (a comment). Both of these sit inside capture group 1, so the replacement should start with $1 to put quoted text and comments back unchanged.
Finally, the word "value" is split into two parts, (v) and (alue), so that the replacement can use \U$2 to convert the v to upper case and \E$3 to emit the rest as is, where \U converts to upper case and \E turns the case conversion off.
\b \b - word boundaries ensure that only the whole word matches, not part of a longer word such as valueadded.
https://regex101.com/r/mD9JeR/8
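The alternation idea above (match strings and comments first and leave them alone, otherwise fix the word) can be sketched as follows. This is only an illustration under my own assumptions: it is written in Python rather than VBA, it uses a replacement callback instead of \U and \E, it assumes double-quoted strings never span lines, and the names lint_word, STRING and COMMENT are hypothetical.

```python
import re

STRING = r'"[^"\n]*"'   # a double-quoted string on one line
COMMENT = r"'.*"        # an apostrophe comment running to the end of the line


def lint_word(code: str, word: str) -> str:
    """Force every whole-word occurrence of `word` outside strings and
    comments to the exact casing given in `word`."""
    pattern = re.compile(
        f"({STRING}|{COMMENT})|\\b{re.escape(word)}\\b", re.IGNORECASE
    )

    def fix(match):
        protected = match.group(1)
        # Group 1 is set when a string or comment matched: emit it unchanged.
        return protected if protected is not None else word

    return pattern.sub(fix, code)


# The test data from the question's TestSO routine.
code = (
    "item = object.value ' Put item in value\n"
    "some.item <> some.otheritem\n"
    "' This is a comment, what is it's value?\n"
    "Object.value = 1234 ' Set value\n"
    "value = 123\n"
)
actual = code
for w in ["Item", "Value"]:
    actual = lint_word(actual, w)
```

Because the scan goes left to right, whichever of a quote or an apostrophe comes first on a line decides how the rest is treated, so an apostrophe inside a string or a quote inside a comment does not confuse it.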

Regular Expressions Finding A Set of Numbers

I am stumped on trying to figure out regular expressions so I thought I would ask the big dogs.
I have a string that can range from 1-4 sets as follows:
1234-abcd, baa74739, maps21342, 6789
Now I have figured out the regular expressions for the 1234-abcd, baa74739, and maps21342. However, I am having trouble figuring out a code to pull the numbers that stand alone. Does anyone have an opinion on a way around this?
Example of the regex I used:
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "([M][A][P][S]\d+)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
To match a stand-alone run of digits, replace
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "([M][A][P][S]\d+)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
with
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "(\b\d+\b)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
OR
dbout.Range("D7").Formula = "=RegexExtract(DH7,""(\b\d+\b)"")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
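One caveat worth checking with the sample string from the question: \b also sees a word boundary next to a hyphen, so \b\d+\b picks up the 1234 in 1234-abcd as well as the stand-alone 6789. The sketch below (Python, for illustration only; the stricter pattern and the name STRICT are my own suggestion, assuming the items are comma-separated) shows the difference. Whether RegexExtract returns the first or all matches depends on that function, which is not shown here.

```python
import re

text = "1234-abcd, baa74739, maps21342, 6789"

# Pattern from the answer: digits with a word boundary on each side.
naive = re.findall(r"\b\d+\b", text)

# Stricter alternative (an assumption): the digits must be a whole
# comma-separated item, i.e. start at the beginning or after a comma
# and be followed by a comma or the end of the string.
STRICT = re.compile(r"(?:^|,\s*)(\d+)(?=\s*(?:,|$))")
strict = STRICT.findall(text)
```

The stricter pattern uses only a non-capturing group and a lookahead, with the wanted digits in capture group 1.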

Using vbscript regular expression to conditional replace dates

I would like to use regex to replace the actual dates in a string with YYYYMMDD. However, my string might contain two types of dates: either 20160531 or 160531. I have to replace these with YYYYMMDD and YYMMDD respectively. Here are two examples:
Employment_salary_20160531 -> Employment_salary_YYYYMMDD
Employment_salary_160531 -> Employment_salary_YYMMDD
I'm wondering if it is possible to do this with a single regex, without using an If...Else statement?
Thank you!
This will give you stricter validation of the date that's entered. The other regex will work but it's dirty: it will accept 5000 as a year.
The short answer: ((19|20)\d{2}|[0-9]{2})(0[1-9]|1[0-2])([012][0-9]|3[0-1])
The Long but thoroughly tested answer...
stringtest1 = "Employment_salary_20160531"
stringtest2 = "Employment_salary_990212"
stringtest3 = "Employment_salary_990242"
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest1 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest1)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest2 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest2)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest3 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest3)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
Function sanitizedate(str)
Set objRE = New RegExp
objRE.Pattern = "((19|20)\d{2}|[0-9]{2})(0[1-9]|1[0-2])([012][0-9]|3[0-1])"
objRE.IgnoreCase = True
objRE.Global = False
objRE.Multiline = true
Set objMatch = objRE.Execute(str)
If objMatch.Count = 1 Then
Select Case Len(objMatch.Item(0))
Case 8
sanitizedate = Replace(str, objMatch.Item(0), "YYYYMMDD")
Case 6
sanitizedate = Replace(str, objMatch.Item(0), "YYMMDD")
End Select
Else
sanitizedate = str
End if
End Function
Validation Results
Trying: Employment_salary_20160531
=> Employment_salary_YYYYMMDD
Trying: Employment_salary_990212
=> Employment_salary_YYMMDD
Trying: Employment_salary_990242 failed because 42 is not a valid day
=> Employment_salary_990242
I'm not sure I understand you correctly, but it seems you want two different replacements, YYYYMMDD and YYMMDD, and that is not possible with one single pattern and a fixed replacement string.
You can match both formats with one pattern like this:
/(^(\d{4})(\d{2})(\d{2})$)|(^(\d{2})(\d{2})(\d{2})$)/
As you see, the pattern above matches both 20160531 and 160531. But you cannot replace them with YYYYMMDD (for 20160531) and YYMMDD (for 160531) in the same call. You can only replace every match with either YYYYMMDD or YYMMDD.
Otherwise you need two separate patterns if you want two separate replacements:
/^(\d{4})(\d{2})(\d{2})$/
/* and replace with `YYYYMMDD` */
/^(\d{2})(\d{2})(\d{2})$/
/* and replace with YYMMDD */
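Where the regex engine allows a function as the replacement, a single pattern can still drive both replacements, because the function can look at which alternative matched. Below is a minimal sketch in Python (used here for illustration only, since VBScript is not runnable in isolation); the names DATE_RUN and mask_date are hypothetical, and the pattern only checks the digit count, not whether the digits form a valid date.

```python
import re

# A run of exactly 8 or exactly 6 digits, not touching other digits.
# Group 1 is set only when the 8-digit alternative matched.
DATE_RUN = re.compile(r"(?<!\d)(?:(\d{8})|\d{6})(?!\d)")


def mask_date(name: str) -> str:
    """Replace an 8-digit run with YYYYMMDD and a 6-digit run with YYMMDD."""
    return DATE_RUN.sub(
        lambda m: "YYYYMMDD" if m.group(1) else "YYMMDD", name
    )


long_form = mask_date("Employment_salary_20160531")
short_form = mask_date("Employment_salary_160531")
```

The branching has not disappeared; it has moved from an If...Else around two regexes into the replacement function.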

Find text between two static strings

I parse message data into a CSV file via Outlook rules.
How can I take the example below and store the text under "Customer Log Update:" into a string variable?
[Header Data]
Description: Problem: A2 - MI ERROR - R8036
Customer Log Update:
I'm having trouble with order #458362. I keep getting Error R8036, can you please assist?
Thanks!
View problem at http://...
[Footer Data]
Desired result to be stored into the string variable (note that the result may contain newlines):
I'm having trouble with order #458362. I keep getting Error R8036, can you please assist?
Thanks!
I haven't attempted to code anything pertaining to my question.
Function RegFind(RegInput, RegPattern)
Dim regEx As New VBScript_RegExp_55.RegExp
Dim matches, s
regEx.Pattern = RegPattern
regEx.IgnoreCase = True
regEx.Global = False
s = ""
If regEx.Test(RegInput) Then
Set matches = regEx.Execute(RegInput)
For Each Match In matches
s = Match.Value
Next
RegFind = s
Else
RegFind = ""
End If
End Function
Sub CustomMailMessageRule(Item As Outlook.MailItem)
MsgBox "Mail message arrived: " & Item.Subject
Const FileWrite = "file.csv" 'file destination
Dim FF1 As Integer
Dim subj As String
Dim bod As String
On Error GoTo erh
subj = Item.Subject
'this gets a 15 digit number from the subject line
subj = RegFind(subj, "\d{15}")
bod = Item.Body
'following line helps formatting, lots of double newlines in my source data
bod = Replace(bod, vbCrLf & vbCrLf, vbCrLf)
'WRITE FILE
FF1 = FreeFile
Open FileWrite For Append As #FF1
Print #FF1, subj & "," & bod
Close #FF1
Exit Sub
erh:
MsgBox Err.Description, vbCritical, Err.Number
End Sub
While I would also go the more direct route like Jean-François Corbett did, as the parsing is very simple, you could apply the RegExp approach as below.
The pattern
Update:([\S\s]+)view
says match all characters between "Update:" and "view" and return them as a submatch.
The piece [\S\s] says match any non-whitespace or whitespace character, i.e. everything.
In VBScript a . matches everything but a newline, hence the need for the [\S\s] workaround for this application.
The submatch is then extracted by
objRegM(0).submatches(0)
Function ExtractText(strIn As String)
Dim objRegex As Object
Dim objRegM As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.ignorecase = True
.Pattern = "Update:([\S\s]+)view"
If .test(strIn) Then
Set objRegM = .Execute(strIn)
ExtractText = objRegM(0).submatches(0)
Else
ExtractText = "No match"
End If
End With
End Function
Sub JCFtest()
Dim messageBody As String
Dim result As String
messageBody = "Description: Problem: A2 - MI ERROR - R8036" & vbCrLf & _
"Customer Log Update:" & _
"I 'm having trouble with order #458362. I keep getting Error R8036, can you please assist?" & vbCrLf & _
"Thanks!" & vbCrLf & _
"View problem at http://..."
MsgBox ExtractText(messageBody)
End Sub
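For comparison, the same regex idea can be sketched in Python (my choice for a runnable illustration, not the original VBA). The sketch makes three assumptions of its own: it uses the full marker strings "Customer Log Update:" and "View problem at", it uses a non-greedy quantifier so the capture stops at the first end marker rather than the last, and it trims the surrounding line breaks. The function name extract_update is hypothetical.

```python
import re
from typing import Optional

# DOTALL lets "." match newlines, which plays the role of [\S\s] in VBScript.
UPDATE = re.compile(
    r"Customer Log Update:(.*?)View problem at",
    re.DOTALL | re.IGNORECASE,
)


def extract_update(body: str) -> Optional[str]:
    """Return the text between the two markers, or None if not found."""
    match = UPDATE.search(body)
    return match.group(1).strip() if match else None


message_body = (
    "Description: Problem: A2 - MI ERROR - R8036\r\n"
    "Customer Log Update:\r\n"
    "I'm having trouble with order #458362. I keep getting Error R8036, "
    "can you please assist?\r\n"
    "Thanks!\r\n"
    "View problem at http://..."
)
result = extract_update(message_body)
```

Returning None instead of the string "No match" keeps a failed lookup distinguishable from a message that happens to contain those words.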
Why not something simple like this:
Function GetCustomerLogUpdate(messageBody As String) As String
Const sStart As String = "Customer Log Update:"
Const sEnd As String = "View problem at"
Dim iStart As Long
Dim iEnd As Long
iStart = InStr(messageBody, sStart) + Len(sStart)
iEnd = InStr(messageBody, sEnd)
GetCustomerLogUpdate = Mid(messageBody, iStart, iEnd - iStart)
End Function
I tested it using this code and it worked:
Dim messageBody As String
Dim result As String
messageBody = "Description: Problem: A2 - MI ERROR - R8036" & vbCrLf & _
"Customer Log Update:" & vbCrLf & _
"I 'm having trouble with order #458362. I keep getting Error R8036, can you please assist?" & vbCrLf & _
"Thanks!" & vbCrLf & _
"View problem at http://..."
result = GetCustomerLogUpdate(messageBody)
Debug.Print result
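A sketch of the same InStr/Mid idea with guards for a missing marker, since in the function above a message without either marker would produce a wrong position or an invalid length. It is written in Python for illustration; the function name, the empty-string fallback, and the trimming of surrounding whitespace are my assumptions.

```python
def get_customer_log_update(
    message_body: str,
    start_marker: str = "Customer Log Update:",
    end_marker: str = "View problem at",
) -> str:
    """Return the text between the two markers, or "" if either is missing."""
    start = message_body.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = message_body.find(end_marker, start)
    if end == -1:
        return ""
    return message_body[start:end].strip()


message_body = (
    "Description: Problem: A2 - MI ERROR - R8036\r\n"
    "Customer Log Update:\r\n"
    "I'm having trouble with order #458362. I keep getting Error R8036, "
    "can you please assist?\r\n"
    "Thanks!\r\n"
    "View problem at http://..."
)
result = get_customer_log_update(message_body)
```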