I am stumped on trying to figure out regular expressions so I thought I would ask the big dogs.
I have a string that can contain one to four sets, as follows:
1234-abcd, baa74739, maps21342, 6789
I have figured out the regular expressions for 1234-abcd, baa74739, and maps21342. However, I am having trouble figuring out a pattern to pull the numbers that stand alone. Does anyone have an opinion on a way around this?
Example of the regex I used:
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "([M][A][P][S]\d+)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
To extract the stand-alone digits, replace
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "([M][A][P][S]\d+)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
with
dbout.Range("D7").Formula = "=RegexExtract(DH7," & Chr(34) & "(\b\d+\b)" & Chr(34) & ")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
OR
dbout.Range("D7").Formula = "=RegexExtract(DH7,""(\b\d+\b)"")"
dbout.Range("D7").AutoFill Destination:=dbout.Range("D7:D2000")
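One caveat, sketched here in JavaScript rather than VBA: \b\d+\b also matches the 1234 in 1234-abcd, because the hyphen is a non-word character and so creates a word boundary. Assuming the sets are always comma-separated as in the example (that is an assumption, and extractStandaloneNumbers is a hypothetical helper name), a stricter pattern can require a delimiter on both sides:

```javascript
const sample = "1234-abcd, baa74739, maps21342, 6789";

// \b\d+\b finds "1234" first, because "-" is a non-word character.
const looseFirstMatch = sample.match(/\b\d+\b/)[0];

// Hypothetical helper: keep only digit runs that fill a whole comma-separated set.
function extractStandaloneNumbers(text) {
  // Start of string or a comma before, a comma or end of string after.
  const pattern = /(?:^|,)\s*(\d+)\s*(?=,|$)/g;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(looseFirstMatch);                  // 1234
console.log(extractStandaloneNumbers(sample)); // [ '6789' ]
```

Whether the same pattern can be passed to RegexExtract depends on the regex engine behind that function, which the question does not show.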
How can I modify my regex so that it will ignore the comments in the pattern in a language that doesn't support lookbehind?
My regex pattern is:
\b{Word}\b(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
\b{Word}\b : Whole word, {word} is replaced iteratively for the vocab list
(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) : Don't replace anything inside of quotes
My goal is to lint variables and words so that they always have the same case. However I do not want to lint any words in a comment. (The IDE sucks and there is no other option)
Comments in this language are prefixed by an apostrophe. Sample code follows
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.value = 1234 ' Set value
value = 123
Basically, I want the linter to take the above code and, for the word "value", update it to:
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.Value = 1234 ' Set value
Value = 123
So every "Value" in code is updated, but nothing inside double quotes, inside a comment, or within another word such as valueadded is touched.
I've tried several solutions but haven't been able to get it to work.
['.*] : Not preceded by an apostrophe
(?<!\s*') : Lookbehind: not preceded by an apostrophe, with or without spaces
(?<!\s*') : This second attempt seemed incorrect, and it won't work anyway because the language doesn't support lookbehind
Does anybody have any ideas how I can alter my pattern so that I don't edit commented variables?
VBA
Sub TestSO()
    Dim Code As String
    Dim Expected As String
    Dim Actual As String
    Dim Words As Variant
    Code = "item = object.value ' Put item in value" & vbNewLine & _
        "some.item <> some.otheritem" & vbNewLine & _
        "' This is a comment, what is it's value?" & vbNewLine & _
        "Object.value = 1234 ' Set value" & vbNewLine & _
        "value = 123" & vbNewLine
    Expected = "Item = object.Value ' Put item in value" & vbNewLine & _
        "some.Item <> some.otheritem" & vbNewLine & _
        "' This is a comment, what is it's value?" & vbNewLine & _
        "Object.Value = 1234 ' Set value" & vbNewLine & _
        "Value = 123" & vbNewLine
    Words = Array("Item", "Value")
    Actual = SOLint(Words, Code)
    Debug.Print Actual = Expected
    Debug.Print "CODE: " & vbNewLine & Code
    Debug.Print "Actual: " & vbNewLine & Actual
    Debug.Print "Expected: " & vbNewLine & Expected
End Sub
Public Function SOLint(ByVal Words As Variant, ByVal FileContents As String) As String
    Const NotInQuotes As String = "(?=([^""\\]*(\\.|""([^""\\]*\\.)*[^""\\]*""))*[^""]*$)"
    Dim RegExp As Object
    Dim Regex As String
    Dim Index As Variant
    Set RegExp = CreateObject("VBScript.RegExp")
    With RegExp
        .Global = True
        .IgnoreCase = True
    End With
    For Each Index In Words
        Regex = "[('*)]\b" & Index & "\b" & NotInQuotes
        RegExp.Pattern = Regex
        FileContents = RegExp.Replace(FileContents, Index)
    Next Index
    SOLint = FileContents
End Function
As discussed in the comments above:
((?:\".*\")|(?:'.*))|\b(v)(alue)\b
This regex has three parts, combined with alternation:
A non-capturing group for text within double quotes, as we don't need that.
A non-capturing group for text starting with a single quote.
Finally, the string "value" is split into two groups, (v) and (alue), so that the replacement can use \U$2 to convert v to V and \E$3 to leave the rest as is, where \U converts to upper case and \E turns the case conversion off.
\b \b - word boundaries ensure that only the stand-alone word is matched, not text that is part of a longer word such as valueadded.
https://regex101.com/r/mD9JeR/8
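Not every engine supports \U and \E in replacement strings (JavaScript, for one, does not). The same idea, matching quoted strings and comments first so that they are consumed and then deciding per match whether to touch it, can be sketched with a replacement callback. This is an illustrative JavaScript sketch, not the VBA solution; lintWords and escapeRegExp are hypothetical names, and it assumes strings do not span lines and comments run from an apostrophe to the end of the line:

```javascript
// Escape regex metacharacters so a vocabulary word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical linter: rewrite each word to its canonical casing,
// leaving double-quoted strings and apostrophe comments untouched.
function lintWords(code, words) {
  let result = code;
  for (const word of words) {
    // Alternatives are tried in order: string, comment, then the whole word.
    const pattern = new RegExp(
      `"[^"\\n]*"|'.*|\\b${escapeRegExp(word)}\\b`,
      "gi"
    );
    result = result.replace(pattern, (match) => {
      const first = match[0];
      // Strings and comments are matched only to be skipped.
      if (first === '"' || first === "'") return match;
      return word;
    });
  }
  return result;
}

const source = [
  "item = object.value ' Put item in value",
  "some.item <> some.otheritem",
  "' This is a comment, what is it's value?",
  "Object.value = 1234 ' Set value",
  "value = 123",
].join("\n");

console.log(lintWords(source, ["Item", "Value"]));
```

Because the string and comment alternatives come first, the word alternative is only reached for text outside strings and comments, so no lookbehind is needed.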
I would like to use regex to replace the actual dates in a string with a placeholder. However, my strings might contain two types of dates: either 20160531 or 160531. I have to replace these with YYYYMMDD and YYMMDD respectively. The following are two examples:
Employment_salary_20160531 -> Employment_salary_YYYYMMDD
Employment_salary_160531 -> Employment_salary_YYMMDD
Wondering if it is possible to do this within a single regex without using an IFELSE statement?
Thank you!
This will validate the date that's entered more strictly, although it still accepts a day of 00 and impossible combinations such as 0231. The other regex will work but it's dirty: it will accept 5000 as a year.
The short answer: ((19|20)\d{2}|[0-9]{2})(0[1-9]|1[0-2])([012][0-9]|3[0-1])
The Long but thoroughly tested answer...
stringtest1 = "Employment_salary_20160531"
stringtest2 = "Employment_salary_990212"
stringtest3 = "Employment_salary_990242"
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest1 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest1)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest2 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest2)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
wscript.echo "Trying: " & stringtest3 & vbcrlf & vbcrlf & vbtab & " => " & sanitizedate(stringtest3)
wscript.echo : wscript.echo "---------------------------------------------------" : wscript.echo
Function sanitizedate(str)
    Set objRE = New RegExp
    objRE.Pattern = "((19|20)\d{2}|[0-9]{2})(0[1-9]|1[0-2])([012][0-9]|3[0-1])"
    objRE.IgnoreCase = True
    objRE.Global = False
    objRE.Multiline = True
    Set objMatch = objRE.Execute(str)
    If objMatch.Count = 1 Then
        ' Len returns a number, so the cases are numeric too
        Select Case Len(objMatch.Item(0))
            Case 8
                sanitizedate = Replace(str, objMatch.Item(0), "YYYYMMDD")
            Case 6
                sanitizedate = Replace(str, objMatch.Item(0), "YYMMDD")
        End Select
    Else
        ' No date found: return the input unchanged
        sanitizedate = str
    End If
End Function
Validation Results
Trying: Employment_salary_20160531
=> Employment_salary_YYYYMMDD
Trying: Employment_salary_990212
=> Employment_salary_YYMMDD
Trying: Employment_salary_990242 failed because 42 is not a valid day
=> Employment_salary_990242
I'm not sure I understand you correctly, but it seems there are two different replacements, YYYYMMDD and YYMMDD, and that is impossible with a single pattern and a single replacement string.
You can match those two separate patterns with this:
/(^(\d{4})(\d{2})(\d{2})$)|(^(\d{2})(\d{2})(\d{2})$)/
Online Demo
As you can see, the pattern above matches both 20160531 and 160531. But you cannot replace them with both YYYYMMDD (for 20160531) and YYMMDD (for 160531); you can only replace both with either YYYYMMDD or YYMMDD.
Otherwise you need two separate patterns if you want two separate replacements:
/^(\d{4})(\d{2})(\d{2})$/
/* and replace with `YYYYMMDD` */
/^(\d{2})(\d{2})(\d{2})$/
/* and replace with YYMMDD */
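Where the engine accepts a function as the replacement, a single pattern can still produce both outputs, because the function can look at the length of each match. A minimal JavaScript sketch, assuming the date is not directly adjacent to other digits; maskDates and DATE_PATTERN are hypothetical names, and the pattern is a slightly tightened variant of the one from the other answer (days 01 to 31), not a full calendar check:

```javascript
// Year (4 digits starting 19/20, or 2 digits), month 01-12, day 01-31,
// not preceded or followed by another digit.
const DATE_PATTERN =
  /(?<!\d)(?:(?:19|20)\d{2}|\d{2})(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])(?!\d)/g;

// Hypothetical helper: the callback picks the placeholder from the match length.
function maskDates(text) {
  return text.replace(DATE_PATTERN, (match) =>
    match.length === 8 ? "YYYYMMDD" : "YYMMDD"
  );
}

console.log(maskDates("Employment_salary_20160531")); // Employment_salary_YYYYMMDD
console.log(maskDates("Employment_salary_160531"));   // Employment_salary_YYMMDD
console.log(maskDates("Employment_salary_990242"));   // Employment_salary_990242
```

The length check still happens, but inside the replacement callback rather than in a separate IF/ELSE around two regex calls.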
I have this simple line of code using regular expressions where I want to substitute pieces of strings with empty space:
newAddress = myAddress.replace(/^.*?(ramp|arterial|majorroad|street &|highway &|highway|street|street &|street & highway|arterial & street|street & arterial|majorroad &|majorroad & ramp|ramp & majorroad|major road|highway & majorroad)\,/gi, '');
but with this value in the variable:
Highway & Contrada Torremuzza, 95121 Catania CT
why didn't it remove the "highway &" part?
It looks to me like you need neither the .* nor the comma. The .* will cause you to replace everything that precedes your string.
Try just this:
(ramp|arterial|majorroad|street &|highway &|highway|street|street &|street & highway|arterial & street|street & arterial|majorroad &|majorroad & ramp|ramp & majorroad|major road|highway & majorroad)
Or, if you're in a mood for fancy optimizations:
(?:majorroad & )?ramp|(?:major r|(?:(?:ramp|highway) & )?majorr)oad|(?:highway|majorroad|street) &|(?:arterial & )?street|(?:street & )?(?:arterial|highway)
Just kidding. In theory this is more efficient, but it's harder to maintain.
It is trying to match a comma as well; you need to make the comma optional or remove it in this case. Also, unless you want to remove the preceding text as well, remove the beginning-of-string anchor ^ and the .*?
newAddress = myAddress.replace(/(ramp|arterial|majorroad|street &|highway &|highway|street|street &|street & highway|arterial & street|street & arterial|majorroad &|majorroad & ramp|ramp & majorroad|major road|highway & majorroad)/gi, '');
I think I just solved it myself with:
newAddress = myAddress.replace(/^.*?ramp|arterial|majorroad|street|highway| &|\,/gi, '');
shorter and more efficient...so at least it will match the word plus the &
Cheers,
Luigi
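A note on that last pattern, with a hedged sketch: without a group, ^.*? applies only to the ramp alternative, and the bare \, alternative deletes every comma in the address. Assuming the road-type tags only ever appear as a prefix, optionally chained with "&" (an assumption about the data; ROAD_PREFIX and stripRoadPrefix are hypothetical names), an anchored pattern avoids both effects:

```javascript
const address = "Highway & Contrada Torremuzza, 95121 Catania CT";

// Assumption: road-type tags only appear at the start, optionally joined by "&".
const ROAD_PREFIX =
  /^(?:\s*(?:ramp|arterial|major ?road|street|highway)\b\s*&?)+\s*/i;

function stripRoadPrefix(addr) {
  return addr.replace(ROAD_PREFIX, "");
}

// For comparison, the ungrouped pattern from the post above.
const ungrouped = /^.*?ramp|arterial|majorroad|street|highway| &|\,/gi;

console.log(stripRoadPrefix(address));
// Contrada Torremuzza, 95121 Catania CT
console.log(JSON.stringify(address.replace(ungrouped, "")));
// " Contrada Torremuzza 95121 Catania CT" (leading space kept, comma lost)
```

Listing each road type once and treating "&" as an optional suffix also removes the need to enumerate every combination such as street & highway.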
This is probably a simple question for someone experienced with regex, but I'm having a little trouble. I'm looking to match lines of data like this shown below:
SomeAlpha Text CrLf CrLf 15 CrLf CrLf 123 132 143 CrLf CrLf 12313 CrLf CrLf 12/123
Where the "SomeAlpha Text" is just some text with space and potentially punctuation. The first number is something between 1 and 30,000. The second set of numbers (123 132 143) are between 1 and 500,000 (each number). The next number is somewhere between 1 and 500,000. The final set is (1–30,000)/(1–30,000). This is the code I've put together so far:
Dim Pattern As String = "[.*]{1,100}" & vbCrLf & "" & vbCrLf & "[0-9]{1,4}" & vbCrLf & "" & vbCrLf & "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]" & vbCrLf & "" & vbCrLf & "[0-9]{1,6}" & vbCrLf & "" & vbCrLf & "[0-9]{1,5}/[0-9]{1,5}"
For Each match As Match In Regex.Matches(WebBrowser1.DocumentText.ToString, Pattern, RegexOptions.IgnoreCase)
RichTextBox1.AppendText(match.ToString & Chr(13) & Chr(13))
Next
And I'm currently getting 0 matches, even though I know there should be at least 1 match. Any advice on where my pattern is wrong would be great! Thanks.
"[.*]{1,100}" & vbCrLf & "" & vbCrLf & "[0-9]{1,4}" & vbCrLf & "" & vbCrLf & "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]" & vbCrLf & "" & vbCrLf & "[0-9]{1,6}" & vbCrLf & "" & vbCrLf & "[0-9]{1,5}/[0-9]{1,5}"
has quite a few problems:
Inside square brackets, . and * are literal characters, so "[.*]{1,100}" only matches runs of dots and asterisks rather than arbitrary text. Replace it with ".{1,100}" or ".*"
You say the first number is between 1 and 30,000. "[0-9]{1,4}" only allows for 4 digits (0 to 9999). Replace it with "[0-9]{1,5}", which allows for any number between 0 and 99999.
You accidentally put ] instead of } at three places in this part: "[0-9]{1,6] [0-9]{1,6] [0-9]{1,6]". Replace it with "[0-9]{1,6} [0-9]{1,6} [0-9]{1,6}"
Try doing what I said above. It should work correctly.
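The corrected pattern can be checked quickly outside VB.NET. This is a JavaScript sketch, assuming the "CrLf" markers in the question stand for real \r\n line breaks; the sample string and the names page, literalClass and recordPattern are made up for illustration:

```javascript
// Assumed sample: each "CrLf" in the question is a real \r\n sequence.
const page =
  "SomeAlpha Text\r\n\r\n15\r\n\r\n123 132 143\r\n\r\n12313\r\n\r\n12/123";

// Inside a character class, "." and "*" are literal characters.
const literalClass = /^[.*]{1,100}$/;
console.log(literalClass.test("SomeAlpha Text")); // false
console.log(literalClass.test("..*"));            // true

// Corrected pieces, joined by the blank line (\r\n\r\n) between fields.
const recordPattern = new RegExp(
  [
    ".{1,100}",
    "[0-9]{1,5}",
    "[0-9]{1,6} [0-9]{1,6} [0-9]{1,6}",
    "[0-9]{1,6}",
    "[0-9]{1,5}/[0-9]{1,5}",
  ].join("\\r\\n\\r\\n"),
  "g"
);

const records = page.match(recordPattern) || [];
console.log(records.length); // 1
```

Note that these digit counts only bound the length of each number; they do not enforce the upper limits of 30,000 or 500,000.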
Let's suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain "label". Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.
I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programmatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?
I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm :)
PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)
This problem is usually framed as how to generate finite automata from sets of strings, rather than regular expressions, though you can obviously generate REs from FAs since they are equivalent.
If you search around for automata induction, you should be able to find quite a lot of literature on this topic, including GA approaches.
So far as I know, this is the subject of current research in evolutionary computation.
Here are some examples:
See slides 40-44 at
https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf
(slides exist as of the posting of this answer).
Also, see
http://www.citeulike.org/user/bartolialberto/article/10710768
for a more detailed review of a system presented at GECCO 2012.
Note: maybe this will help in some way. The function below generates a RegEx pattern for given values of a and b, where a and b are both alphabetic strings. It generates a rough RegEx pattern to match the range between a and b. The function uses only the first three characters to produce the pattern, so the result behaves something like a starts-with() function in some languages, with a hint of general RegEx flavor.
A simple VB.NET example
Public Function GetRangePattern(ByVal f_surname As String, ByVal l_surname As String) As String
Dim f_sn, l_sn As String
Dim mnLength% = 0, mxLength% = 0, pdLength% = 0, charPos% = 0
Dim fsn_slice$ = "", lsn_slice$ = ""
Dim rPattern$ = "^"
Dim alphas As New Collection
Dim tmpStr1$ = "", tmpStr2$ = "", tmpStr3$ = ""
'///init local variables
f_sn = f_surname.ToUpper.Trim
l_sn = l_surname.ToUpper.Trim
'///do null check
If f_sn.Length = 0 Or l_sn.Length = 0 Then
Return "-!ERROR!-"
End If
'///return if both equal
If StrComp(f_sn, l_sn, CompareMethod.Text) = 0 Then
Return "^" & f_sn & "$"
End If
'///return if 1st_name present in 2nd_name
If InStr(1, l_sn, f_sn, CompareMethod.Text) > 0 Then
tmpStr1 = f_sn
tmpStr2 = l_sn.Replace(f_sn, vbNullString)
If Len(tmpStr2) > 1 Then
tmpStr3 = "[A-" & tmpStr2.Substring(1) & "]*"
Else
tmpStr3 = tmpStr2 & "*"
End If
tmpStr1 = "^" & tmpStr1 & tmpStr3 & ".*$"
tmpStr1 = tmpStr1.ToUpper
Return tmpStr1
End If
'///initialize alphabets
'///each letter A-Z is stored under its character code as key
For code As Integer = Asc("A") To Asc("Z")
alphas.Add(CStr(Chr(code)), CStr(code))
Next code
'///populate max-min length values
mxLength = f_sn.Length
If l_sn.Length > mxLength Then
mnLength = mxLength
mxLength = l_sn.Length
Else
mnLength = l_sn.Length
End If
'///padding values
pdLength = mxLength - mnLength
f_sn = f_sn.PadRight(mxLength, "A")
'f_sn = f_sn.PadRight(mxLength, "~")
l_sn = l_sn.PadRight(mxLength, "Z")
'l_sn = l_sn.PadRight(mxLength, "~")
'///get a range like A??-B??
If f_sn.Substring(0, 1).ToUpper <> l_sn.Substring(0, 1).ToUpper Then
fsn_slice = f_sn.Substring(0, 3).ToUpper
lsn_slice = l_sn.Substring(0, 3).ToUpper
tmpStr1 = fsn_slice.Substring(0, 1) & fsn_slice.Substring(1, 1) & "[" & fsn_slice.Substring(2, 1) & "-Z]"
tmpStr2 = lsn_slice.Substring(0, 1) & lsn_slice.Substring(1, 1) & "[A-" & lsn_slice.Substring(2, 1) & "]"
tmpStr3 = "^(" & tmpStr1 & "|" & tmpStr2 & ").*$"
Return tmpStr3
End If
'///looping charwise
For charPos = 0 To mxLength - 1
fsn_slice = f_sn.Substring(charPos, 1)
lsn_slice = l_sn.Substring(charPos, 1)
If StrComp(fsn_slice, lsn_slice, CompareMethod.Text) = 0 Then
rPattern = rPattern & fsn_slice
Else
'rPattern = rPattern & "("
If charPos < mxLength Then
Try
If Asc(fsn_slice) < Asc(lsn_slice) Then
tmpStr1 = fsn_slice & "[" & f_sn.Substring(charPos + 1, 1) & "-Z" & "]|"
If CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) < CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) Then
tmpStr2 = "[" & CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) & "-" & CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) & "]|"
Else
tmpStr2 = vbNullString
End If
tmpStr3 = lsn_slice & "[A-" & l_sn.Substring(charPos + 1, 1) & "]"
rPattern = rPattern & "(" & tmpStr1 & tmpStr2 & tmpStr3 & ").*$"
'MsgBox("f_sn:= " & f_sn & " -- l_sn:= " & l_sn & vbCr & rPattern)
Exit For
Else
Return "-#ERROR#-"
End If
Catch ex As Exception
Return "-|ERROR|-" & ex.Message
End Try
End If
End If
Next charPos
Return rPattern
End Function
And it is called as
?GetRangePattern("ABC","DEF")
produces this
"^(AB[C-Z]|DE[A-F]).*$"
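A quick check of that output, sketched in JavaScript with made-up sample surnames: the pattern accepts names starting with AB[C-Z] or DE[A-F], but not names that sort between the two bounds, such as those starting with B or C, so it may need widening depending on what "range" is supposed to mean.

```javascript
// The pattern produced by GetRangePattern("ABC", "DEF") above.
const rangePattern = /^(AB[C-Z]|DE[A-F]).*$/;

// Hypothetical surnames, upper-cased as the function does.
const samples = ["ABCDE", "DEFOE", "BAKER", "CARTER"];

for (const name of samples) {
  console.log(name, rangePattern.test(name));
}
// ABCDE true
// DEFOE true
// BAKER false
// CARTER false
```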