Techniques for extracting regular expressions out of a labeled data set

Techniques for extracting regular expressions out of a labeled data set - regex

Let's suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain "label". Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.
I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?
I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm :)
PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)

This problem is usually framed as how to generate finite automata from sets of strings, rather than regular expressions, though you can obviously generate REs from FAs since they are equivalent.
If you search around for automata induction, you should be able to find quite a lot of literature on this topic, including GA approaches.

So far as I know, this is the subject of current research in evolutionary computation.
Here are some examples:
See slides 40-44 at
https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf
(slides exist as of the posting of this answer).
Also, see
http://www.citeulike.org/user/bartolialberto/article/10710768
for a more detailed review of a system presented at GECCO 2012.

Note: May this would help someway. This below function generates RegEx pattern for a given value of a and b. Where a and b both are alpha-strings. And the function would generate a fair RegEx pattern to match the range between a and b. The function would take only first three chars to produce the pattern and produces a result that might be something like starts-with() function in some language with hint of a general RegEx favor.
A simple VB.NET example
Public Function GetRangePattern(ByVal f_surname As String, ByVal l_surname As String) As String
Dim f_sn, l_sn As String
Dim mnLength% = 0, mxLength% = 0, pdLength% = 0, charPos% = 0
Dim fsn_slice$ = "", lsn_slice$ = ""
Dim rPattern$ = "^"
Dim alphas As New Collection
Dim tmpStr1$ = "", tmpStr2$ = "", tmpStr3$ = ""
'///init local variables
f_sn = f_surname.ToUpper.Trim
l_sn = l_surname.ToUpper.Trim
'///do null check
If f_sn.Length = 0 Or l_sn.Length = 0 Then
Return "-!ERROR!-"
End If
'///return if both equal
If StrComp(f_sn, l_sn, CompareMethod.Text) = 0 Then
Return "^" & f_sn & "$"
End If
'///return if 1st_name present in 2nd_name
If InStr(1, l_sn, f_sn, CompareMethod.Text) > 0 Then
tmpStr1 = f_sn
tmpStr2 = l_sn.Replace(f_sn, vbNullString)
If Len(tmpStr2) > 1 Then
tmpStr3 = "[A-" & tmpStr2.Substring(1) & "]*"
Else
tmpStr3 = tmpStr2 & "*"
End If
tmpStr1 = "^" & tmpStr1 & tmpStr3 & ".*$"
tmpStr1 = tmpStr1.ToUpper
Return tmpStr1
End If
'///initialize alphabets
alphas.Add("A", CStr(Asc("A")))
alphas.Add("B", CStr(Asc("B")))
alphas.Add("C", CStr(Asc("C")))
alphas.Add("D", CStr(Asc("D")))
alphas.Add("E", CStr(Asc("E")))
alphas.Add("F", CStr(Asc("F")))
alphas.Add("G", CStr(Asc("G")))
alphas.Add("H", CStr(Asc("H")))
alphas.Add("I", CStr(Asc("I")))
alphas.Add("J", CStr(Asc("J")))
alphas.Add("K", CStr(Asc("K")))
alphas.Add("L", CStr(Asc("L")))
alphas.Add("M", CStr(Asc("M")))
alphas.Add("N", CStr(Asc("N")))
alphas.Add("O", CStr(Asc("O")))
alphas.Add("P", CStr(Asc("P")))
alphas.Add("Q", CStr(Asc("Q")))
alphas.Add("R", CStr(Asc("R")))
alphas.Add("S", CStr(Asc("S")))
alphas.Add("T", CStr(Asc("T")))
alphas.Add("U", CStr(Asc("U")))
alphas.Add("V", CStr(Asc("V")))
alphas.Add("W", CStr(Asc("W")))
alphas.Add("X", CStr(Asc("X")))
alphas.Add("Y", CStr(Asc("Y")))
alphas.Add("Z", CStr(Asc("Z")))
'///populate max-min length values
mxLength = f_sn.Length
If l_sn.Length > mxLength Then
mnLength = mxLength
mxLength = l_sn.Length
Else
mnLength = l_sn.Length
End If
'///padding values
pdLength = mxLength - mnLength
f_sn = f_sn.PadRight(mxLength, "A")
'f_sn = f_sn.PadRight(mxLength, "~")
l_sn = l_sn.PadRight(mxLength, "Z")
'l_sn = l_sn.PadRight(mxLength, "~")
'///get a range like A??-B??
If f_sn.Substring(0, 1).ToUpper <> l_sn.Substring(0, 1).ToUpper Then
fsn_slice = f_sn.Substring(0, 3).ToUpper
lsn_slice = l_sn.Substring(0, 3).ToUpper
tmpStr1 = fsn_slice.Substring(0, 1) & fsn_slice.Substring(1, 1) & "[" & fsn_slice.Substring(2, 1) & "-Z]"
tmpStr2 = lsn_slice.Substring(0, 1) & lsn_slice.Substring(1, 1) & "[A-" & lsn_slice.Substring(2, 1) & "]"
tmpStr3 = "^(" & tmpStr1 & "|" & tmpStr2 & ").*$"
Return tmpStr3
End If
'///looping charwise
For charPos = 0 To mxLength
fsn_slice = f_sn.Substring(charPos, 1)
lsn_slice = l_sn.Substring(charPos, 1)
If StrComp(fsn_slice, lsn_slice, CompareMethod.Text) = 0 Then
rPattern = rPattern & fsn_slice
Else
'rPattern = rPattern & "("
If charPos < mxLength Then
Try
If Asc(fsn_slice) < Asc(lsn_slice) Then
tmpStr1 = fsn_slice & "[" & f_sn.Substring(charPos + 1, 1) & "-Z" & "]|"
If CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) < CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) Then
tmpStr2 = "[" & CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) & "-" & CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) & "]|"
Else
tmpStr2 = vbNullString
End If
tmpStr3 = lsn_slice & "[A-" & l_sn.Substring(charPos + 1, 1) & "]"
rPattern = rPattern & "(" & tmpStr1 & tmpStr2 & tmpStr3 & ").*$"
'MsgBox("f_sn:= " & f_sn & " -- l_sn:= " & l_sn & vbCr & rPattern)
Exit For
Else
Return "-#ERROR#-"
End If
Catch ex As Exception
Return "-|ERROR|-" & ex.Message
End Try
End If
End If
Next charPos
Return rPattern
End Function
And it is called as
?GetRangePattern("ABC","DEF")
produces this
"^(AB[C-Z]|DE[A-F]).*$"

Related

Lookbehind in Access VBA

I am trying to search for all occurrences of doubles in a long chunk of text. the text represents the description of multiple defects in a system. The doubles I am looking for are depths that are normally in the text multiple times as "n.nnnW X n.nnnL X n.nnnD". The n.nnn is normally 0.017D (example) but I want to account for 5.567D if that ever comes up.
The problem is that there are also occurrences of the terms within 1.5d of and also a .015dia. Case for the letters in these are also varied some are all caps and some are all lowercase. The text sometimes also has a space between the number and the "d" and sometimes spells out the word "deep" or "depth" like this: 0.017 deep.
I need the values to be extracted as doubles eventually so I can do math on them.
I have the following regexp pattern: [.](?:\d*\.)?\d+(\s?)[dD](?!ia|IA)
This pattern seems to find all the things I need and even eliminates the diameters that are spelled out as n.nnndia or n.nnnDIA. The thing the pattern DOES NOT catch is the within 1.5d of text.
After some light research I noted that in VBA the lookbehind code is NOT supported; and even so, I was never able to get the lookbehind pattern to work anyhow (using Regex101).
Here is my Access VBA code to illustrate how I am doing it. The r EXP(0) value is my pattern above.
Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM [NCR_RawImport-TEXT] WHERE " & skinTermSQL, dbOpenSnapshot)
'rEXP(0) = "[.](?:\d*\.)?\d+(\s?)[dD](?!ia|IA)"
tVAL = (CDbl(Me.cmbGDepth.Value) / 1000)
Do While Not rs2.EOF
found = False
Set regEXP1 = CreateObject("VBScript.RegExp")
regEXP1.IgnoreCase = True
regEXP1.Global = True
For i = 0 To rEXPIndx - 1
regEXP1.Pattern = rEXP(i)
Set Matches = regEXP1.Execute(rs2.NARR_TXT)
For Each Match In Matches
aVAL = CDbl(Trim(Replace(Replace(UCase(Match.Value), " D", ""), "D", ""))) 'convert matched value to a double.
If (aVAL >= tVAL) Then
found = True
End If
Next
Set Matches = Nothing
Next i
Set regEXP1 = Nothing
If (found) Then
strSql = "UPDATE [NCR_FinalData] SET [NCR_FinalData].SRCH = [NCR_FinalData].SRCH & 'G' WHERE [NCR_FinalData].SRCH Not Like '*G*' AND [NCR_FinalData].NC_KEY = '" & rs2.NC_KEY & "';"
Call writeLog("cmdUpdateNCRs: " & strSql)
DoCmd.RunSQL strSql
End If
rs2.MoveNext
Loop
Set rs2 = Nothing

Excel VBA to find and mask PAN data using regex for PCI DSS compliance

Because most of the tools to discover credit card data in file systems does no more that list the suspicious files, tools are needed to mask any data in files that must be retained.
For excel files, where loads of credit card data may exist, I figure a macro that detects credit card data in the selected column/row using regex and replaces the middle 6-8 digits with Xs would be useful to many. Sadly, I'm not a guru in the regex macro space.
The below basically works with regex for 3 card brands only, and works if the PAN is in a cell with other data (e.g. comments fields)
The below code works, but could be improved. It would be good to improve the regex to make it work for more/all card brands and reduce false-positives by including a LUHN algorithm check.
Improvements/Problems remaining :
Match all card brand's PANs with expanded regex
Include Luhn algorithm checking (FIXED - good idea Ron)
Improve the Do While logic (FIXED by stribizhev)
Even better handling of cells that don't contain PANs (FIXED)
Here's what I have so far which seems to be working ok for AmEx, Visa and Mastercard:
Sub PCI_mask_card_numbers()
' Written to mask credit card numbers in excel files in accordance with PCI DSS.
' Highlight the credit card data in the Excel sheet, then run this macro.
Dim strPattern As String: strPattern = "([4][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([5][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{2})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{6})([^a-zA-Z0-9_]?[0-9]{5})"
' Regex patterns for PANs above are broken into multiple parts (between the brackets)
' As such the when regex matches the first part of a PAN will fit into one of rMatch(k).SubMatches(#) where # is 0, 4, 8, 12, 16, 20 or 24.
' Visa start with a 4 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' MasterCard start with a 5 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' AmEx start with a 3 and is 15 digits long. Typically the pattern is 4-6-5, but data entry seems inconsistent
Dim strReplace As String: strReplace = ""
' Dim regEx As New RegExp ' if this line is used instead of the next 2, the MS VBS RegEx v5.5 needs to be enabled manually. The next 2 lines seem to do it from within the script
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
Dim regEx As New RegExp
Dim strInput As String
Dim Myrange As Range
Dim NewPAN As String
Dim Aproblem As String
Dim Masked As Long
Dim Problems As Long
Dim Total As Long
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern ' sets the regex pattern to match the pattern above
End With
Set Myrange = Selection
MsgBox ("The macro will now start masking credit card numbers identified in the selected cells only. If entire columns are selected, each column will take 10-30 seconds to complete. Ditto for Rows.")
For Each cell In Myrange
Total = Total + 1
' Check that the cell is a likely candidate for holding a PAN, not just a long number
If strPattern <> "" _
And cell.HasFormula = False _
And Left(cell.NumberFormat, 1) <> "$" _
And Mid(cell.NumberFormat, 3, 1) <> "$" Then
' cell.NumberFormat = "#"
strInput = cell.Value
' Depending on the data matching the regex pattern, fix it
If regEx.Test(strInput) Then
Set rMatch = regEx.Execute(strInput)
For k = 0 To rMatch.Count - 1
toReplace = rMatch(k).Value
' If the regex matched, replace the PAN based on its regex segment
Select Case 2
Case Is < Len(rMatch(k).SubMatches(0))
strReplace = rMatch(k).SubMatches(0) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(3))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(4))
strReplace = rMatch(k).SubMatches(4) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(7))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(8))
strReplace = rMatch(k).SubMatches(8) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(11))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(12))
strReplace = rMatch(k).SubMatches(12) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(13))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(16))
strReplace = rMatch(k).SubMatches(16) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(19))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(20))
strReplace = rMatch(k).SubMatches(20) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(23))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(24))
strReplace = rMatch(k).SubMatches(24) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(26))
Masked = Masked + 1
Case Else
Aproblem = cell.Value
Problems = Problems + 1
' MsgBox (Aproblem) ' only needed when curios
End Select
If cell.Value <> Aproblem Then
cell.Value = Replace(strInput, toReplace, strReplace)
End If
Next k
Else
' Adds the cell value to a variable to allow the macro to move past the cell
' Once the macro is trusted not to loop forever, the message box can be removed
' MsgBox ("Problem. Regex fail? Bad data = " & Aproblem)
End If
End If
Next cell
' All done, tell the user
MsgBox ("Cardholder data is now masked" & vbCr & vbCr & "Total cells highlighted (including blanks) = " & Total & vbCr & "Cells masked = " & Masked & vbCr & "Possible problem cells = " & Problems & vbCr & "All other cells were ignored")
End Sub

Back from vacation. Here's a simple VBA function that will test for the LUHN algorithm. The argument is a string of the digits; the result is boolean.
It generates a checksum digit and compares that digit with the one in the digit string you feed it.
Option Explicit
Function Luhn(sNum As String) As Boolean
'modulus 10 algorithm for various numbers
Dim X As Long, I As Long, J As Long
For I = Len(sNum) - 1 To 1 Step -2
X = X + DoubleSumDigits(Mid(sNum, I, 1))
If I > 1 Then X = X + Mid(sNum, I - 1, 1)
Next I
If Right(sNum, 1) = (X * 9) Mod 10 Then
Luhn = True
Else
Luhn = False
End If
End Function
Function DoubleSumDigits(L As Long) As Long
Dim X As Long
X = L * 2
If X > 9 Then X = Val(Left(X, 1)) + Val(Right(X, 1))
DoubleSumDigits = X
End Function

In Vb how to add look behind in Regular Expression

As per my requirement i need to check three conditions using regular expression.
'Find the dot position
dotPos = InStr(editFormat, ".")
'Find whole number of digits
wholeNum = Left$(editFormat, dotPos - 1)
'Find decimal number of digits
decNum = Mid(editFormat, dotPos + 1, strLen - dotPos - 1)
regularExp = "^[-]{0,1}[0-9]{0," & wholeNum & "}\" & DecSep & "[0-9]{0," & decNum & "}$"
Here, i need to validate
1) whole value(Before dot)
2) decimal value(After dot)
3) whole length of the input.
wholeNum = 30
decNum = 20
Ex: The value is 123456789012345678901234567890.12345678901234567890
As per my code this two conditions are working Fine.
But i need to add one more condition is Total length should be 40.
Ex: Possible inputs for your example: ( (30.10) or (20.20) or (25.15) )
1) 123456789012345678901234567890.1234567890 (Total should be 40)
2) 12345678901234567890.12345678901234567890
3) 1234567890123456789012345.678901234567890
How to add that condition in my code.
Thanks.

I'd use a lookahead assertion:
regularExp = "^" & "(?=.{40}$)" & "[-]{0,1}[0-9]{0," & wholeNum & "}\" & DecSep & "[0-9]{0," & decNum & "}$"
assuming you're counting including the dot.

VB.Net help selecting first index of string with regex

I was wondering if there was a way I could start a selection from the Regex string i have in the below example
The below example works exactly how I want it too however if there is text that matches before it on another line it is choosing the wrong text and highlighting it.
What im wondering is if there is a way to get the start index of the regex string?
If Regex.IsMatch(Me.TextBox1.Text, "\b" + Regex.Escape("is") + "\b") Then
Me.TextBox1.SelectionStart = Me.TextBox1.Text.IndexOf("is")
Dim linenumber As Integer = Me.TextBox1.GetLineFromCharIndex(Me.TextBox1.Text.IndexOf("is"))
Me.TextBox1.SelectionLength = Me.TextBox1.Lines(linenumber).Length
Me.TextBox1.Focus()
Me.TextBox1.SelectedText = "is " & Me.TextBox2.Text

The System.Text.RegularExpression.Match object has a property which should help you here: Match.Index. Match.Index will tell you where the capture starts, and Match.Length tells you how long it is. Using those you could change your code to look like this:
If Regex.IsMatch(Me.TextBox1.Text, "\b" + Regex.Escape("is") + "\b") Then
Dim m as Match
m = Regex.Match(Me.TextBox1.Text, "\b" + Regex.Escape("is") + "\b")
Me.TextBox1.SelectionStart = m.Index
Dim linenumber As Integer = Me.TextBox1.GetLineFromCharIndex(m.Index)
Me.TextBox1.SelectionLength = Me.TextBox1.Lines(linenumber).Length
Me.TextBox1.Focus()
Me.TextBox1.SelectedText = "is " & Me.TextBox2.Text

Given a certain regex pattern, re-arrange and manipulate the contents of a cell

I'm completely new to Macro programming in Excel, and can't find any references to resolve my question.
I have a column of several words in Excel. If the word is a verb it has an asterisk immediately succeeding the word (without a space).
For example, part of the column looks like this:
accuse*
accustomed be*
acid
acidic
acquire*
acre
So I need to:
Find all instances in which a cell ends with an asterisk, delete that asterisk, and put the word "to" in front of the cell entry.
If the word has "be*" then I need to delete the asterisk, the word "be", and put "to be" in the front.
The finished column should look like this:
to accuse
to be accustomed
acid
acidic
to acquire
acre
Is this possible?

try this (select the list then run this)
Sub Demo()
Dim cl As Range
For Each cl In Selection.Cells
If Trim(cl) Like "*[*]" Then
Cl = trim(cl)
cl = "to " & Left(cl, Len(cl) - 1)
End If
Next
End Sub
Actual data includes training Linefeed character, sometimes a space before the *, and sometimes terms after the *
This version deals with these issues
Sub Demo()
Dim cl As Range
Dim r As Range
Dim str As String, i As Long
Set r = Selection
For Each cl In r.Cells
str = Trim(cl)
str = Replace(str, " *", "*")
If Asc(Right$(str, 1)) <= 31 Then
str = Trim(Left$(str, Len(str) - 1))
End If
If str Like "* be[*]*" Then
i = InStr(str, "*")
cl = "to be " & Left$(str, i - 4 & Mid$(str, i + 1))
ElseIf str Like "*[*]*" Then
i = InStr(str, "*")
cl = "to " & Left$(str, i - 1) & Mid$(str, i + 1)
End If
Next
End Sub

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Techniques for extracting regular expressions out of a labeled data set - regex

Related

Lookbehind in Access VBA

Excel VBA to find and mask PAN data using regex for PCI DSS compliance

In Vb how to add look behind in Regular Expression

VB.Net help selecting first index of string with regex

Given a certain regex pattern, re-arrange and manipulate the contents of a cell

Categories

Resources