Need finer tuning on a regular expression with HTML

What I want to do is figure out how to use a regular expression to extract the innermost item from a nested collection of HTML tags. That is:
TARGET TEXT
Function FindInnerHtml(Work As String) As String
Dim Results As String, myRegExp As RegExp, myMatches As Object, thisMatch As Object
' RegExp is an object, so it is assigned with Set rather than Let
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Global = True
myRegExp.Pattern = ">(.*?)<"
Set myMatches = myRegExp.Execute(Work)
' Only read the first match when at least one exists
If (myMatches.Count > 0) Then
Results = myMatches(0)
Results = Replace$(Replace$(Results, ">", ""), "<", "")
End If
FindInnerHtml = Results
End Function
The function does return the inner HTML, that is, the target text. What I would rather do is avoid needing that double Replace$() to clean up the result.

It's crude and fails miserably for edge cases but something like this could work:
<[a-zA-Z]{1}[a-zA-Z\d]*>([^><]*)</[a-zA-Z]{1}[a-zA-Z\d]*>
$1 will contain the inner text
https://regex101.com/r/iuLdJV/3
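A minimal sketch of how the capture group from the pattern above could be read in VBA, so that no Replace$() clean-up is needed. The function name FindInnerText and the sample markup are hypothetical (the question's original markup is not shown), and the object is late-bound so no library reference is assumed.

```vba
' Sketch only: returns the text captured by group 1 of the pattern above.
Function FindInnerText(ByVal Work As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Global = False          ' only the first match is needed
    ' Opening tag, text containing no angle brackets (group 1), closing tag
    re.Pattern = "<[a-zA-Z][a-zA-Z\d]*>([^><]*)</[a-zA-Z][a-zA-Z\d]*>"
    Set matches = re.Execute(Work)
    If matches.Count > 0 Then
        ' SubMatches is zero-based: index 0 holds capture group 1 ($1)
        FindInnerText = matches(0).SubMatches(0)
    End If
End Function

Sub DemoFindInnerText()
    ' Hypothetical input; prints TARGET TEXT
    Debug.Print FindInnerText("<td><span><b>TARGET TEXT</b></span></td>")
End Sub
```

Note that the pattern does not check that the closing tag has the same name as the opening tag, and it will not match an opening tag that carries attributes.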

Related

How to create a regex VBA macro for GIIN format validation

I'm trying to create a macro that will verify data in one column and then let me know if they are correctly formatted in the next column. I am very new to VBA so I apologize if my code is messy.
The format I am trying to verify is ABC123.AB123.AB.123 -- The first two sections can contain letters/numbers, the third section only letters, and the last section only numbers.
Any guidance would be greatly appreciated!
Function ValidGIIN(myGIIN As String) As String
Dim regExp As Object
Set regExp = CreateObject("VBScript.Regexp")
If Len(myGIIN) Then
With regExp
.Global = True
.IgnoreCase = True
.Pattern = "[a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][.][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_][.][a-zA-z_][a-zA-z_][.][0-9][0-9][0-9]"
End With
If regExp.Test(myGIIN) = True Then
ValidGIIN = "Valid"
Else
ValidGIIN = "Invalid"
End If
End If
Set regExp = Nothing
End Function
Try the following pattern
[a-zA-Z0-9]{6}\.[a-zA-Z0-9]{5}\.[A-Za-z]{2}\.\d{3}
You could call your function in a loop over the cells in a column and use Offset(0, 1) to write the result to the next column to the right.
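The loop described above might look roughly like this. It is a sketch under several assumptions that are not in the original: the data sits in column A of the active sheet starting at row 1, and the ^ and $ anchors are an addition so that the whole cell has to match rather than just part of it. The name CheckGIINColumn is hypothetical.

```vba
Sub CheckGIINColumn()
    Dim re As Object, cell As Range, lastRow As Long
    Set re = CreateObject("VBScript.RegExp")
    ' Suggested pattern, with ^ and $ added (an assumption) to reject extra text
    re.Pattern = "^[a-zA-Z0-9]{6}\.[a-zA-Z0-9]{5}\.[A-Za-z]{2}\.\d{3}$"
    With ActiveSheet
        ' Last used row in column A (assumed location of the data)
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        For Each cell In .Range("A1:A" & lastRow)
            ' Skip error values, which cannot be converted to a string
            If Not IsError(cell.Value2) Then
                If Len(cell.Value2) > 0 Then
                    ' Write the verdict one column to the right
                    cell.Offset(0, 1).Value = IIf(re.Test(CStr(cell.Value2)), "Valid", "Invalid")
                End If
            End If
        Next cell
    End With
End Sub
```

Creating the RegExp object once outside the loop avoids rebuilding it for every cell.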

Excel VBA using RegEx for Conditional Formatting

I have an Excel 2010 VBA macro that does some conditional formatting over a select area of a spreadsheet. As an example the following snippet searches for a text pattern then colors the cell:
Selection.FormatConditions.Add Type:=xlTextString, String:="TextToMatch", _
TextOperator:=xlContains
Selection.FormatConditions(Selection.FormatConditions.Count).SetFirstPriority
With Selection.FormatConditions(1).Interior
.PatternColorIndex = xlAutomatic
.ColorIndex = 36
.TintAndShade = 0
End With
Selection.FormatConditions(1).StopIfTrue = False
What I would like to add is to match against a regular expression TN[0-9]. A simple match of the string TN followed by a digit.
I have created the RegExp object:
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.Pattern = "TN[0-9]"
End With
However I have not figured out how to apply this to the Selection.
As always, thank you for your assistance.
I would recommend using a Static type object for your VBScript.RegExp object.
Cut the range passed into the function down to the Worksheet.UsedRange property. This allows a selection of full columns without calculating empty rows/columns.
Option Explicit
Sub createCFR()
With Selection
'cut Selection down to the .UsedRange so that full row or full
'column references do not use undue calculation
With Intersect(.Cells, .Cells.Parent.UsedRange)
.FormatConditions.Delete
With .FormatConditions.Add(Type:=xlExpression, Formula1:="=myCFR(" & .Cells(1).Address(0, 0) & ")")
.SetFirstPriority
With .Interior
.PatternColorIndex = xlAutomatic
.ColorIndex = 36
.TintAndShade = 0
End With
.StopIfTrue = False
End With
End With
End With
End Sub
Function myCFR(rng As Range)
Static rgx As Object
'with rgx as static, it only has to be created once
'this is beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
'make sure rng is a single cell
Set rng = rng.Cells(1, 1)
With rgx
.Global = True
.MultiLine = True
.Pattern = "TN[0-9]"
myCFR = .Test(rng.Value2)
End With
End Function
Depending on your Selection, you may need to modify the parameters of the Range.Address property used to create the CFR; e.g. $A1 would be .Address(1, 0).
As a check, B2:B7 contain =myCFR(A2) filled down to prove the UDF.

Strip HTML tags from string, keep specific

I'm trying to strip all HTML tags from a string but keep specific ones (both the tag and its attributes).
I have this:
set objRegExp = new RegExp
with objRegExp
.Pattern = "<^((b)|(i)|(em)|(strong)|(br)|(img))>.*</.*>"
.Global = True
end with
and using:
objRegExp.replace(request.form("content"), "")
doesn't change anything.
I need this for a forum that I am building, which uses a WYSIWYG editor, and I want to prevent XSS and SQL injection.
To strip all HTML Tags:
Public Function RegexAllHtml(strValue)
Set RegularExpressionObject = New RegExp
With RegularExpressionObject
.Pattern = "<(.|\n)+?>"
.IgnoreCase = True
.Global = True
End With
Dim strResult: strResult = RegularExpressionObject.Replace(strValue, " ")
Set RegularExpressionObject = Nothing
RegexAllHtml = strResult
End Function
To remove specific tags (eg. SPAN) you could use something like:
<SPAN[^><]*>|<.SPAN[^><]*>
Or to keep specific tags (e.g. b and strong): <(?!/?(?:strong|b)\b)[^>]*>
BTW: Most WYSIWYG editors let you configure which tags are not safe; those are then removed before the content is saved. See, for example, CKEditor: http://docs.ckeditor.com/#!/api/CKEDITOR.config-cfg-allowedContent
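As a rough illustration of the "keep specific tags" pattern above, the following sketch removes every tag except b and strong. The function name StripTagsExcept and the sample input are hypothetical; the code avoids typed parameters so that it should run as either VBScript or VBA.

```vba
Function StripTagsExcept(html)
    Dim re
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any tag unless its name is exactly "strong" or "b".
    ' The optional "/" covers closing tags; \b stops "br" or "body" counting as "b".
    re.Pattern = "<(?!/?(?:strong|b)\b)[^>]*>"
    StripTagsExcept = re.Replace(html, "")
    Set re = Nothing
End Function

' Hypothetical usage:
'   StripTagsExcept("<p>Hi <b>there</b><br></p>")  returns  Hi <b>there</b>
```

Because kept tags pass through with their attributes intact, something like an onclick attribute on a b tag would survive, so this alone is unlikely to be enough as an XSS defence.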
Function RegexAllHtml... is...
If the string contains the string "<spec...>" it will be removed. I don't think it's a valid function.
Should we warn users not to use "<" and ">" individually?

Regular expression replace body content

I tried the following to replace all the text content in the currently open document with the digit zero, but it doesn't work:
Set objWdDoc = Word.Application.ActiveDocument
Set objWdRange = objWdDoc.Content
Dim re As New RegExp
re.Global = True
re.Pattern = "[a-z]"
re.IgnoreCase = True
objWdRange = re.Replace(objWdRange, "0")
Can anyone suggest a working method?
Assuming you have added a reference to Microsoft VBScript Regular Expressions:
objWdRange.Text = re.Replace(objWdRange, "0")
Will work, although you will of course lose any formatting.
You can also use the built-in search/replace which has limited support to find digits/characters. Record a macro of yourself doing this and you can examine the code.
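A sketch of what the built-in route could look like, using Word's wildcard search instead of RegExp. The name ReplaceLettersWithZero is hypothetical. Word wildcards are not regular expressions and are case-sensitive, which is why both letter ranges are listed.

```vba
Sub ReplaceLettersWithZero()
    ' Works on the main body of the active document only
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "[A-Za-z]"          ' wildcard character range, not a regex
        .Replacement.Text = "0"
        .MatchWildcards = True      ' enables the [ ] range syntax
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Since each character is replaced in place rather than the whole range text being reassigned, the existing formatting should be preserved.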

Microsoft Office Access `LIKE` vs `RegEx`

I have been having trouble with the Access keyword LIKE and its use. I want to use the following RegEx (regular expression) in a query as a sort of "verification rule", where the LIKE operator filters my results:
"^[0]{1}[0-9]{8,9}$"
How can this be accomplished?
I know you were not asking about VBA, but maybe you will give it a chance.
Open a VBA project, insert a new module, then pick Tools -> References and add a reference to Microsoft VBScript Regular Expressions 5.5. Then paste the code below into the newly inserted module.
Function my_regexp(ByRef sIn As String, ByVal mypattern As String) As String
Dim r As New RegExp
Dim colMatches As MatchCollection
With r
.Pattern = mypattern
.IgnoreCase = True
.Global = False
.MultiLine = False
Set colMatches = .Execute(sIn)
End With
If colMatches.Count > 0 Then
my_regexp = colMatches(0).Value
Else
my_regexp = ""
End If
End Function
Now you can use the function above in your SQL queries. Your question would then be solved by invoking:
SELECT my_regexp(some_variable, "^[0]{1}[0-9]{8,9}$") FROM some_table
It will return an empty string if nothing is matched.
Hope you liked it.
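If the goal is to filter rows rather than return the matched text, a Boolean variant of the same idea could be used directly in a WHERE clause. This is a sketch: RegexIsMatch, some_field and some_table are hypothetical names, and it assumes the query runs inside Access itself, where VBA functions in a standard module are visible to queries.

```vba
Public Function RegexIsMatch(ByVal value As Variant, ByVal pattern As String) As Boolean
    ' Static so the RegExp object is created once, not once per row
    Static re As Object
    ' Null fields are treated as "no match" (function returns False)
    If IsNull(value) Then Exit Function
    If re Is Nothing Then Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    RegexIsMatch = re.Test(CStr(value))
End Function

' Hypothetical query:
'   SELECT * FROM some_table WHERE RegexIsMatch(some_field, "^0[0-9]{8,9}$")
```

Accepting a Variant rather than a String lets the function cope with Null values, which would otherwise raise an error when passed from a query.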
I don't think Access allows regex matches (except in VBA, but that's not what you're asking). The LIKE operator doesn't even support alternation.
Therefore you need to split it up into two expressions.
... WHERE (Blah LIKE "0#########") OR (Blah LIKE "0########")
(# means "a single digit" in Access).