Keep trailing zeroes in regex matching formula - regex

I have a function I've written to handle calculations of percent reductions with significant digits, and I'm having a problem with keeping trailing zeroes.
The function:
Function RegexReduction(IValue As Double, EValue As Double) As String
Dim TempPercent As Double
Dim TempString As String
Dim NumFormat As String
Dim DecPlaces As Long
Dim regex As Object
Dim rxMatches As Object
TempPercent = (1 - EValue / IValue)
NumFormat = "0"
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "([^1-8])*[0-8]{1}[0-9]?"
.Global = False
End With
Set rxMatches = regex.Execute(CStr(TempPercent))
If rxMatches.Count <> 0 Then
TempString = rxMatches.Item(0)
DecPlaces = Len(Split(TempString, ".")(1)) - 2
If DecPlaces > 0 Then NumFormat = NumFormat & "." & String(DecPlaces, "0")
End If
RegexReduction = Format(TempPercent, NumFormat & "%")
End Function
This trims percentages to two digits after any leading zeroes or nines:
99.999954165% -> 99.99954%
34.564968% -> 35%
0.000516% -> 0.00052%
The one problem I've found isn't related to the regex, but to Excel's rounding:
99.50% -> 99.5%
Is there a solution that will save trailing zeroes that could be implemented here?

I suggest a version of your function that uses LTrim in combination with Replace instead of the (costly) regular expression to calculate the value of DecPlaces. The calculation of DecPlaces becomes a one-liner.
The rest of the code is the same, except for an additional call to CDec to prevent CStr from returning scientific notation (like 1.123642E-12) when the value is tiny.
Function Reduction(IValue As Double, EValue As Double) As String
Dim TempPercent As Double
Dim TempString As String
Dim NumFormat As String
Dim DecPlaces As Long
TempPercent = (1 - EValue / IValue)
' Apply CDec so tiny numbers do not get scientific notation
TempString = CStr(CDec(TempPercent))
' Count number of significant digits present by trimming away all other chars,
' and subtract from total length to get number of decimals to display
DecPlaces = Len(TempString) - 2 - _
Len(LTrim(Replace(Replace(Replace(TempString, "0"," "), "9"," "), "."," ")))
' Prepare format of decimals, if any
If DecPlaces > 0 Then NumFormat = "." & String(DecPlaces, "0")
' Apply format
Reduction = Format(TempPercent, "0" & NumFormat & "%")
End Function
It is assumed that TempPercent evaluates to a value between 0 and 1.
Comments on your code
You wrote:
The one problem I've found isn't related to the regex, but to Excel's rounding:
99.50% -> 99.5%
This is actually not related to Excel's rounding. In your code the following
DecPlaces = Len(Split(TempString, ".")(1)) - 2
will evaluate to Len(Split("0.995", ".")(1)) - 2, which is 1, and so the format you apply is 0.0%, explaining the output you get.
Also realise that although you have a capturing group in your regular expression, you do not actually use it. rxMatches.Item(0) will give you the complete matched string, not only the match with the capture group.
You apply a number format of 0% when the regular expression does not yield a match. Any number that has no other digits than 0 and 9 will not match. For instance, 0.099 should be displayed with format 0.000% to give 9.900%, but the format used is 0% as you have no Else block treating this case.
Finally, CStr can turn numbers into scientific notation, which will give wrong results as well. It seems with CDec this can be avoided.
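To see that the missing zero comes from the format string rather than from rounding, here is a minimal sketch (the Sub name is made up for illustration). It formats the same value with one and with two decimal places; the outputs in the comments assume a locale that uses the dot as decimal separator.

```vba
Sub ShowPercentFormats()
    ' Same value, two different format strings
    Debug.Print Format(0.995, "0.0%")    ' prints 99.5%
    Debug.Print Format(0.995, "0.00%")   ' prints 99.50%
End Sub
```

So the trailing zero survives as long as DecPlaces comes out as 2 for this input.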

Here is a UDF that attempts to 'read' the incoming raw (non-percentage) value in order to determine the number of decimal places to include.
Function udf_Specific_Scope(rng As Range)
Dim i As Long, str As String
str = rng.Value2 'raw value is 0.999506 for 99.9506%
For i = 1 To Len(str) - 1
If Asc(Mid(str, i, 1)) <> 48 And _
Asc(Mid(str, i, 1)) <> 57 And _
Asc(Mid(str, i, 1)) <> 46 Then _
Exit For
Next i
If InStr(1, str, Chr(46)) < i - 1 Then
udf_Specific_Scope = Val(Format(rng.Value2 * 100, "0." & String(i - 3, Chr(48)))) & Chr(37)
Else
udf_Specific_Scope = Format(rng.Value2, "0%")
End If
End Function
    
The disadvantage here is removing the numerical value from the cell entry but that mirrors your original RegEx method. Ideally, something like the above could be written as a sub based on the Application.Selection property. Just highlight (aka Select) some cells, run the sub and it assigns a cell number format with the correct number of decimals to each in the selection.
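A rough sketch of that Selection-based idea might look like the following. It is not the UDF above rewritten; the names LeadingZerosOrNines and FormatSelectionAsPercent are hypothetical, and it assumes every cell to be formatted holds a raw fraction strictly between 0 and 1. Instead of replacing the cell value with text, it only sets the cell's NumberFormat, so the number stays a number.

```vba
' Counts the 0s and 9s that directly follow the decimal separator,
' e.g. 0.995 -> 2, 0.34564968 -> 0, 0.00000516 -> 5.
Function LeadingZerosOrNines(ByVal fraction As Double) As Long
    Dim s As String, ch As String
    Dim i As Long
    s = CStr(CDec(fraction))        ' CDec avoids scientific notation for tiny values
    For i = 3 To Len(s)             ' skip the leading "0" and the decimal separator
        ch = Mid$(s, i, 1)
        If ch <> "0" And ch <> "9" Then Exit For
        LeadingZerosOrNines = LeadingZerosOrNines + 1
    Next i
End Function

' Select some cells, then run this Sub.
Sub FormatSelectionAsPercent()
    Dim cell As Range
    Dim decPlaces As Long
    If TypeName(Selection) <> "Range" Then Exit Sub
    For Each cell In Selection.Cells
        If Not IsEmpty(cell.Value2) And IsNumeric(cell.Value2) Then
            If cell.Value2 > 0 And cell.Value2 < 1 Then
                decPlaces = LeadingZerosOrNines(cell.Value2)
                If decPlaces > 0 Then
                    cell.NumberFormat = "0." & String(decPlaces, "0") & "%"
                Else
                    cell.NumberFormat = "0%"
                End If
            End If
        End If
    Next cell
End Sub
```

Cells outside the assumed 0 to 1 range, empty cells and text cells are simply left alone.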

Related

Extract first floating point number from right in excel string

I have an excel column full of strings, from which I am trying to extract one number.
Here is an example of a particular row (all rows follow this format):
5) something here 93 4. something else- here too(24+Mths) Y Y 249 5) 24+ Months 1) lots more rubbish text Y N some more rubbish text 24/04/2012 25/04/1999 0.263 10 L rubbish text 3521.37233 4130 rubbish text1041023.
I just need to extract the first decimal number from the right, in this case 3521.37233.
UPDATE: I tried using Text to Columns with space as a delimiter, but there are varying number of spaces between characters. Is there a way to delimit by any number of spaces?
This can be done swiftly with a regex. Unfortunately, Excel does not support regex in worksheet formulas.
You can use the following UDF (add this to your workbook).
Usage:
if you want the last decimal number(i.e. 1st from the right): =StrRegex([cell reference],"[0-9]{1,}\.[0-9]{1,}",-1)
if you want all decimal numbers: =StrRegex([cell reference],"[0-9]{1,}\.[0-9]{1,}",0)
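Function StrRegex(findIn As String, pattern As String, Optional matchID As Long = 1, Optional separator As String = ",", Optional ignoreCase As Boolean = False) As Variant
' matchID is 1-based; 0 returns all matches; negative values count from the right (-1 = last match)
' The function and result are Variants so that CVErr(xlErrNA) can be passed back to the cell
Application.Volatile (True)
Dim result As Variant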
Dim allMatches As Object
Dim re As Object
Set re = CreateObject("vbscript.regexp")
Dim mc As Long
Dim i As Long
Dim j As Long
re.pattern = pattern
re.Global = True
re.ignoreCase = ignoreCase
Set allMatches = re.Execute(findIn)
mc = allMatches.count
If mc > 0 Then
If matchID > mc Then
result = CVErr(xlErrNA)
Else
If matchID > 0 Then
result = allMatches.Item(matchID - 1).Value
ElseIf matchID < 0 Then
result = allMatches.Item(mc + matchID).Value
Else
result = ""
For i = 0 To allMatches.count - 1
result = result & separator & allMatches.Item(i).Value
For j = 0 To allMatches.Item(i).submatches.count - 1
result = result & separator & allMatches.Item(i).submatches.Item(j)
Next
Next
If Len(result) <> 0 Then
result = Right(result, Len(result) - Len(separator))
End If
End If
End If
Else
result = ""
End If
StrRegex = result
End Function
For any interested in a native Excel function solution, if you have the FILTERXML function, you can use:
=FILTERXML("<t><s>" & SUBSTITUTE(A1," ","</s><s>")& "</s></t>","//s[number(.) = number(.) and contains(.,'.')][last()]")
The xPath looks for all nodes that are numeric, and also contain a dot, and then returns the last node that meets those specifications.
Note: If your Windows regional settings are using the dot as a thousands separator, this will not work as written. You would have to replace the . with your system decimal separator.
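On the question's update (delimiting by any number of spaces): a plain VBA alternative without regex is to Split on a single space and ignore the empty tokens that runs of spaces produce. The sketch below is only one way to do it; the function name LastDecimalNumber is hypothetical, and it assumes the number you want is a whitespace-delimited token made of digits with exactly one dot between them.

```vba
Function LastDecimalNumber(ByVal source As String) As String
    Dim tokens() As String
    Dim tok As String
    Dim i As Long
    tokens = Split(source, " ")     ' consecutive spaces just yield empty tokens
    ' Walk the tokens from right to left and return the first one that qualifies
    For i = UBound(tokens) To LBound(tokens) Step -1
        tok = tokens(i)
        ' digit-dot-digit somewhere, nothing but digits and dots, exactly one dot
        If tok Like "*#.#*" And Not tok Like "*[!0-9.]*" _
           And Len(tok) - Len(Replace(tok, ".", "")) = 1 Then
            LastDecimalNumber = tok
            Exit Function
        End If
    Next i
    ' No qualifying token: the function returns an empty string
End Function
```

Used as =LastDecimalNumber(A1) on the example row, this should return 3521.37233, because the trailing token text1041023. contains letters and 4130 has no dot.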

Extract largest numeric sequence from string (regex, or?)

I have strings similar to the following:
4123499-TESCO45-123
every99999994_54
And I want to extract the largest numeric sequence in each string, respectively:
4123499
99999994
I have previously tried regex (I am using VB6)
Set rx = New RegExp
rx.Pattern = "[^\d]"
rx.Global = True
StringText = rx.Replace(StringText, "")
Which gets me partway there, but it only removes the non-numeric values, and I end up with the first string looking like:
412349945123
Can I find a regex that will give me what I require, or will I have to try another method? Essentially, my pattern would have to be anything that isn't the longest numeric sequence. But I'm not actually sure if that is even a reasonable pattern. Could anyone with a better handle of regex tell me if I am going down a rabbit hole? I appreciate any help!
You cannot get the result by just a regex. You will have to extract all numeric chunks and get the longest one using other programming means.
Here is an example:
Dim strPattern As String: strPattern = "\d+"
Dim str As String: str = "4123499-TESCO45-123"
Dim regEx As New RegExp
Dim matches As MatchCollection
Dim match As Match
Dim result As String
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = strPattern
End With
Set matches = regEx.Execute(str)
For Each match In matches
' Compare lengths, so the longest chunk found so far is kept
If Len(result) < Len(match.Value) Then result = match.Value
Next
Debug.Print result
The \d+ with RegExp.Global=True will find all digit chunks and then only the longest will be printed after all matches are processed in a loop.
That's not solvable with an RE on its own.
Instead you can simply walk along the string tracking the longest consecutive digit group:
For i = 1 To Len(StringText)
If IsNumeric(Mid$(StringText, i, 1)) Then
a = a & Mid$(StringText, i, 1)
Else
a = ""
End If
If Len(a) > Len(longest) Then longest = a
Next
MsgBox longest
(first result wins a tie)
If the two examples you gave follow a standard, and the strings only come in these two formats:
<long_number>-<some_other_data>-<short_number>
<text><long_number>_<short_number>
then there are some solutions.
However, if you are searching strings in any format for the longest number, these will not work.
Solution 1
([0-9]+)[_-].*
In the first capture group, you should have the longest number for those 2 formats.
Note: This assumes that the longest number will be the first number it encounters with an underscore or a hyphen next to it, matching those two examples given.
Solution 2
\d{6,}
Note: This assumes that the shortest number will never exceed 5 characters in length, and the longest number will never be shorter than 6 characters in length
Please try this.
Pure VB. No external libraries or objects.
No brain-breaking regexp patterns.
No string manipulation inside the loop, so it is fast. ~30 times faster than regexp :)
Easy to adapt to various needs.
For example, to concatenate all digits from the source string into a single string.
Moreover, if the target string is only an intermediate step,
it is possible to work with the numbers only.
Public Sub sb_BigNmb()
Dim sSrc$, sTgt$
Dim taSrc() As Byte, taTgt() As Byte, tLB As Byte, tUB As Byte
Dim s As Byte, t As Byte, tLenMin As Byte
tLenMin = 4
sSrc = "every99999994_54"
sTgt = vbNullString
taSrc = StrConv(sSrc, vbFromUnicode)
tLB = LBound(taSrc)
tUB = UBound(taSrc)
ReDim taTgt(tLB To tUB)
t = 0
For s = tLB To tUB
Select Case taSrc(s)
Case 48 To 57
taTgt(t) = taSrc(s)
t = t + 1
Case Else
If CBool(t) Then Exit For ' *** EXIT FOR ***
End Select
Next
If (t > tLenMin) Then
ReDim Preserve taTgt(tLB To (t - 1))
sTgt = StrConv(taTgt, vbUnicode)
End If
Debug.Print "'" & sTgt & "'"
Stop
End Sub
How to handle sSrc = "ev_1_ery99999994_54" is left for you as an exercise :)
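For the case left open above (a short digit run before the long one), one possible extension is to keep scanning and remember the longest run instead of stopping at the first. This is only a sketch, not the author's code: the name LongestDigitRun is hypothetical, and it assumes the source string contains single-byte (ANSI) characters so that byte positions line up with character positions.

```vba
Function LongestDigitRun(ByVal source As String) As String
    Dim bytes() As Byte
    Dim i As Long
    Dim runStart As Long, runLen As Long
    Dim bestStart As Long, bestLen As Long

    If Len(source) = 0 Then Exit Function
    bytes = StrConv(source, vbFromUnicode)   ' one byte per character (ANSI assumption)

    For i = LBound(bytes) To UBound(bytes)
        If bytes(i) >= 48 And bytes(i) <= 57 Then    ' "0" to "9"
            If runLen = 0 Then runStart = i
            runLen = runLen + 1
            If runLen > bestLen Then                 ' strict ">" means the first run wins a tie
                bestLen = runLen
                bestStart = runStart
            End If
        Else
            runLen = 0
        End If
    Next i

    ' Byte index is 0-based, Mid$ is 1-based
    If bestLen > 0 Then LongestDigitRun = Mid$(source, bestStart + 1, bestLen)
End Function

Sub DemoLongestDigitRun()
    Debug.Print LongestDigitRun("ev_1_ery99999994_54")   ' prints 99999994
    Debug.Print LongestDigitRun("4123499-TESCO45-123")   ' prints 4123499
End Sub
```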

Excel UDF for capturing numbers within characters

I have a variable text field sitting in cell A1 which contains the following:
Text;#Number;#Text;#Number
This format can keep repeating, but the pattern is always Text;#Number.
The numbers can vary from 1 digit to n digits (limit 7)
Example:
Original Value
MyName;#123;#YourName;#3456;#HisName;#78
Required value:
123, 3456, 78
The field is too variable for excel formulas from my understanding.
I tried using regexp but I am a beginner when it comes to coding. if you can break down the code with some explanation text, it would be much appreciated.
I have tried some of the suggestions below and they work perfectly. One more question.
Now that I can split the numbers from the text, is there any way to utilize the code below and add another layer, where we split the numbers into x cells.
For example: once we run the function, if we get 1234, 567 in the same cell, the function would put 1234 in cell B2, and 567 in cell C2. This would keep updating all cells in the same row until the string has exhausted all of the numbers that are retrieved from the function.
Thanks
This is John Coleman's suggested method:
Public Function GetTheNumbers(st As String) As String
ary = Split(st, ";#")
GetTheNumbers = ""
For Each a In ary
If IsNumeric(a) Then
If GetTheNumbers = "" Then
GetTheNumbers = a
Else
GetTheNumbers = GetTheNumbers & ", " & a
End If
End If
Next a
End Function
If the pattern is fixed, and the location of the numbers never changes, you can assume the numbers will be located in the even places in the string. This means that in the array result of a split on the source string, you can use the odd indexes of the resulting array. For example in this string "Text;#Number;#Text;#Number" array indexes 1, 3 would be the numbers ("Text(0);#Number(1);#Text(2);#Number(3)"). I think this method is easier and safer to use if the pattern is indeed fixed, as it avoids the need to verify data types.
Public Function GetNums(src As String) As String
Dim arr
Dim i As Integer
Dim result As String
arr = Split(src, ";#") ' Split the string to an array.
result = ""
For i = 1 To UBound(arr) Step 2 ' Loop through the array, starting with the second item, and skipping one item (using Step 2).
result = result & arr(i) & ", "
Next
If Len(result) > 2 Then
GetNums = Left(result, Len(result) - 2) ' Remove the extra ", " at the end of the result string.
Else
GetNums = ""
End If
End Function
The numbers can vary from 1 digit to n digits (limit 7)
None of the other responses seems to take the provided parameters into consideration so I kludged together a true regex solution.
Option Explicit
Option Base 0 '<~~this is the default but I've included it because it has to be 0
Function numsOnly(str As String, _
Optional delim As String = ", ")
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
numsOnly = vbNullString
With rgx
.Global = True
.MultiLine = False
.Pattern = "[0-9]{1,7}"
If .Test(str) Then
Set cmat = .Execute(str)
'resize the nums array to accept the matches
ReDim nums(cmat.Count - 1)
'populate the nums array with the matches
For n = LBound(nums) To UBound(nums)
nums(n) = cmat.Item(n)
Next n
'convert the nums array to a delimited string
numsOnly = Join(nums, delim)
End If
End With
End Function
      
Regexp option that uses Replace
Sub Test()
Debug.Print StrOut("MyName;#123;#YourName;#3456;#HisName;#78")
End Sub
function
Option Explicit
Function StrOut(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "(^|.+?)(\d{1,7})"
.Global = True
If .Test(strIn) Then
StrOut = .Replace(strIn, "$2, ")
StrOut = Left$(StrOut, Len(StrOut) - 2)
Else
StrOut = "Nothing"
End If
End With
End Function
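On the follow-up in the question (putting each number in its own cell to the right): a function called from a worksheet cell generally cannot write to other cells, so a Sub is one way to go. The sketch below is an assumption-laden illustration, not one of the answers above: the names SplitNumbersAcrossRow and DemoSplit are made up, the source cell A2 is a guess based on the B2/C2 example, and it relies on the stated Text;#Number format with numbers of at most 7 digits.

```vba
Sub SplitNumbersAcrossRow(ByVal src As Range)
    Dim parts() As String
    Dim i As Long
    Dim col As Long

    ' Split "MyName;#123;#YourName;#3456" into its pieces
    parts = Split(CStr(src.Cells(1, 1).Value), ";#")

    col = 1                                  ' first target is the cell right of the source
    For i = LBound(parts) To UBound(parts)
        If IsNumeric(parts(i)) Then
            src.Cells(1, 1).Offset(0, col).Value = CLng(parts(i))
            col = col + 1
        End If
    Next i
End Sub

Sub DemoSplit()
    ' Assumed layout: the text sits in A2, numbers go to B2, C2, ...
    SplitNumbersAcrossRow ActiveSheet.Range("A2")
End Sub
```

Note that it overwrites whatever is already in the cells to the right and does not clear leftovers from a previous, longer result.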

Excel VBA to find and mask PAN data using regex for PCI DSS compliance

Because most of the tools that discover credit card data in file systems do no more than list the suspicious files, tools are needed to mask any data in files that must be retained.
For Excel files, where loads of credit card data may exist, I figure a macro that detects credit card data in the selected column/row using regex and replaces the middle 6-8 digits with Xs would be useful to many. Sadly, I'm not a guru in the regex macro space.
The code below basically works with regex for 3 card brands only, and works if the PAN is in a cell with other data (e.g. comments fields).
It works, but could be improved. It would be good to improve the regex to make it work for more/all card brands and reduce false positives by including a Luhn algorithm check.
Improvements/Problems remaining :
Match all card brand's PANs with expanded regex
Include Luhn algorithm checking (FIXED - good idea Ron)
Improve the Do While logic (FIXED by stribizhev)
Even better handling of cells that don't contain PANs (FIXED)
Here's what I have so far which seems to be working ok for AmEx, Visa and Mastercard:
Sub PCI_mask_card_numbers()
' Written to mask credit card numbers in excel files in accordance with PCI DSS.
' Highlight the credit card data in the Excel sheet, then run this macro.
Dim strPattern As String: strPattern = "([4][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([5][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{2})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{6})([^a-zA-Z0-9_]?[0-9]{5})"
' The regex patterns for PANs above are broken into multiple parts (between the brackets).
' As such, when the regex matches, the first part of a PAN will land in one of rMatch(k).SubMatches(#) where # is 0, 4, 8, 12, 16, 20 or 24.
' Visa starts with a 4 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' MasterCard starts with a 5 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' AmEx starts with a 3 and is 15 digits long. Typically the pattern is 4-6-5, but data entry seems inconsistent
Dim strReplace As String: strReplace = ""
' Late binding: CreateObject avoids having to enable the MS VBS RegEx v5.5 reference manually.
' With that reference enabled, "Dim regEx As New RegExp" could be used instead of the next 2 lines.
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
Dim strInput As String
Dim Myrange As Range
Dim NewPAN As String
Dim Aproblem As String
Dim Masked As Long
Dim Problems As Long
Dim Total As Long
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern ' sets the regex pattern to match the pattern above
End With
Set Myrange = Selection
MsgBox ("The macro will now start masking credit card numbers identified in the selected cells only. If entire columns are selected, each column will take 10-30 seconds to complete. Ditto for Rows.")
For Each cell In Myrange
Total = Total + 1
' Check that the cell is a likely candidate for holding a PAN, not just a long number
If strPattern <> "" _
And cell.HasFormula = False _
And Left(cell.NumberFormat, 1) <> "$" _
And Mid(cell.NumberFormat, 3, 1) <> "$" Then
' cell.NumberFormat = "#"
strInput = cell.Value
' Depending on the data matching the regex pattern, fix it
If regEx.Test(strInput) Then
Set rMatch = regEx.Execute(strInput)
For k = 0 To rMatch.Count - 1
toReplace = rMatch(k).Value
' If the regex matched, replace the PAN based on its regex segment
Select Case 2
Case Is < Len(rMatch(k).SubMatches(0))
strReplace = rMatch(k).SubMatches(0) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(3))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(4))
strReplace = rMatch(k).SubMatches(4) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(7))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(8))
strReplace = rMatch(k).SubMatches(8) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(11))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(12))
strReplace = rMatch(k).SubMatches(12) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(15))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(16))
strReplace = rMatch(k).SubMatches(16) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(19))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(20))
strReplace = rMatch(k).SubMatches(20) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(23))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(24))
strReplace = rMatch(k).SubMatches(24) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(26))
Masked = Masked + 1
Case Else
Aproblem = cell.Value
Problems = Problems + 1
' MsgBox (Aproblem) ' only needed when curious
End Select
If cell.Value <> Aproblem Then
cell.Value = Replace(strInput, toReplace, strReplace)
End If
Next k
Else
' Adds the cell value to a variable to allow the macro to move past the cell
' Once the macro is trusted not to loop forever, the message box can be removed
' MsgBox ("Problem. Regex fail? Bad data = " & Aproblem)
End If
End If
Next cell
' All done, tell the user
MsgBox ("Cardholder data is now masked" & vbCr & vbCr & "Total cells highlighted (including blanks) = " & Total & vbCr & "Cells masked = " & Masked & vbCr & "Possible problem cells = " & Problems & vbCr & "All other cells were ignored")
End Sub
Back from vacation. Here's a simple VBA function that will test for the LUHN algorithm. The argument is a string of the digits; the result is boolean.
It generates a checksum digit and compares that digit with the one in the digit string you feed it.
Option Explicit
Function Luhn(sNum As String) As Boolean
'modulus 10 algorithm for various numbers
Dim X As Long, I As Long, J As Long
For I = Len(sNum) - 1 To 1 Step -2
X = X + DoubleSumDigits(Mid(sNum, I, 1))
If I > 1 Then X = X + Mid(sNum, I - 1, 1)
Next I
If Right(sNum, 1) = (X * 9) Mod 10 Then
Luhn = True
Else
Luhn = False
End If
End Function
Function DoubleSumDigits(L As Long) As Long
Dim X As Long
X = L * 2
If X > 9 Then X = Val(Left(X, 1)) + Val(Right(X, 1))
DoubleSumDigits = X
End Function
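The regex matches in the macro can contain separators (spaces, dashes), so a match has to be reduced to its digits before any checksum test. Here is a minimal sketch of how that could be done. It is a from-scratch version of the usual mod 10 check rather than the Luhn function above, and the names DigitsOnly and PassesLuhn are hypothetical.

```vba
' Keeps only the characters 0-9, e.g. "4111-1111 1111 1111" -> "4111111111111111"
Function DigitsOnly(ByVal s As String) As String
    Dim i As Long, ch As String
    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        If ch Like "#" Then DigitsOnly = DigitsOnly & ch
    Next i
End Function

' Standard mod 10 check: from the right, double every second digit,
' subtract 9 from doubled values above 9, and test the total.
Function PassesLuhn(ByVal candidate As String) As Boolean
    Dim digits As String
    Dim i As Long, d As Long, total As Long
    Dim doubleIt As Boolean

    digits = DigitsOnly(candidate)
    If Len(digits) < 2 Then Exit Function    ' too short to be meaningful: returns False

    For i = Len(digits) To 1 Step -1
        d = CLng(Mid$(digits, i, 1))
        If doubleIt Then
            d = d * 2
            If d > 9 Then d = d - 9
        End If
        total = total + d
        doubleIt = Not doubleIt
    Next i

    PassesLuhn = (total Mod 10 = 0)
End Function

Sub DemoPassesLuhn()
    Debug.Print PassesLuhn("4111 1111 1111 1111")   ' prints True (a common test number)
    Debug.Print PassesLuhn("4111 1111 1111 1112")   ' prints False
End Sub
```

Inside the masking loop, one option would be to mask only when PassesLuhn(rMatch(k).Value) is True and count the rest as possible problem cells.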

VBA code for extracting 3 specific number patterns

I am working in excel and need VBA code to extract 3 specific number patterns. In column A I have several rows of strings which include alphabetical characters, numbers, and punctuation. I need to remove all characters except those found in a 13-digit number (containing only numbers), a ten-digit number (containing only numbers), or a 9-digit number immediately followed by an "x" character. These are isbn numbers.
The remaining characters should be separated by one, and only one, space. So, for the following string found in A1: "There are several books here, including 0192145789 and 9781245687456. Also, the book with isbn 045789541x is included. This book is one of 100000000 copies."
The output should be: 0192145789 9781245687456 045789541x
Note that the number 100000000 should not be included in the output because it does not match any of the three patterns mentioned above.
I'm not opposed to an Excel formula solution instead of VBA, but I assumed that VBA would be cleaner. Thanks in advance.
Here's a VBA function that will do specifically what you've specified
Function ExtractNumbers(inputStr As String) As String
Dim outputStr As String
Dim bNumDetected As Boolean
Dim numCount As Integer
Dim bNumStart As Integer ' name matches the variable actually used below
Dim i As Long
numCount = 0
bNumDetected = False
For i = 1 To Len(inputStr)
If IsNumeric(Mid(inputStr, i, 1)) Then
numCount = numCount + 1
If Not bNumDetected Then
bNumDetected = True
bNumStart = i
End If
If (numCount = 9 And Mid(inputStr, i + 1, 1) = "x") Or _
numCount = 13 And Not IsNumeric(Mid(inputStr, i + 1, 1)) Or _
numCount = 10 And Not IsNumeric(Mid(inputStr, i + 1, 1)) Then
If numCount = 9 Then
outputStr = outputStr & Mid(inputStr, bNumStart, numCount) & "x "
Else
outputStr = outputStr & Mid(inputStr, bNumStart, numCount) & " "
End If
End If
Else
numCount = 0
bNumDetected = False
End If
Next i
ExtractNumbers = Trim(outputStr)
End Function
It's nothing fancy: it just uses string functions to go through your string one character at a time, looking for 9-digit numbers ending with x, 10-digit numbers and 13-digit numbers, and extracts them into a new string.
It's a UDF, so you can use it as a formula in your workbook.
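For comparison, the same three patterns can also be expressed as a regular expression. This is a sketch under a few assumptions: it uses the late-bound VBScript.RegExp object (Windows only), the function name ExtractIsbns is made up, and the pattern is my reading of the requirements (13 digits, 10 digits, or 9 digits followed by a lowercase x, each as a whole word).

```vba
Function ExtractIsbns(ByVal source As String) As String
    Dim re As Object
    Dim m As Object
    Dim result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True          ' find every match, not just the first
    re.IgnoreCase = False     ' only a lowercase "x" counts, as in the question
    ' \b keeps longer digit runs (11, 12, 14+ digits) from matching partially
    re.Pattern = "\b(?:\d{13}|\d{10}|\d{9}x)\b"

    For Each m In re.Execute(source)
        result = result & m.Value & " "
    Next m

    ExtractIsbns = Trim$(result)
End Function
```

For the example string in the question this should return 0192145789 9781245687456 045789541x, with 100000000 skipped because it is nine digits without a trailing x.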