Extract first floating point number from right in excel string - regex

I have an Excel column full of strings, and I am trying to extract one number from each.
Here is an example of a particular row (all rows follow this format):
5) something here 93 4. something else- here too(24+Mths) Y Y 249 5) 24+ Months 1) lots more rubbish text Y N some more rubbish text 24/04/2012 25/04/1999 0.263 10 L rubbish text 3521.37233 4130 rubbish text1041023.
I just need to extract the first decimal number from the right, in this case 3521.37233.
UPDATE: I tried using Text to Columns with space as a delimiter, but the number of spaces between fields varies. Is there a way to delimit by any number of spaces?

This is easily solved with a regular expression. Unfortunately, Excel does not support regex in worksheet formulas.
You can use the following UDF instead (add it to a standard module in your workbook).
Usage:
If you want the last decimal number (i.e. the first from the right): =StrRegex([cell reference],"[0-9]{1,}\.[0-9]{1,}",-1)
If you want all decimal numbers: =StrRegex([cell reference],"[0-9]{1,}\.[0-9]{1,}",0)
Function StrRegex(findIn As String, pattern As String, Optional matchID As Long = 1, Optional separator As String = ",", Optional ignoreCase As Boolean = False) As Variant
    ' matchID is 1-based; a negative value counts from the right (-1 = last match); 0 returns all matches.
    ' The return type is Variant so that #N/A can be returned when matchID is out of range.
    Application.Volatile (True)
    Dim result As Variant
    Dim allMatches As Object
    Dim re As Object
    Set re = CreateObject("vbscript.regexp")
    Dim mc As Long
    Dim i As Long
    Dim j As Long
    re.pattern = pattern
    re.Global = True
    re.ignoreCase = ignoreCase
    Set allMatches = re.Execute(findIn)
    mc = allMatches.count
    If mc > 0 Then
        If Abs(matchID) > mc Then
            ' Requested match does not exist (counting from either end)
            result = CVErr(xlErrNA)
        Else
            If matchID > 0 Then
                result = allMatches.Item(matchID - 1).Value
            ElseIf matchID < 0 Then
                result = allMatches.Item(mc + matchID).Value
            Else
                ' matchID = 0: join every match (and its submatches) with the separator
                result = ""
                For i = 0 To allMatches.count - 1
                    result = result & separator & allMatches.Item(i).Value
                    For j = 0 To allMatches.Item(i).submatches.count - 1
                        result = result & separator & allMatches.Item(i).submatches.Item(j)
                    Next
                Next
                If Len(result) <> 0 Then
                    ' Drop the leading separator
                    result = Right(result, Len(result) - Len(separator))
                End If
            End If
        End If
    Else
        result = ""
    End If
    StrRegex = result
End Function

For anyone interested in a native Excel solution, if you have the FILTERXML function, you can use:
=FILTERXML("<t><s>" & SUBSTITUTE(A1," ","</s><s>")& "</s></t>","//s[number(.) = number(.) and contains(.,'.')][last()]")
The XPath expression selects every node that is numeric and contains a dot, then returns the last node that meets both conditions. The test number(.) = number(.) is true only for numeric nodes, because a non-numeric node converts to NaN and NaN never equals itself.
Note: If your Windows regional settings are using the dot as a thousands separator, this will not work as written. You would have to replace the . with your system decimal separator.
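The same token-based idea (split on spaces, walk from the right, keep the first token that is numeric and contains a dot) can also be sketched as a UDF. This is only a minimal sketch and not part of the original answers: the name LastDecimalToken is hypothetical, and because IsNumeric and CDbl follow the regional settings, it assumes the dot is the system decimal separator.

```vba
Option Explicit

' Hypothetical UDF: returns the right-most space-delimited token that is
' numeric and contains a dot, or #N/A when there is none.
Public Function LastDecimalToken(ByVal s As String) As Variant
    Dim tokens() As String
    Dim i As Long

    ' Runs of spaces just produce empty tokens, which fail the tests below.
    tokens = Split(s, " ")

    For i = UBound(tokens) To LBound(tokens) Step -1
        If InStr(tokens(i), ".") > 0 Then
            ' IsNumeric is locale-dependent (assumption: "." is the decimal separator).
            If IsNumeric(tokens(i)) Then
                LastDecimalToken = CDbl(tokens(i))
                Exit Function
            End If
        End If
    Next i

    LastDecimalToken = CVErr(xlErrNA)
End Function
```

Used as =LastDecimalToken(A1), this should skip a trailing token such as text1041023. (it contains a dot but is not numeric) and stop at the first real decimal number from the right.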

Related

Removing leading whitespace using VBA

I am trying to remove leading whitespace from a word " 00000000000000231647300000000002KK".
Below is my VBA code
Option Explicit
Sub myfunction()
Dim getarray, getarray1 As Variant
Dim Text As String
Dim RegularText
getarray = Sheets("Sheet1").Range("A1:A4").Value
getarray1 = getarray
Set RegularText = New regexp
RegularText.Global = True
RegularText.MultiLine = True
RegularText.Pattern = "(^\\s+)"
Text = CStr(getarray(1, 1))
getarray1(1, 1) = RegularText.Replace(getarray(1, 1), "")
Sheets("Sheet1").Range("B1:B4").Value = getarray1
End Sub
However above code fails to remove the leading whitespace from my word.
Below is the excel workbook with result and above code
https://easyupload.io/jv6n2p
If you could help to understand why my code is failing to remove leading whitespace, it will be very helpful.
Thanks for your time
There are a few things wrong with the original code.
RegularText.Pattern = "(^\\s+)"
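Explanations from regex101.com:
(^\\s+) pattern: the first backslash escapes the second one, which tells the regex engine to treat the second \ as a literal character. The pattern therefore matches a literal backslash followed by one or more "s" characters at the start of the string, not whitespace.
(^\s+) pattern: \s is the whitespace class, so this matches one or more whitespace characters at the start of the string.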
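The difference between the two patterns can be checked directly in the Immediate window. The following is a small illustrative sketch (the Sub name is made up); it relies on the fact that VBA string literals have no backslash escapes, so the regex engine receives the pattern exactly as typed.

```vba
Option Explicit

Sub DemoBackslashEscaping()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")

    ' Double backslash: a literal "\" followed by one or more "s" characters.
    re.Pattern = "^\\s+"
    Debug.Print re.Test("   abc")       ' False: leading spaces do not match
    Debug.Print re.Test("\ss abc")      ' True: starts with "\" and then "s" characters

    ' Single backslash: one or more whitespace characters.
    re.Pattern = "^\s+"
    Debug.Print re.Test("   abc")       ' True
    Debug.Print re.Replace("   abc", "") ' abc
End Sub
```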
RegularText.MultiLine = True
The MultiLine property makes ^ and $ match at the start and end of every line within a single string; it has nothing to do with the rows of an array. That doesn't seem to be what is intended here, so set it to False.
RegularText.MultiLine = False
Range("A1:A4").Value returns a 2-D array of 4 rows by 1 column, and Range("B1:B4") has the same shape, so the array can be written back as-is. The original code only cleans element (1, 1), though; every element has to be processed in a loop.
Sub RegExRemoveLeadingSpace()
    ' Early binding: requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim Data As Variant
    Data = Sheets("Sheet1").Range("A1:A4").Value
    Dim RegularText As New RegExp
    RegularText.Global = False
    RegularText.Pattern = "(^\s+)"
    Dim r As Long, c As Long
    ' Strip the leading whitespace from every element of the array
    For r = 1 To UBound(Data)
        For c = 1 To UBound(Data, 2)
            Data(r, c) = RegularText.Replace(Data(r, c), "")
        Next
    Next
    Sheets("Sheet1").Range("B1:B4").Value = Data
End Sub
We could also just use LTrim() to remove the leading spaces from the string. Note that LTrim only strips space characters, not tabs or line breaks.
Sub LTrimLeadingSpace()
    Dim Data As Variant
    Data = Sheets("Sheet1").Range("A1:A4").Value
    Dim r As Long, c As Long
    For r = 1 To UBound(Data)
        For c = 1 To UBound(Data, 2)
            Data(r, c) = LTrim(Data(r, c))
        Next
    Next
    Sheets("Sheet1").Range("B1:B4").Value = Data
End Sub

Excel UDF for capturing numbers within characters

I have a variable text field sitting in cell A1 which contains the following:
Text;#Number;#Text;#Number
This format can keep repeating, but the pattern is always Text;#Number.
The numbers can vary from 1 digit to n digits (limit 7)
Example:
Original Value
MyName;#123;#YourName;#3456;#HisName;#78
Required value:
123, 3456, 78
The field is too variable for excel formulas from my understanding.
I tried using regexp but I am a beginner when it comes to coding. If you could break down the code with some explanatory text, it would be much appreciated.
I have tried some of the suggestions below and they work perfectly. One more question.
Now that I can split the numbers from the text, is there any way to extend the code below with another layer that splits the numbers into separate cells?
For example: if the function returns 1234, 567 in one cell, it would put 1234 in cell B2 and 567 in cell C2, and keep filling cells in the same row until all of the numbers retrieved from the string are used up.
Thanks
This is John Coleman's suggested method:
Public Function GetTheNumbers(st As String) As String
    Dim ary As Variant, a As Variant
    ' Split on the ";#" delimiter, then keep only the numeric pieces
    ary = Split(st, ";#")
    GetTheNumbers = ""
    For Each a In ary
        If IsNumeric(a) Then
            If GetTheNumbers = "" Then
                GetTheNumbers = a
            Else
                GetTheNumbers = GetTheNumbers & ", " & a
            End If
        End If
    Next a
End Function
If the pattern is fixed, and the location of the numbers never changes, you can assume the numbers will be located in the even places in the string. This means that in the array result of a split on the source string, you can use the odd indexes of the resulting array. For example in this string "Text;#Number;#Text;#Number" array indexes 1, 3 would be the numbers ("Text(0);#Number(1);#Text(2);#Number(3)"). I think this method is easier and safer to use if the pattern is indeed fixed, as it avoids the need to verify data types.
Public Function GetNums(src As String) As String
Dim arr
Dim i As Integer
Dim result As String
arr = Split(src, ";#") ' Split the string to an array.
result = ""
For i = 1 To UBound(arr) Step 2 ' Loop through the array, starting with the second item, and skipping one item (using Step 2).
result = result & arr(i) & ", "
Next
If Len(result) > 2 Then
GetNums = Left(result, Len(result) - 2) ' Remove the extra ", " at the end of the result string.
Else
GetNums = ""
End If
End Function
The numbers can vary from 1 digit to n digits (limit 7)
None of the other responses seems to take the provided parameters into consideration so I kludged together a true regex solution.
Option Explicit
Option Base 0 '<~~this is the default but I've included it because it has to be 0
Function numsOnly(str As String, _
Optional delim As String = ", ")
Dim n As Long, nums() As Variant
Static rgx As Object, cmat As Object
'with rgx as static, it only has to be created once; beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
numsOnly = vbNullString
With rgx
.Global = True
.MultiLine = False
.Pattern = "[0-9]{1,7}"
If .Test(str) Then
Set cmat = .Execute(str)
'resize the nums array to accept the matches
ReDim nums(cmat.Count - 1)
'populate the nums array with the matches
For n = LBound(nums) To UBound(nums)
nums(n) = cmat.Item(n)
Next n
'convert the nums array to a delimited string
numsOnly = Join(nums, delim)
End If
End With
End Function
      
Regexp option that uses Replace
Sub Test()
Debug.Print StrOut("MyName;#123;#YourName;#3456;#HisName;#78")
End Sub
Function:
Option Explicit
Function StrOut(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "(^|.+?)(\d{1,7})"
.Global = True
If .Test(strIn) Then
StrOut = .Replace(strIn, "$2, ")
StrOut = Left$(StrOut, Len(StrOut) - 2)
Else
StrOut = "Nothing"
End If
End With
End Function
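None of the answers above covers the follow-up about spreading the numbers across cells in the same row. A UDF called from a worksheet cell cannot write to other cells, so one option is a Sub. The following is a minimal sketch, not a definitive implementation: the names SpreadNumbers and DemoSpreadNumbers are hypothetical, and it assumes the fixed Text;#Number pattern and a source string in A2 of the active sheet.

```vba
Option Explicit

' Hypothetical helper: writes each number found in src into consecutive cells,
' starting at firstCell and moving right along the same row.
Public Sub SpreadNumbers(ByVal src As String, ByVal firstCell As Range)
    Dim parts() As String
    Dim i As Long
    Dim offsetCol As Long

    parts = Split(src, ";#")            ' Text(0), Number(1), Text(2), Number(3), ...

    For i = 1 To UBound(parts) Step 2   ' odd indexes hold the numbers
        If IsNumeric(parts(i)) Then     ' skip anything that breaks the pattern
            firstCell.Offset(0, offsetCol).Value = CLng(parts(i))
            offsetCol = offsetCol + 1
        End If
    Next i
End Sub

Sub DemoSpreadNumbers()
    ' Assumption: the source string is in A2; the numbers go to B2, C2, ...
    SpreadNumbers CStr(Range("A2").Value), Range("B2")
End Sub
```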

Keep trailing zeroes in regex matching formula

I have a function I've written to handle calculations of percent reductions with significant digits, and I'm having a problem with keeping trailing zeroes.
The function:
Function RegexReduction(IValue As Double, EValue As Double) As String
Dim TempPercent As Double
Dim TempString As String
Dim NumFormat As String
Dim DecPlaces As Long
Dim regex As Object
Dim rxMatches As Object
TempPercent = (1 - EValue / IValue)
NumFormat = "0"
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "([^1-8])*[0-8]{1}[0-9]?"
.Global = False
End With
Set rxMatches = regex.Execute(CStr(TempPercent))
If rxMatches.Count <> 0 Then
TempString = rxMatches.Item(0)
DecPlaces = Len(Split(TempString, ".")(1)) - 2
If DecPlaces > 0 Then NumFormat = NumFormat & "." & String(DecPlaces, "0")
End If
RegexReduction = Format(TempPercent, NumFormat & "%")
End Function
This trims percentages to two digits after any leading zeroes or nines:
99.999954165% -> 99.99954%
34.564968% -> 35%
0.000516% -> 0.00052%
The one problem I've found isn't related to the regex, but to Excel's rounding:
99.50% -> 99.5%
Is there a solution that will save trailing zeroes that could be implemented here?
I suggest a version of your function that uses LTrim in combination with Replace instead of the (costly) regular expression to calculate the value of DecPlaces. The calculation of DecPlaces has become a "one-liner".
The rest of the code is the same, except for an additional call to CDec that prevents CStr from returning scientific notation (like 1.123642E-12) when the value is tiny.
Function Reduction(IValue As Double, EValue As Double) As String
Dim TempPercent As Double
Dim TempString As String
Dim NumFormat As String
Dim DecPlaces As Long
TempPercent = (1 - EValue / IValue)
' Apply CDec so tiny numbers do not get scientific notation
TempString = CStr(CDec(TempPercent))
' Count number of significant digits present by trimming away all other chars,
' and subtract from total length to get number of decimals to display
DecPlaces = Len(TempString) - 2 - _
Len(LTrim(Replace(Replace(Replace(TempString, "0"," "), "9"," "), "."," ")))
' Prepare format of decimals, if any
If DecPlaces > 0 Then NumFormat = "." & String(DecPlaces, "0")
' Apply format
Reduction = Format(TempPercent, "0" & NumFormat & "%")
End Function
It is assumed that TempPercent evaluates to a value between 0 and 1.
Comments on your code
You wrote:
The one problem I've found isn't related to the regex, but to Excel's rounding:
99.50% -> 99.5%
This is actually not related to Excel's rounding. In your code the following
DecPlaces = Len(Split(TempString, ".")(1)) - 2
will evaluate to Len(Split("0.995", ".")(1)) - 2, which is 1, and so the format you apply is 0.0%, explaining the output you get.
Also realise that although you have a capturing group in your regular expression, you do not actually use it. rxMatches.Item(0) will give you the complete matched string, not only the match with the capture group.
You apply a number format of 0% for the case the regular expression does not yield a match. Any number that has no other digits than 0 and 9 will not match. For instance 0.099 should be displayed with format 0.000% to give 9.900, but the format used is 0% as you have no Else block treating this case.
Finally, CStr can turn numbers into scientific notation, which will give wrong results as well. It seems with CDec this can be avoided.
Here is a UDF that attempts to 'read' the incoming raw (non-percentage) value in order to determine the number of decimal places to include.
Function udf_Specific_Scope(rng As Range)
Dim i As Long, str As String
str = rng.Value2 'raw value is 0.999506 for 99.9506%
For i = 1 To Len(str) - 1
If Asc(Mid(str, i, 1)) <> 48 And _
Asc(Mid(str, i, 1)) <> 57 And _
Asc(Mid(str, i, 1)) <> 46 Then _
Exit For
Next i
If InStr(1, str, Chr(46)) < i - 1 Then
udf_Specific_Scope = Val(Format(rng.Value2 * 100, "0." & String(i - 3, Chr(48)))) & Chr(37)
Else
udf_Specific_Scope = Format(rng.Value2, "0%")
End If
End Function
    
The disadvantage here is removing the numerical value from the cell entry but that mirrors your original RegEx method. Ideally, something like the above could be written as a sub based on the Application.Selection property. Just highlight (aka Select) some cells, run the sub and it assigns a cell number format with the correct number of decimals to each in the selection.
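A sub along those lines could look roughly like this. It is only a sketch under stated assumptions: the name FormatSelectionAsPercent is made up, the selected cells are assumed to hold raw values between 0 and 1, and CStr is assumed to use the dot as decimal separator (it follows the regional settings). The decimal count borrows the LTrim/Replace calculation from the Reduction function above.

```vba
Option Explicit

' Hypothetical macro: gives each selected cell a percent format with enough
' decimals to show two digits after any leading zeroes or nines.
Sub FormatSelectionAsPercent()
    Dim c As Range
    Dim s As String
    Dim decPlaces As Long
    Dim fmt As String

    If TypeName(Selection) <> "Range" Then Exit Sub

    For Each c In Selection.Cells
        If Not IsEmpty(c.Value2) And IsNumeric(c.Value2) Then
            ' CDec keeps CStr from producing scientific notation for tiny values.
            s = CStr(CDec(c.Value2))
            ' Blank out 0, 9 and "." so that LTrim strips the leading run of them;
            ' what remains starts at the first significant digit.
            decPlaces = Len(s) - 2 - _
                Len(LTrim(Replace(Replace(Replace(s, "0", " "), "9", " "), ".", " ")))
            fmt = "0"
            If decPlaces > 0 Then fmt = fmt & "." & String(decPlaces, "0")
            ' The cell keeps its numeric value; only the display format changes.
            c.NumberFormat = fmt & "%"
        End If
    Next c
End Sub
```

Because only NumberFormat is changed, the cells stay numeric and can still be used in calculations.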

Excel VBA to find and mask PAN data using regex for PCI DSS compliance

Because most tools that discover credit card data in file systems do no more than list the suspicious files, tools are needed to mask any data in files that must be retained.
For Excel files, where loads of credit card data may exist, I figure a macro that detects credit card data in the selected column/row using regex and replaces the middle 6-8 digits with Xs would be useful to many. Sadly, I'm not a guru in the regex macro space.
The code below basically works, with regex for 3 card brands only, and it also works if the PAN is in a cell with other data (e.g. comments fields).
It could be improved, though. It would be good to extend the regex to cover more/all card brands and to reduce false positives by including a Luhn algorithm check.
Improvements/Problems remaining:
Match all card brand's PANs with expanded regex
Include Luhn algorithm checking (FIXED - good idea Ron)
Improve the Do While logic (FIXED by stribizhev)
Even better handling of cells that don't contain PANs (FIXED)
Here's what I have so far which seems to be working ok for AmEx, Visa and Mastercard:
Sub PCI_mask_card_numbers()
' Written to mask credit card numbers in excel files in accordance with PCI DSS.
' Highlight the credit card data in the Excel sheet, then run this macro.
Dim strPattern As String: strPattern = "([4][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([5][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{2})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{4})([^a-zA-Z0-9_]?[0-9]{3})|" & _
"([3][0-9]{3})([^a-zA-Z0-9_]?[0-9]{6})([^a-zA-Z0-9_]?[0-9]{5})"
' Regex patterns for PANs above are broken into multiple parts (between the brackets)
' As such the when regex matches the first part of a PAN will fit into one of rMatch(k).SubMatches(#) where # is 0, 4, 8, 12, 16, 20 or 24.
' Visa start with a 4 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' MasterCard start with a 5 and is 16 digits long. Typically the data entry pattern is four groups of four digits
' AmEx start with a 3 and is 15 digits long. Typically the pattern is 4-6-5, but data entry seems inconsistent
Dim strReplace As String: strReplace = ""
' Dim regEx As New RegExp ' if this line is used instead of the next 2, the MS VBS RegEx v5.5 needs to be enabled manually. The next 2 lines seem to do it from within the script
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
Dim strInput As String
Dim Myrange As Range
Dim NewPAN As String
Dim Aproblem As String
Dim Masked As Long
Dim Problems As Long
Dim Total As Long
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern ' sets the regex pattern to match the pattern above
End With
Set Myrange = Selection
MsgBox ("The macro will now start masking credit card numbers identified in the selected cells only. If entire columns are selected, each column will take 10-30 seconds to complete. Ditto for Rows.")
For Each cell In Myrange
Total = Total + 1
' Check that the cell is a likely candidate for holding a PAN, not just a long number
If strPattern <> "" _
And cell.HasFormula = False _
And Left(cell.NumberFormat, 1) <> "$" _
And Mid(cell.NumberFormat, 3, 1) <> "$" Then
' cell.NumberFormat = "#"
strInput = cell.Value
' Depending on the data matching the regex pattern, fix it
If regEx.Test(strInput) Then
Set rMatch = regEx.Execute(strInput)
For k = 0 To rMatch.Count - 1
toReplace = rMatch(k).Value
' If the regex matched, replace the PAN based on its regex segment
Select Case 2
Case Is < Len(rMatch(k).SubMatches(0))
strReplace = rMatch(k).SubMatches(0) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(3))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(4))
strReplace = rMatch(k).SubMatches(4) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(7))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(8))
strReplace = rMatch(k).SubMatches(8) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(11))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(12))
strReplace = rMatch(k).SubMatches(12) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(15))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(16))
strReplace = rMatch(k).SubMatches(16) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(19))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(20))
strReplace = rMatch(k).SubMatches(20) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(23))
Masked = Masked + 1
Case Is < Len(rMatch(k).SubMatches(24))
strReplace = rMatch(k).SubMatches(24) & "xxxxxxxx" & Trim(rMatch(k).SubMatches(26))
Masked = Masked + 1
Case Else
Aproblem = cell.Value
Problems = Problems + 1
' MsgBox (Aproblem) ' only needed when curious
End Select
If cell.Value <> Aproblem Then
cell.Value = Replace(strInput, toReplace, strReplace)
End If
Next k
Else
' Adds the cell value to a variable to allow the macro to move past the cell
' Once the macro is trusted not to loop forever, the message box can be removed
' MsgBox ("Problem. Regex fail? Bad data = " & Aproblem)
End If
End If
Next cell
' All done, tell the user
MsgBox ("Cardholder data is now masked" & vbCr & vbCr & "Total cells highlighted (including blanks) = " & Total & vbCr & "Cells masked = " & Masked & vbCr & "Possible problem cells = " & Problems & vbCr & "All other cells were ignored")
End Sub
Back from vacation. Here's a simple VBA function that checks a number against the Luhn algorithm. The argument is a string of digits; the result is a Boolean.
It computes the check digit from all digits except the last and compares it with the last digit of the string you feed it.
Option Explicit
Function Luhn(sNum As String) As Boolean
'modulus 10 algorithm for various numbers
Dim X As Long, I As Long, J As Long
For I = Len(sNum) - 1 To 1 Step -2
X = X + DoubleSumDigits(Mid(sNum, I, 1))
If I > 1 Then X = X + Mid(sNum, I - 1, 1)
Next I
If Right(sNum, 1) = (X * 9) Mod 10 Then
Luhn = True
Else
Luhn = False
End If
End Function
Function DoubleSumDigits(L As Long) As Long
Dim X As Long
X = L * 2
If X > 9 Then X = Val(Left(X, 1)) + Val(Right(X, 1))
DoubleSumDigits = X
End Function
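The masking regex allows a separator between the digit groups, so a matched PAN has to be reduced to bare digits before any Luhn check. The following self-contained sketch shows one way to do that. The name PassesLuhn is hypothetical, and it uses the common "sum of all digits is divisible by 10" form of the algorithm rather than the check-digit comparison above.

```vba
Option Explicit

' Hypothetical helper: strips every non-digit from candidate, then applies
' the Luhn test to what remains.
Public Function PassesLuhn(ByVal candidate As String) As Boolean
    Dim digits As String
    Dim i As Long, d As Long, total As Long
    Dim doubleIt As Boolean

    ' Remove spaces, dashes and any other separators.
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "\D"
        digits = .Replace(candidate, "")
    End With
    If Len(digits) < 2 Then Exit Function   ' returns False

    ' Walk from the right, doubling every second digit.
    For i = Len(digits) To 1 Step -1
        d = CLng(Mid$(digits, i, 1))
        If doubleIt Then
            d = d * 2
            If d > 9 Then d = d - 9         ' same as summing the two digits
        End If
        total = total + d
        doubleIt = Not doubleIt
    Next i

    PassesLuhn = (total Mod 10 = 0)
End Function
```

Inside the masking loop it could be called as If PassesLuhn(toReplace) Then ... so that long numbers failing the check are left untouched.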

Separating strings from numbers with Excel VBA

I need to
a) separate strings from numbers for a selection of cells
and
b) place the separated strings and numbers into different columns.
For example, the Excel sheet is as follows:
A1 B1
100CASH etc.etc.
The result should be:
A1 B1 C1
100 CASH etc.etc.
Regular expressions would be useful here, as there may be different cell formats, such as 100-CASH, 100/CASH, 100%CASH. Once the procedure is set up, it won't be hard to adapt the regular expression to different variations.
I came across a UDF for extracting numbers from a cell. This can easily be modified to extract string or other types of data from cells simply changing the regular expression.
But what I need is not just a UDF but a sub procedure to split cells using regular expressions and place the separated data into separate columns.
I've also found a similar question in SU, however it isn't VBA.
See if this will work for you:
UPDATED 11/30:
Sub test()
Dim RegEx As Object
Dim strTest As String
Dim ThisCell As Range
Dim Matches As Object
Dim strNumber As String
Dim strText As String
Dim i As Long ' Long, not Integer: row numbers can exceed 32,767
Dim CurrCol As Integer
Set RegEx = CreateObject("VBScript.RegExp")
' may need to be tweaked
RegEx.Pattern = "-?\d+"
' Get the current column
CurrCol = ActiveCell.Column
Dim lngLastRow As Long
lngLastRow = Cells(1, CurrCol).End(xlDown).Row
' add a new column & shift column 2 to the right
Columns(CurrCol + 1).Insert Shift:=xlToRight
For i = 1 To lngLastRow ' change to number of rows to search
Set ThisCell = ActiveSheet.Cells(i, CurrCol)
strTest = ThisCell.Value
If RegEx.test(strTest) Then
Set Matches = RegEx.Execute(strTest)
strNumber = CStr(Matches(0))
strText = Mid(strTest, Len(strNumber) + 1)
' replace original cell with number only portion
ThisCell.Value = strNumber
' replace cell to the right with string portion
ThisCell.Offset(0, 1).Value = strText
End If
Next
Set RegEx = Nothing
End Sub
How about:
Sub UpdateCells()
Dim rng As Range
Dim c As Range
Dim l As Long
Dim s As String, a As String, b As String
''Working with sheet1 and column C
With Sheet1
l = .Range("C" & .Rows.Count).End(xlUp).Row
Set rng = .Range("C1:C" & l)
End With
''Working with selected range from above
For Each c In rng.Cells
If c <> vbNullString Then
s = FirstNonNumeric(c.Value)
''Split the string into numeric and non-numeric, based
''on the position of first non-numeric, obtained above.
a = Mid(c.Value, 1, InStr(c.Value, s) - 1)
b = Mid(c.Value, InStr(c.Value, s))
''Put the two values on the sheet in positions one and two
''columns further along than the test column. The offset
''can be any suitable value.
c.Offset(0, 1) = a
c.Offset(0, 2) = b
End If
Next
End Sub
Function FirstNonNumeric(txt As String) As String
With CreateObject("VBScript.RegExp")
.Pattern = "[^0-9]"
FirstNonNumeric = .Execute(txt)(0)
End With
End Function
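For the separator variations mentioned in the question (100-CASH, 100/CASH, 100%CASH), a single pattern with two capture groups can return both parts at once. This is a minimal sketch under stated assumptions: the name SplitNumberText is hypothetical, the number is assumed to be an integer at the start of the cell, and any run of non-alphanumeric characters after it is treated as a separator and dropped.

```vba
Option Explicit

' Hypothetical helper: splits "100CASH", "100-CASH", "100/CASH" or "100%CASH"
' into a number part and a text part.
' Returns False when s does not start with a number.
Public Function SplitNumberText(ByVal s As String, _
                                ByRef numPart As String, _
                                ByRef textPart As String) As Boolean
    Dim matches As Object

    With CreateObject("VBScript.RegExp")
        ' Group 1: leading integer; then optional separators; group 2: the rest.
        .Pattern = "^\s*(-?\d+)[^A-Za-z0-9]*(.*)$"
        Set matches = .Execute(s)
    End With

    If matches.Count = 0 Then Exit Function
    numPart = matches(0).SubMatches(0)
    textPart = matches(0).SubMatches(1)
    SplitNumberText = True
End Function

Sub DemoSplitNumberText()
    Dim n As String, t As String
    If SplitNumberText("100-CASH", n, t) Then Debug.Print n, t
End Sub
```

In either Sub above, the two parts could then be written to the sheet with Offset, exactly as the existing code does.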