Remove vowels, white spaces and duplicate characters - regex

I'm trying to trim a string by removing any vowels, white space and duplicate characters.
Here's the code I'm using:
Function TrimString(strString As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Global = True
.IgnoreCase = True
.Pattern = "[aeiou\s]"
TrimString = .Replace(strString, vbNullString)
.Pattern = "(.)\1+"
TrimString = .Replace(TrimString, "$1")
End With
End Function
Is there a way to combine both patterns instead of doing this in 2 steps?
Thank you in advance.

This would work:
With objRegex
.Global = True
.IgnoreCase = True
.Pattern = ".*?([^aeiou\s]).*?"
TrimString = .Replace(TrimString, "$1$1")
End With
I'm not familiar with VBA, but if there is a way to return only the matches instead of replacing them in the original string, you could use the following pattern:
[^aeiou\s]
And return this:
$&$&

You have two replacements:
Removing [aeiou\s] matches, e.g. niarararrrrreghtatt turns into nrrrrrrrghttt
Replacing each chunk of identical chars with the first occurrence turns nrrrrrrrghttt into nrght.
That means you need to match the first pattern both as a separate alternative and as a "filler" between the identical chars.
The pattern you may use is:
.pattern = "[aeiou\s]+|([^aeiou\s])(?:[aeiou\s]*\1)+"
TrimString = .Replace(strString, "$1")
See this regex demo.
Details
[aeiou\s]+ - 1+ vowels or whitespace chars
| - or
([^aeiou\s]) - Capturing group 1: any char other than a vowel or a whitespace char
(?:[aeiou\s]*\1)+ - 1 or more occurrences of:
[aeiou\s]* - 0+ vowel or whitespace chars
\1 - backreference to Group 1, its value
Note that . is changed into [^aeiou\s] since the opposite has already been handled with the first alternation branch.
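For reference, here is a minimal sketch of the whole function using the combined pattern in a single pass. The names TrimStringCombined and TestTrimStringCombined are hypothetical, chosen only for this illustration, and the sketch assumes late binding to VBScript.RegExp as in the question.

```vba
' Hypothetical single-pass variant of the asker's TrimString function.
Function TrimStringCombined(strString As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .IgnoreCase = True
        ' Branch 1 removes vowels/whitespace; branch 2 collapses repeats of
        ' the same consonant, even when vowels/whitespace sit between them.
        .Pattern = "[aeiou\s]+|([^aeiou\s])(?:[aeiou\s]*\1)+"
        ' $1 is empty when branch 1 matched, so those chars are simply dropped.
        TrimStringCombined = .Replace(strString, "$1")
    End With
End Function

Sub TestTrimStringCombined()
    Debug.Print TrimStringCombined("niarararrrrreghtatt")   ' prints nrght
End Sub
```

A consonant that is not repeated is never matched by either branch, so it passes through unchanged.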

Related

How to write a regular expression that includes numbers [0-9] and a defined word?

I'm using VBA and struggling to make a regex.replace function to clean my string cells
Example: "Foo World 4563"
What I want: "World"
by replacing the numbers and the word "Foo"
Another example: "Hello World 435 Foo", I want "Hello World"
This is what my code looks like so far:
Public Function Replacement(sInput) As String
Dim regex As New RegExp
With regex
.Global = True
.IgnoreCase = True
End With
regex.Pattern = "[0-9,()/-]+\bfoo\b"
Replacement = regex.Replace(sInput, "")
End Function
You can use
Function Replacement(sInput) As String
Dim regex As New regExp
With regex
.Global = True
.IgnoreCase = True
End With
regex.Pattern = "\s*(?:\bfoo\b|\d+)"
Replacement = Trim(regex.Replace(sInput, ""))
End Function
See the regex demo.
Details:
\s* - zero or more whitespaces
(?:\bfoo\b|\d+) - either a whole word foo or one or more digits.
Note the use of Trim(), it is necessary to remove leading/trailing spaces that may remain after the replacement.
My two cents: capture the preceding whitespace chars, when present, to help prevent possible false positives:
(^|\s+)(?:foo|\d+)(?=\s+|$)
See an online demo.
(^|\s+) - 1st Capture group to assert position is preceded by whitespace or directly at start of string;
(?:foo|\d+) - Non-capture group with the alternation between digits or 'foo';
(?=\s+|$) - Positive lookahead to assert position is followed by whitespace or end-line anchor.
Sub Test()
Dim arr As Variant: arr = Array("Foo World 4563", "Hello World 435 Foo", "There is a 99% chance of false positives which is foo-bar!")
For Each el In arr
Debug.Print Replacement(el)
Next
End Sub
Public Function Replacement(sInput) As String
With CreateObject("vbscript.regexp")
.Global = True
.IgnoreCase = True
.Pattern = "(^|\s+)(?:foo|\d+)(?=\s+|$)"
Replacement = Application.Trim(.Replace(sInput, "$1"))
End With
End Function
Print:
World
Hello World
There is a 99% chance of false positives which is foo-bar!
Here Application.Trim() trims the ends and also collapses runs of multiple spaces left inside your string into a single space.

How do I capture only part of the matched string? Non-capturing group

Below I have a sample test case where I want to grab just the Saturday value if the word Blah appears before it. Below is what I have so far, but for some reason "Blah" ends up included in the match. Any help would be great. Thanks!
Sub regex_practice()
Dim pstring As String
pstring = "foo" & vbCrLf & "Blah" & vbCrLf & vbCrLf & "Saturday"
'MsgBox pstring
Dim regex As Object
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "(?:Blah)((.|\n)*)Saturday"
.Global = True 'If False, would replace only first
End With
Set matches = regex.Execute(pstring)
Of course. A non-capturing group is included in the overall match. What you might be looking for is to grab the appropriate capturing group here.
Change
With regex
.Pattern = "(?:Blah)((.|\n)*)Saturday"
.Global = True 'If False, would replace only first
End With
To
With regex
.Pattern = "Blah[\s\S]*?(Saturday)"
.Global = True 'If False, would replace only first
End With
Then use .SubMatches:
If regex.test(pstring) Then
Set matches = regEx.Execute(pstring)
GetSaturday = matches(0).SubMatches(0)
End If
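Additionally, ((.|\n)*) is rather inefficient, because the alternation inside a quantified group forces the engine to try a branch for every single character; use e.g. [\s\S]*? instead.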
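Since the question's code is cut off, here is a minimal, self-contained sketch of the capture-group approach. The wrapper names GetSaturday (as a function) and TestGetSaturday are assumptions made for this illustration.

```vba
' Hypothetical helper: returns "Saturday" only if "Blah" appears before it.
Function GetSaturday(pstring As String) As String
    Dim matches As Object
    With CreateObject("VBScript.RegExp")
        .Global = False                        ' the first match is enough
        .Pattern = "Blah[\s\S]*?(Saturday)"    ' [\s\S] also matches line breaks
        Set matches = .Execute(pstring)
    End With
    ' SubMatches(0) is capturing group 1; .Value would still include "Blah".
    If matches.Count > 0 Then GetSaturday = matches(0).SubMatches(0)
End Function

Sub TestGetSaturday()
    Debug.Print GetSaturday("foo" & vbCrLf & "Blah" & vbCrLf & vbCrLf & "Saturday")   ' prints Saturday
End Sub
```

When nothing matches, the function returns an empty string.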

Regex - Return Exact Number of Consecutive Digits

I want to return 5 consecutive digits from a string (working in VBA).
Based on this post, I'm using the pattern [^\d]\d{5}[^\d], but this also picks up the single characters immediately before and after the targeted 5 digits and returns h92345W (from "....South92345West").
How can I modify it to return only the 5 consecutive digits: 92345?
Sub RegexTest()
Dim strInput As String
Dim strPattern As String
strInput = "9129 Nor22 999123456 South92345West"
'strPattern = "^\d{5}$" 'No match
strPattern = "[^\d]\d{5}[^\d]" 'Returns additional letter before and after digits
'In this case returns: "h92345W"
MsgBox RegxFunc(strInput, strPattern)
End Sub
Function RegxFunc(strInput As String, regexPattern As String) As String
Dim regEx As New RegExp
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = regexPattern
End With
If regEx.test(strInput) Then
Set matches = regEx.Execute(strInput)
RegxFunc = matches(0).Value
Else
RegxFunc = "not matched"
End If
End Function
([^\d])(\d{5})([^\d])
You can use this regex; the matched digits will be in the 2nd group.
You need to use a group:
"[^\d](\d{5})[^\d]"
And then the number will be in the first group. Not sure about the VBA syntax for grouping.
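In VBA, the captured groups of a match are exposed through its SubMatches collection, which is zero-based, so group 1 is SubMatches(0). A minimal sketch follows; the function name FiveDigits and the test Sub are hypothetical.

```vba
' Hypothetical helper: returns the first run of exactly 5 digits that has a
' non-digit character on both sides.
Function FiveDigits(strInput As String) As String
    Dim matches As Object
    With CreateObject("VBScript.RegExp")
        .Global = False
        .Pattern = "[^\d](\d{5})[^\d]"
        If .Test(strInput) Then
            Set matches = .Execute(strInput)
            ' .Value would be "h92345W"; SubMatches(0) is only group 1.
            FiveDigits = matches(0).SubMatches(0)
        Else
            FiveDigits = "not matched"
        End If
    End With
End Function

Sub TestFiveDigits()
    Debug.Print FiveDigits("9129 Nor22 999123456 South92345West")   ' prints 92345
End Sub
```

One caveat: because [^\d] must consume a character, this pattern misses a 5-digit run at the very start or end of the string. A pattern such as (?:^|\D)(\d{5})(?!\d) should cover those cases as well.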

Regex return seven digit number match only

I've been trying to build a regular expression to extract a 7 digit number from a string but having difficulty getting the pattern correct.
Example string - WO1519641 WO1528113TB WO1530212 TB
Example return - 1519641, 1528113, 1530212
My code I'm using in Excel is...
Private Sub Extract7Digits()
Dim regEx As New RegExp
Dim strPattern As String
Dim strInput As String
Dim strReplace As String
Dim Myrange As Range
Set Myrange = ActiveSheet.Range("A1:A300")
For Each c In Myrange
strPattern = "\D(\d{7})\D"
'strPattern = "(?:\D)(\d{7})(?:\D)"
'strPattern = "(\d{7}(\D\d{7}\D))"
strInput = c.Value
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern
End With
If regEx.test(strInput) Then
Set matches = regEx.Execute(strInput)
For Each Match In matches
s = s & " Word: " & Match.Value & " "
Next
c.Offset(0, 1) = s
Else
s = ""
End If
Next
End Sub
I've tried all 3 patterns in that code, but I end up getting O1519641, O1528113T, O1530212 when using "\D(\d{7})\D". As I understand it now, the () makes no difference because of the way I am storing the matches (Match.Value), whereas I initially thought it meant the expression would return only what was inside the ().
I've been testing things on http://regexr.com/ but I'm still unsure of how to get it to allow the number to be inside the string as WO1528113TB is but only return the numbers. Do I need to run a RegEx on the returned value of the RegEx to exclude the letters the second time around?
I suggest using the following pattern:
strPattern = "(?:^|\D)(\d{7})(?!\d)"
Then, you will be able to access capturing group #1 contents (i.e. the text captured with the (\d{7}) part of the regex) via match.SubMatches(0), and then you may check which value is the largest.
Pattern details:
(?:^|\D) - a non-capturing group (does not create any submatch) matching the start of string (^) or a non-digit (\D)
(\d{7}) - Capturing group 1 matching 7 digits
(?!\d) - a negative lookahead failing the match if there is a digit immediately after the 7 digits.
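As a rough sketch of how the submatches could be collected, the function below joins every captured 7-digit number with commas. The name SevenDigitNumbers and the test Sub are hypothetical and not part of the original code.

```vba
' Hypothetical helper: returns all standalone 7-digit numbers, comma-separated.
Function SevenDigitNumbers(strInput As String) As String
    Dim matches As Object, m As Object
    Dim result As String
    With CreateObject("VBScript.RegExp")
        .Global = True                          ' find every match, not just the first
        .Pattern = "(?:^|\D)(\d{7})(?!\d)"
        Set matches = .Execute(strInput)
    End With
    For Each m In matches
        If Len(result) > 0 Then result = result & ", "
        ' m.Value may include the leading non-digit; SubMatches(0) does not.
        result = result & m.SubMatches(0)
    Next m
    SevenDigitNumbers = result
End Function

Sub TestSevenDigitNumbers()
    Debug.Print SevenDigitNumbers("WO1519641 WO1528113TB WO1530212 TB")
    ' prints 1519641, 1528113, 1530212
End Sub
```

Building the result in a local variable that starts empty on each call also avoids carrying values over from one cell to the next.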

VBA: Submatching regex

I have the following code:
Dim results(1) As String
Dim RE As Object, REMatches As Object
Set RE = CreateObject("vbscript.regexp")
With RE
.MultiLine = False
.Global = True
.IgnoreCase = True
.Pattern = "(.*?)(\[(.*)\])?"
End With
Set REMatches = RE.Execute(str)
results(0) = REMatches(0).submatches(0)
results(1) = REMatches(0).submatches(2)
Basically if I pass in a string "Test" I want it to return an array where the first element is Test and the second element is blank.
If I pass in a string "Test [bar]", the first element should be "Test " and the second element should be "bar".
I can't seem to find any issues with my regex. What am I doing wrong?
You need to add beginning and end of string anchors to your regex:
...
.Pattern = "^(.*?)(\[(.*)\])?$"
...
Without these anchors, the .*? will always match zero characters, and since your group is optional, it will never try to backtrack and match more.
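A minimal sketch of the anchored version wrapped in a function, to show both cases from the question. The names SplitLabel and TestSplitLabel are hypothetical, and the parameter is renamed to strInput for this illustration.

```vba
' Hypothetical helper: element 0 is the text before an optional "[...]" suffix,
' element 1 is the text inside the brackets (empty if there are none).
Function SplitLabel(strInput As String) As Variant
    Dim results(1) As String
    Dim REMatches As Object
    With CreateObject("VBScript.RegExp")
        .MultiLine = False
        .Global = False                 ' the anchors allow one match at most
        .IgnoreCase = True
        .Pattern = "^(.*?)(\[(.*)\])?$"
        Set REMatches = .Execute(strInput)
    End With
    If REMatches.Count > 0 Then
        results(0) = REMatches(0).SubMatches(0)   ' group 1
        results(1) = REMatches(0).SubMatches(2)   ' group 3, inside the brackets
    End If
    SplitLabel = results
End Function

Sub TestSplitLabel()
    Dim parts As Variant
    parts = SplitLabel("Test [bar]")
    Debug.Print "[" & parts(0) & "] [" & parts(1) & "]"   ' prints [Test ] [bar]
    parts = SplitLabel("Test")
    Debug.Print "[" & parts(0) & "] [" & parts(1) & "]"   ' prints [Test] []
End Sub
```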