Regex to get text between two strings

I want to get dynamic six seven-digit numbers as shown below:
id="tid_3660328">
and append them to the end of TextBox1.
In other words, I want to get the number: 3660328
From between: id="tid_
and: ">
My question is how I could do this in VB.NET. My first thought was "regex", which is a topic I have zero experience with. I appreciate the help.
Note: I was thinking I could use the code here but with my own regex: https://stackoverflow.com/a/9332731

This is a good place for using RegEx.
If you only want to find numbers that are exactly seven digits you could use this RegEx pattern:
id="tid_(\d{7})">
Or, if you don't care how many digits it is, you could use this pattern:
id="tid_(\d+)">
Here's what the pattern means:
id="tid_ - Matching strings must begin with this text
(...) - Creates a group so that we can later access the value of just this part of the match.
\d - Any numeric digit character
{7} - Seven numeric characters in a row
"> - Matching strings must end with this text
In the second pattern, the +, which replaces the {7}, just means one-or-more instead of exactly seven.
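To make the difference between the two patterns concrete, here is a minimal Java sketch (Java is used only for illustration; the class name TidPatternDemo and the sample markup are made up). It runs both patterns over an input that contains one seven-digit and one six-digit id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TidPatternDemo {
    // Exactly seven digits between id="tid_ and ">
    static final Pattern EXACTLY_SEVEN = Pattern.compile("id=\"tid_(\\d{7})\">");
    // One or more digits between the same delimiters
    static final Pattern ONE_OR_MORE = Pattern.compile("id=\"tid_(\\d+)\">");

    // Collects group 1 (the digits only) from every match in the input.
    static List<String> findNumbers(Pattern pattern, String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: one seven-digit id and one six-digit id.
        String html = "<a id=\"tid_3660328\"> <a id=\"tid_366032\">";
        System.out.println(findNumbers(EXACTLY_SEVEN, html)); // [3660328]
        System.out.println(findNumbers(ONE_OR_MORE, html));   // [3660328, 366032]
    }
}
```

The {7} version skips the six-digit id because the closing quote must follow exactly seven digits.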
In VB.NET, you can search an input string using a RegEx pattern, like this:
' At the top of the file:
Imports System.Text.RegularExpressions

Public Function FindNumbers(input As String) As List(Of String)
    Dim numbers As New List(Of String)()
    ' Group 1 captures only the seven digits between id="tid_ and ">
    Dim pattern As String = "id=""tid_(\d{7})"">"
    For Each i As Match In Regex.Matches(input, pattern)
        numbers.Add(i.Groups(1).Value)
    Next
    Return numbers
End Function
Notice that in the string literal in VB, you have to escape the quotation marks by doubling them. You'll also notice that, instead of using i.Value, we are using i.Groups(1).Value. The reason is that i.Value will equal the entire matched string (e.g. id="tid_3660328">), whereas group 1 will equal just the number part (e.g. 3660328).
Update
To answer your question below, to call this function and output it to a TextBox, you could do something like this:
Dim numbers As List(Of String) = FindNumbers("id=""tid_3660328"">")
TextBox1.Text = String.Join(Environment.NewLine, numbers.ToArray())

Consider the following Regex...
(?<=tid_).*?(?=\"\>)
Explanation:
(?<=tid_) : Match the Prefix tid_ but exclude it from the capture
.*? : Any Character, any number of repetitions, as few as possible
(?=\"\>) : Match the suffix "> but exclude it from the capture
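A minimal Java sketch of the look-around pattern above, for illustration only (the class and method names are hypothetical, and the unnecessary backslashes before " and > are dropped). Note that .*? accepts any characters, not just digits, so this version is less strict than \d{7}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // (?<=tid_) lookbehind, .*? lazy body, (?=">) lookahead
    static final Pattern BETWEEN = Pattern.compile("(?<=tid_).*?(?=\">)");

    // Returns the first text found between tid_ and ">, or null if there is none.
    static String firstBetween(String input) {
        Matcher m = BETWEEN.matcher(input);
        return m.find() ? m.group() : null;
    }
}
```

Because the prefix and suffix are only asserted and never consumed, the match value itself is the number, and no group lookup is needed.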

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN                                     OUT
OK THT PHP This is it 06222021         This is it
NO MTM PYT Get this content 111111     Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Groups 1, 3, and 4 with String.Empty
OR
Get Group 2 ONLY
You don't need 4 groups. You can use a single capture group, refer to it in the replacement, and match 6-8 digits for the last part instead of only 6.
Note that \w{0,2} will also match an empty string; use \w{1,2} if there has to be at least one word character.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it
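The note about \w{0,2} matching an empty string can be checked with a small Java sketch (illustrative only; the class name and the input with a missing first token are assumptions). It compares the pattern as given with a variant that requires at least one word character up front.

```java
import java.util.regex.Pattern;

class LeadingTokenDemo {
    // Pattern from the answer: the first token may be empty.
    static final Pattern ALLOWS_EMPTY =
        Pattern.compile("^\\w{0,2}\\s\\w{0,3}\\s\\w{0,3}\\s(.*?)\\s\\d{6,8}\\s?$");
    // Variant: the first token needs at least one word character.
    static final Pattern NEEDS_ONE =
        Pattern.compile("^\\w{1,2}\\s\\w{0,3}\\s\\w{0,3}\\s(.*?)\\s\\d{6,8}\\s?$");

    // Replaces the whole line with group 1; a non-matching line is returned unchanged.
    static String extract(Pattern p, String line) {
        return p.matcher(line).replaceAll("$1");
    }
}
```

With the input " THT PHP This is it 06222021" (leading space, no first token), ALLOWS_EMPTY still yields This is it, while NEEDS_ONE does not match and leaves the line as it was.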
My approach uses neither groups nor a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
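The effect of adding ^ can be sketched in Java (for illustration; the class and method names are made up). As far as I know, Java expects a lookbehind to have a bounded length, so this sketch sticks to the bounded {0,2}/{0,3} quantifiers used so far.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredLookbehindDemo {
    // The prefix may sit anywhere in the input.
    static final Pattern UNANCHORED =
        Pattern.compile("(?<=\\w{0,2}\\s\\w{0,3}\\s\\w{0,3}\\s).*?(?=\\s\\d{6})");
    // The prefix must start at the beginning of the input.
    static final Pattern ANCHORED =
        Pattern.compile("(?<=^\\w{0,2}\\s\\w{0,3}\\s\\w{0,3}\\s).*?(?=\\s\\d{6})");

    // Returns the first match, or null when nothing matches.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }
}
```

For 123 OK THT PHP This is it 06222021 the unanchored pattern returns PHP This is it, because 23 OK THT also satisfies the prefix, while the anchored pattern returns no match at all.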
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;

Regex for masking data

I am trying to implement regex for a JSON Response on sensitive data.
JSON response comes with AccountNumber and AccountName.
Masking details are as below.
accountNumber Before: 7835673653678365
accountNumber Masked: 783567365367****
accountName Before : chris hemsworth
accountName Masked : chri* *********
I am able to match above if I just do [0-9]{12} and (?![0-9]{12}), when I replace this, it is replacing only with *, but my regex is not producing correct output.
How can I produce output as above from regex?
If all you want is to mask every character except the first N, I don't think you really need a complicated regex. To skip the first N characters and replace every character thereafter with *, you can write a generic regex like this:
(?<=.{N}).
where N can be any number like 1,2,3 etc. and replace the match with *
The way this regex works is, it selects every character which has at least N characters before it and hence once it selects a character, all following characters also get selected.
For example, in your AccountNumber case, N = 12, hence your regex becomes,
(?<=.{12}).
Java code,
String s = "7835673653678365";
System.out.println(s.replaceAll("(?<=.{12}).", "*"));
Prints,
783567365367****
And for AccountName case, N = 4, hence your regex becomes,
(?<=.{4}).
Java code,
String s = "chris hemsworth";
System.out.println(s.replaceAll("(?<=.{4}).", "*"));
Prints,
chri***********
If you match [0-9]{12} and replace that directly with a single asterisk, you are left with accountNumber Before: *8365
There is no programming language listed, but one option to replace the digits at the end is to use a positive lookbehind to assert what is on the left are 12 digits followed by a positive lookahead to assert what is on the right are 0+ digits followed by the end of the string.
Then in the replacement use *
If the JSON value is exactly chris hemsworth or 7835673653678365, you can omit the positive lookaheads (?=\d*$) and (?=[\w ]*$), which assert the end of the string in the following two expressions.
Use the versions with the positive lookahead if the data to match is at the end of the string and the string contains more data so you don't replace more matches than you would expect.
(?<=[0-9]{12})(?=\d*$)\d
In Java:
(?<=[0-9]{12})(?=\\d*$)\\d
(?<=[0-9]{12}) Positive lookbehind, assert what is on the left are 12 digits
(?=\d*$) Positive lookahead, assert what is on the right are 0+ digits and assert the end of the string
\d Match a single digit
Result:
783567365367****
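A small Java sketch of why the end-of-string lookahead matters when the string holds more than one number (the class name and the sample input with two 16-digit numbers are assumptions made for illustration).

```java
class TrailingMaskDemo {
    // Masks every digit that has 12 digits directly before it, anywhere in the string.
    static String maskAnywhere(String s) {
        return s.replaceAll("(?<=[0-9]{12})\\d", "*");
    }

    // Same, but only when nothing except digits follows up to the end of the string.
    static String maskAtEndOnly(String s) {
        return s.replaceAll("(?<=[0-9]{12})(?=\\d*$)\\d", "*");
    }
}
```

For ref 1111222233334444 acct 7835673653678365, maskAnywhere masks the last four digits of both numbers, whereas maskAtEndOnly touches only the number at the end of the string.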
For the account name you might do that with 4 word characters \w, but this will also replace the whitespace with an asterisk because I believe you cannot skip matching that space in one regex.
(?<=[\w ]{4})(?=[\w ]*$)[\w ]
In Java
(?<=[\\w ]{4})(?=[\\w ]*$)[\\w ]
Result
chri***********
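If the space should survive, as in the question's expected output chri* *********, one possible variation is to match only non-whitespace characters with \S. This is a sketch under that assumption, not part of the answers above; the class and method names are hypothetical.

```java
class NameMaskDemo {
    // Masks every non-whitespace character that has at least `keep` characters before it.
    // Whitespace is never matched, so it is left in place.
    static String maskKeepingSpaces(String s, int keep) {
        return s.replaceAll("(?<=.{" + keep + "})\\S", "*");
    }
}
```

Calling maskKeepingSpaces("chris hemsworth", 4) gives chri* *********, and the same helper with keep = 12 gives 783567365367**** for the account number.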

Extracting groups of characters that match a pattern from a string in VB

I'm still a regex baby and need some help parsing a string.
I am using VB, and intend to run a string through NCalc, a library that parses mathematical equations from strings.
The problem is, the equations will have numbers, operations and variables.
An equation may look like this:
P20*4.143/((N2+N3)/2)
As you can see, P20, N2 and N3 are variables. In my case, they are stored in a datatable elsewhere in my application.
What I need to do is parse the string, looking for groups of characters in-between operations (-+/*), get their actual values and replace the variable with the value in the original string all while ignoring actual numbers.
The above string should become:
120.5*4.143/((4500+4570)/2)
So something like this:
Dim equation = "P20*4.143/((N2+N3)/2)"
For Each match As String In Regex(match_all_groups_with_letters)
return replace(match, value)
Next
Then I can do something like:
finalResult = NCalc.Doyourmagic(equation)
You may use a simple pattern like
"\b\d*\p{L}[\p{L}\d]*\b"
It matches a leading word boundary \b, zero or more digits (\d*), a letter (\p{L}), and zero or more digits or letters ([\p{L}\d]*) followed with a trailing word boundary (\b).
Adjust the quantifiers accordingly (if the digits are always present, use \d+ instead of \d*). If the letters can only be ASCII letters, use [A-Za-z] (or just uppercase ASCII - [A-Z]) instead of \p{L} (that matches all Unicode letters).
Assuming that variable names contain at least one letter and may also contain digits:
Dim equation = "P20*4.143/((N2+N3)/2)"
Dim pattern = "[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*" ' Can start and end with digits,
                                                  ' must contain at least one letter.
Dim matches = Regex.Matches(equation, pattern)
For Each m As Match In matches
    ' GetValueFor is a placeholder for your own lookup; it must return the value as a String.
    Dim value = GetValueFor(m.Value)
    equation = Regex.Replace(equation, "\b" & m.Value & "\b", value)
Next
\b marks the beginning or end of a word. The square brackets [] enclose character classes. The star * means zero, one or more repetitions of the preceding element. So we have any number of repetitions of letters and digits, then exactly one letter, and finally again any number of repetitions of letters and digits.
Instead of re-using Regex for replacing the identifiers found, you could use the information provided in the match. It has Index and Length properties telling you the exact location of the identifiers in the equation. You can then replace by using string functions. But make sure to iterate over the matches in reverse order so that the positions of the not yet processed identifiers stay valid.
You will have to somehow get the values corresponding to identifiers. A Dictionary(Of String, Double) is ideal for holding the values. The key is the identifier.
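The position-based replacement described above could look roughly like this in Java (a sketch for illustration: Matcher start/end offsets stand in for Index and Length, a Map stands in for the Dictionary, and the class name and sample values are assumptions).

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EquationSubstitution {
    // Identifier: optional leading digits, one letter, then letters or digits.
    static final Pattern IDENTIFIER = Pattern.compile("\\b\\d*\\p{L}[\\p{L}\\d]*\\b");

    static String substitute(String equation, Map<String, Double> values) {
        // Record every match first; the offsets refer to the original string.
        List<MatchResult> matches = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(equation);
        while (m.find()) {
            matches.add(m.toMatchResult());
        }
        StringBuilder sb = new StringBuilder(equation);
        // Walk backwards so the offsets of earlier matches stay valid.
        for (int i = matches.size() - 1; i >= 0; i--) {
            MatchResult r = matches.get(i);
            Double value = values.get(r.group());
            if (value != null) { // unknown identifiers are left untouched
                // Plain decimal text without a trailing ".0"
                String text = BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
                sb.replace(r.start(), r.end(), text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> values = new HashMap<>();
        values.put("P20", 120.5);
        values.put("N2", 4500.0);
        values.put("N3", 4570.0);
        // Prints 120.5*4.143/((4500+4570)/2)
        System.out.println(substitute("P20*4.143/((N2+N3)/2)", values));
    }
}
```

Plain numbers such as 4.143 contain no letter, so the pattern never touches them.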

Regular expression for 7 digits followed by an optional 3 letters

I'm new to regular expressions, and I'm trying to validate receipt numbers in our database with a regular expression.
Our receipts can come in the following formats:
0123456 (Mandatory seven digits, no more, no less)
0126456a (Mandatory seven digits with one letter a-z)
0126456ab (Mandatory seven digits with two letters a-z)
0126456abc (Mandatory seven digits with three letters a-z)
I've tried using a bunch of different regex combinations, but none seem to work. Right now I have:
(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])?([a-z])?([a-z])?
But this is allowing for more than seven digits, and is allowing more than 3 letters.
Here is my VBA function within Access 2010 that will validate the expression:
Function ValidateReceiptNumber(ByVal sReceipt As String) As Boolean
    If (Len(sReceipt) = 0) Then
        ValidateReceiptNumber = False
        Exit Function
    End If
    Dim oRegularExpression As RegExp
    ' Sets the regular expression object
    Set oRegularExpression = New RegExp
    With oRegularExpression
        ' Sets the regular expression pattern
        .Pattern = "(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])?([a-z])?([a-z])?"
        ' Ignores case
        .IgnoreCase = True
        ' Test Receipt string
        ValidateReceiptNumber = .Test(sReceipt)
    End With
End Function
You probably need to use anchors at both ends. Further, your regex can be simplified to:
^\d{7}[a-z]{0,3}$
\d{7} matches exactly 7 digits. You don't need to write \d 7 times for that.
{0,3} creates a range and matches 0 to 3 repetitions of the preceding pattern.
Caret (^) matches the start of the line.
Dollar ($) matches the end of the line.
^(\d){7}[a-z]{0,3}$ might work well. The ^ and $ will match the start and end of line respectively.
You may want to make sure you are matching the entire string, by using anchors.
^(\d)(\d)(\d)(\d)(\d)(\d)(\d)([a-z])?([a-z])?([a-z])?$
You can also simplify the regex. First, you don't need all those parentheses.
^\d\d\d\d\d\d\d[a-z]?[a-z]?[a-z]?$
Also, you can use limited repetition, to prevent repeating yourself.
^\d{7}[a-z]{0,3}$
Where {7} means 'exactly 7 times', and {0,3} means '0-3 times'.
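To see why the anchors matter, here is a minimal Java sketch (for illustration only; the class and method names are made up). An unanchored search succeeds as soon as the pattern fits anywhere inside the input, which is why longer receipts slip through.

```java
import java.util.regex.Pattern;

class ReceiptValidator {
    // The original pattern without anchors: it may match anywhere inside the input.
    static final Pattern UNANCHORED = Pattern.compile(
        "(\\d)(\\d)(\\d)(\\d)(\\d)(\\d)(\\d)([a-z])?([a-z])?([a-z])?",
        Pattern.CASE_INSENSITIVE);
    // Seven digits, then zero to three letters, and nothing else.
    static final Pattern ANCHORED = Pattern.compile(
        "^\\d{7}[a-z]{0,3}$", Pattern.CASE_INSENSITIVE);

    // Substring search: true if the pattern occurs anywhere in the receipt.
    static boolean looseTest(String receipt) {
        return UNANCHORED.matcher(receipt).find();
    }

    // Whole-string check: true only if the entire receipt fits the format.
    static boolean isValid(String receipt) {
        return ANCHORED.matcher(receipt).matches();
    }
}
```

Here looseTest("012345678abcd") is true because the first seven digits already satisfy the pattern, while isValid rejects the same input.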

Recognize numbers in french format inside document using regex

I have a document containing numbers in various formats, french, english, custom formats.
I wanted a regex that could catch ONLY numbers in french format.
This is a complete list of numbers I want to catch (d represents a digit, decimal separator is comma , and thousands separator is space)
d,d d,dd d,ddd
dd,d dd,dd dd,ddd
ddd,d ddd,dd ddd,ddd
d ddd,d d ddd,dd d ddd,ddd
dd ddd,d dd ddd,dd dd ddd,ddd
ddd ddd,d ddd ddd,dd ddd ddd,ddd
d ddd ddd,d...
dd ddd ddd,d...
ddd ddd ddd,d...
This is the regex I have
(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})
It catches French formats like those above, so I am on the right track, but it also matches inside numbers like d,ddd.dd (because it catches d,ddd) or d,ddd,ddd (because it catches d,ddd).
What should I add to my regex ?
The VBA code I have:
Sub ChangeNumberFromFRformatToENformat()
    Dim SectionText As String
    Dim RegEx As Object, RegC As Object, RegM As Object
    Dim i As Integer
    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .MultiLine = False
        .Pattern = "(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})"
        ' regular expression used for the macro to recognise FR formatted numbers
    End With
    For i = 1 To ActiveDocument.Sections.Count()
        SectionText = ActiveDocument.Sections(i).Range.Text
        If RegEx.test(SectionText) Then
            Set RegC = RegEx.Execute(SectionText)
            ' RegC regular expression matches collection, holding french format numbers
            For Each RegM In RegC
                Call ChangeThousandAndDecimalSeparator(RegM.Value)
            Next 'For Each RegM In RegC
            Set RegC = Nothing
            Set RegM = Nothing
        End If
    Next 'For i = 1 To ActiveDocument.Sections.Count()
    Set RegEx = Nothing
End Sub
The user stema, gave me a nice solution. The regex should be:
(?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$)
But VBA complains that the regexp has unescaped characters. I found one in (?: \d{3}): the blank character between ?: and \d, which I can replace with \s. The second, I think, is the comma in (?:,\d{1,3}) between ?: and \d; escaped, it becomes \, .
So the regex is now (?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$) and it works fine in RegExr but my VBA code will not accept it.
NEW LINE IN POST :
I have just discovered that VBA doesn't agree with this sequence of the regex ?<=^
What about this?
\b\d{1,3}(?: \d{3})*(?:,\d{1,3})?\b
\b are word boundaries
First, \d{1,3} matches 1 to 3 digits; then there can be 0 or more groups of a leading space followed by 3 digits ((?: \d{3})*); and finally there can be an optional fraction part ((?:,\d{1,3})?).
Edit:
if you want to avoid 1,111.1 then the \b anchors are not good for you. Try this:
(?<=^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This regex now requires whitespace or the start of the string before the number, and whitespace or the end of the string after it.
Edit 2:
Since lookbehinds are not supported, you can change to
(?:^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This changes nothing at the start of the string, but if the number is preceded by whitespace, that whitespace is now included in the match. If the result of the match is used for something, the leading whitespace has to be stripped first (I am quite sure VBA has a method for that; try Trim()).
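Both variants can be compared in a short Java sketch (illustrative only; the class name and sample text are made up). In the variant without lookbehind, wrapping the number in a capture group is my own addition: it avoids the trimming step by reading group 1 instead of the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FrenchNumberFinder {
    // Lookbehind variant: the whole match (group 0) is the number itself.
    static final Pattern WITH_LOOKBEHIND = Pattern.compile(
        "(?<=^|\\s)\\d{1,3}(?: \\d{3})*(?:,\\d{1,3})?(?=\\s|$)");
    // Variant without lookbehind: group 1 holds the number without the leading whitespace.
    static final Pattern WITHOUT_LOOKBEHIND = Pattern.compile(
        "(?:^|\\s)(\\d{1,3}(?: \\d{3})*(?:,\\d{1,3})?)(?=\\s|$)");

    // Collects the given group from every match in the text.
    static List<String> find(Pattern p, int group, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }
}
```

For the text prix 1 234,56 us 1,111.1 fr 12,5 en 1,234,567, both variants return 1 234,56 and 12,5 and skip the two English-format numbers.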
If you are reading on a line by line basis, you might consider adding anchors (^ and $) to your regex, so you will end up with something like so:
^(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})$
This instructs the RegEx engine to start matching from the beginning of the line till the very end.