Recognize numbers in French format inside a document using regex

I have a document containing numbers in various formats: French, English, and custom formats.
I want a regex that catches ONLY numbers in French format.
This is the complete list of numbers I want to catch (d represents a digit, the decimal separator is a comma, and the thousands separator is a space):
d,d d,dd d,ddd
dd,d dd,dd dd,ddd
ddd,d ddd,dd ddd,ddd
d ddd,d d ddd,dd d ddd,ddd
dd ddd,d dd ddd,dd dd ddd,ddd
ddd ddd,d ddd ddd,dd ddd ddd,ddd
d ddd ddd,d...
dd ddd ddd,d...
ddd ddd ddd,d...
This is the regex I have
(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})
It catches French formats like those above, so I am on the right track, but it also catches parts of numbers like d,ddd.dd (because it matches d,ddd) or d,ddd,ddd (because it matches d,ddd).
What should I add to my regex?
The VBA code I have:
Sub ChangeNumberFromFRformatToENformat()
Dim SectionText As String
Dim RegEx As Object, RegC As Object, RegM As Object
Dim i As Integer
Set RegEx = CreateObject("vbscript.regexp")
With RegEx
.Global = True
.MultiLine = False
.Pattern = "(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})"
' regular expression used by the macro to recognise FR-formatted numbers
End With
For i = 1 To ActiveDocument.Sections.Count()
SectionText = ActiveDocument.Sections(i).Range.Text
If RegEx.test(SectionText) Then
Set RegC = RegEx.Execute(SectionText)
' RegC is the regular expression match collection, holding French-format numbers
For Each RegM In RegC
Call ChangeThousandAndDecimalSeparator(RegM.Value)
Next 'For Each RegM In RegC
Set RegC = Nothing
Set RegM = Nothing
End If
Next 'For i = 1 To ActiveDocument.Sections.Count()
Set RegEx = Nothing
End Sub
The user stema gave me a nice solution. The regex should be:
(?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$)
But VBA complains that the regexp has unescaped characters. I found one in (?: \d{3}): the blank character, which I can replace with \s. The second one, I think, is the comma in (?:,\d{1,3}), which becomes \, when escaped.
So the regex is now (?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$) and it works fine in RegExr, but my VBA code will not accept it.
NEW LINE IN POST:
I have just discovered that VBA rejects the (?<=^ sequence, that is, the lookbehind at the start of the regex.

What about this?
\b\d{1,3}(?: \d{3})*(?:,\d{1,3})?\b
See it here on Regexr
\b are word boundaries
First, \d{1,3} matches 1 to 3 digits; then there can be 0 or more groups of a space followed by 3 digits ((?: \d{3})*); finally, there can be an optional fractional part ((?:,\d{1,3})?).
Edit:
If you want to avoid matching part of 1,111.1, then the \b anchors are not good for you. Try this:
(?<=^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This regex now requires whitespace or the start of the string before the number, and whitespace or the end of the string after it.
Edit 2:
Since lookbehinds are not supported, you can change to
(?:^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This changes nothing at the start of the string, but if the number is preceded by whitespace, that whitespace is now included in the match. If the matched text is used for something, the leading whitespace has to be stripped first (I am quite sure VBA has a method for that; try Trim()).
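For illustration, here is a minimal sketch of the Edit 2 idea, written in JavaScript only because it is easy to run. It is not the original macro: the helper names findFrenchNumbers and toEnglishFormat are made up for this example, and wrapping the number in a capture group is an assumption that stands in for the trim step, since group 1 never contains the leading whitespace.

```javascript
// No lookbehind: (?:^|\s) consumes the separator, and the number itself
// is captured in group 1, so nothing needs to be trimmed afterwards.
const FRENCH_NUMBER = /(?:^|\s)(\d{1,3}(?: \d{3})*(?:,\d{1,3})?)(?=\s|$)/g;

// Hypothetical helper: list every French-formatted number in the text.
function findFrenchNumbers(text) {
  return Array.from(text.matchAll(FRENCH_NUMBER), (m) => m[1]);
}

// Hypothetical helper: "1 234,56" -> "1,234.56".
function toEnglishFormat(frenchNumber) {
  const [integerPart, fractionPart] = frenchNumber.split(",");
  const grouped = integerPart.replace(/ /g, ",");
  return fractionPart === undefined ? grouped : `${grouped}.${fractionPart}`;
}

const text = "Total 1 234,56 EUR, not 1,234.56 or 1,234,567; rate 12,5 %";
const found = findFrenchNumbers(text);
console.log(found);                      // [ '1 234,56', '12,5' ]
console.log(found.map(toEnglishFormat)); // [ '1,234.56', '12.5' ]
```

The English-style numbers are skipped because the character after the candidate match is a dot or a comma, which fails the (?=\s|$) lookahead.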

If you are reading on a line-by-line basis, you might consider adding anchors (^ and $) to your regex, so you will end up with something like this:
^(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})$
This instructs the regex engine to match from the beginning of the line to the very end, so a line is accepted only if it consists of the number and nothing else.
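As a rough sketch of how the anchored pattern behaves, the JavaScript snippet below tests whole lines against it. The function name isFrenchNumberLine is hypothetical, and the pattern is the one from this answer with the redundant backslashes before the commas dropped.

```javascript
// Anchored version: the whole line must be a French-formatted number.
const WHOLE_LINE = /^(\d{1,3}\s(\d{3}\s)*\d{3}(,\d{1,3})?|\d{1,3},\d{1,3})$/;

// Hypothetical helper: true when the line is exactly one French number.
function isFrenchNumberLine(line) {
  return WHOLE_LINE.test(line);
}

console.log(isFrenchNumberLine("1 234,56"));   // true
console.log(isFrenchNumberLine("12 345 678")); // true
console.log(isFrenchNumberLine("12,5"));       // true
console.log(isFrenchNumberLine("1,234.56"));   // false
console.log(isFrenchNumberLine("1,234,567"));  // false
```

Note that this pattern needs either a thousands group or a decimal part, so a plain integer such as 123 is not accepted.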

Related

Use Regex to Split Numbered List array into Numbered List Multiline

I am trying to learn regex to answer a question on Stack Overflow in Portuguese.
Input (an array or a string in a cell, so .MultiLine = False):
1 One without dot. 2. Some Random String. 3.1 With SubItens. 3.2 With number 0n mid. 4. Number 9 incorrect. 11.12 More than one digit. 12.7 Ending (no word).
Output
1 One without dot.
2. Some Random String.
3.1 With SubItens.
3.2 With number 0n mid.
4. Number 9 incorrect.
11.12 More than one digit.
12.7 Ending (no word).
What I thought of was to use Regex with Split, but I wasn't able to implement the example in Excel.
Imports System.Text.RegularExpressions
Module Example
Public Sub Main()
Dim input As String = "plum-pear"
Dim pattern As String = "(-)"
Dim substrings() As String = Regex.Split(input, pattern) ' Split on hyphens.
For Each match As String In substrings
Console.WriteLine("'{0}'", match)
Next
End Sub
End Module
' The method writes the following to the console:
' 'plum'
' '-'
' 'pear'
So, after reading this and this, I used the RegExr website with the expression /([0-9]{1,2})([.]{0,1})([0-9]{0,2})/igm on the input.
And the following is obtained:
Is there a better way to do this? Is the regex correct, or is there a better way to generate it? The examples that I found on Google didn't enlighten me on how to use regex with Split correctly.
Maybe I am confused about the logic of the Split function: I wanted to get the split index, with the regex as the separator string.
I can guarantee that each item ends with a word and a period.
Use
\d+(?:\.\d+)*[\s\S]*?\w+\.
See the regex demo.
Details
\d+ - 1 or more digits
(?:\.\d+)* - zero or more sequences of:
\. - dot
\d+ - 1 or more digits
[\s\S]*? - any 0+ chars, as few as possible, up to the first...
\w+\. - 1+ word chars followed by a dot (.)
Here is a sample VBA code:
Dim str As String
Dim objRegExp As Object, objMatches As Object, m As Object
str = " 1 One without dot. 2. Some Random String. 3.1 With SubItens. 3.2 With Another SubItem. 4. List item. 11.12 More than one digit."
' Late binding, so no project reference is needed (early binding: New RegExp)
Set objRegExp = CreateObject("VBScript.RegExp")
objRegExp.Pattern = "\d+(?:\.\d+)*[\s\S]*?\w+\."
objRegExp.Global = True ' find every item, not only the first
Set objMatches = objRegExp.Execute(str)
If objMatches.Count <> 0 Then
For Each m In objMatches
Debug.Print m.Value
Next
End If
NOTE
You may require the matches to stop only at a word + . that is followed by 0+ whitespace characters and a number, using \d+(?:\.\d+)*[\s\S]*?[a-zA-Z]+\.(?=\s*(?:\d+|$)).
The (?=\s*(?:\d+|$)) positive lookahead requires the presence of 0+ whitespace characters (\s*) followed by 1+ digits (\d+) or the end of the string ($) immediately to the right of the current location.
If VBA's split supports lookahead in a regex, then this one may work, assuming there are no digits except in the indexes:
\s(?=\d)
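To compare the two suggestions, here is a small JavaScript sketch that applies both the match-based pattern and the lookahead split to the sample string from the VBA answer. The function names itemsByMatch and itemsBySplit are invented for this example, and JavaScript is used only because its split accepts a regex; it says nothing about what VBA's Split can do.

```javascript
const sample =
  "1 One without dot. 2. Some Random String. 3.1 With SubItens. " +
  "3.2 With Another SubItem. 4. List item. 11.12 More than one digit.";

// Approach 1: match each item, from its index up to the first "word." after it.
function itemsByMatch(text) {
  return text.match(/\d+(?:\.\d+)*[\s\S]*?\w+\./g) || [];
}

// Approach 2: split on a whitespace character that is followed by a digit.
// This only works when digits appear nowhere except in the indexes.
function itemsBySplit(text) {
  return text.trim().split(/\s(?=\d)/);
}

console.log(itemsByMatch(sample).join("\n"));
console.log(itemsBySplit(sample).length); // 6
```

On this sample both approaches return the same six items. With the original input from the question they would differ, because items such as "Number 9 incorrect." contain digits outside the index.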

Validating a string's first 3 letters as uppercase with regex

I have a question on Classic ASP regarding validating a string's first 3 letters to be uppercase while the last 4 characters should be in numerical form using regex.
For example:
Dim myString : myString = "abc1234"
How do I validate that it should be "ABC1234" instead of "abc1234"?
Apologies for my broken English and for being a newbie in Classic ASP.
@ndn has a good regex pattern for you. To apply it in Classic ASP, you just need to create a RegExp object that uses the pattern and then call the Test() function to test your string against the pattern.
For example:
Dim re
Set re = New RegExp
re.Pattern = "^[A-Z]{3}.*[0-9]{4}$" ' @ndn's pattern
If re.Test(myString) Then
' Match. First three characters are uppercase letters and last four are digits.
Else
' No match.
End If
^[A-Z]{3}.*[0-9]{4}$
Explanation:
Surround everything with ^$ (start and end of string) to ensure you are matching everything
[A-Z] - gives you all capital letters in the English alphabet
{3} - three of those
.* - optionally, there can be something in between (if there can't be, you can just remove this)
[0-9] - any digit
{4} - 4 of those
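The effect of the optional .* in the middle can be seen in a short JavaScript sketch. The constant names LOOSE and STRICT are made up here; STRICT is simply the same pattern with .* removed, as the explanation suggests.

```javascript
// Pattern from the answer: three capitals, anything, then four digits.
const LOOSE = /^[A-Z]{3}.*[0-9]{4}$/;
// Same pattern without ".*": exactly three capitals followed by four digits.
const STRICT = /^[A-Z]{3}[0-9]{4}$/;

console.log(LOOSE.test("ABC1234"));    // true
console.log(LOOSE.test("abc1234"));    // false (lowercase start)
console.log(LOOSE.test("ABCxx1234"));  // true  (".*" absorbs "xx")
console.log(STRICT.test("ABCxx1234")); // false
console.log(STRICT.test("ABC1234"));   // true
```

If the string must be exactly seven characters long, the stricter form is probably the one to use.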

VB.NET regex searching for AAA-9999

I need help with finding the first 3 capital letters A-Z and then a space followed by 4 numbers 0-9.
Dim IndividualClasses As MatchCollection = Regex.Matches(AllExitClasses(a), "([A-Z])([A-Z])([A-Z]) ([0-9])([0-9])([0-9])([0-9])")
An example input string would be AML 4309 or DEF 4298.
The above 7 characters are what I want to get out of string.
EDIT: Since you preprocess your input string, you can use
Dim IndividualClasses As MatchCollection = Regex.Matches(AllExitClasses(a).Replace(" ", "-"), "[A-Z]{3}[-][0-9]{4}")
REGEX EXPLANATION:
[A-Z]{3} - 3 occurrences of English letters A to Z
[-] - A character class matching exactly one hyphen
[0-9]{4} - Exactly 4 occurrences of digits from 0 to 9.
Note that I removed capturing groups since you do not seem to be using them at all, and I am using limiting quantifiers, e.g. {4}.
Note that you could use your input string as is and previous regex [A-Z]{3}\p{Zs}[0-9]{4}, but you would need to iterate through the match collection and replace a space in each Match.Value with a hyphen creating a new array.
Here is an IDEONE demo
OK, I replaced the spaces with a dash,
and then I am using this regular expression,
"([A-Z])([A-Z])([A-Z])([-])([0-9])([0-9])([0-9])([0-9])"
which works:
AllExitClasses(a) = AllExitClasses(a).Replace(" ", "-")
'
MyClassString = AllExitClasses(a).ToString
Dim IndividualClasses As MatchCollection = Regex.Matches(MyClassString, "([A-Z])([A-Z])([A-Z])([-])([0-9])([0-9])([0-9])([0-9])")
Regex.Matches([variable], "^([A-Z]{3,3})(\s)([0-9]{4,4})$")
This regex will find your AAA 1111: 3 uppercase letters with ([A-Z]{3,3}), one whitespace character with (\s), and exactly 4 digits with ([0-9]{4,4}). Because of the ^ and $ anchors, it only matches when the whole input is such a code. I have found that http://regex101.com helps a lot with expressions in different languages.
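As a sketch of the alternative mentioned in one of the answers (match the codes in the unmodified string, then replace the space inside each match), here is a JavaScript version. The function name extractCourseCodes and the sample sentence are invented for illustration; this is not the VB.NET code from the thread.

```javascript
// Find every "AAA 9999" code in the raw text, then hyphenate each match.
function extractCourseCodes(text) {
  const matches = text.match(/[A-Z]{3} [0-9]{4}/g) || [];
  return matches.map((code) => code.replace(" ", "-"));
}

console.log(extractCourseCodes("Completed AML 4309, DEF 4298 pending"));
// [ 'AML-4309', 'DEF-4298' ]
```

This leaves the rest of the input untouched, whereas replacing every space up front also changes spaces that are not part of a code.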

Regex to get text between two strings

I want to get dynamic six- or seven-digit numbers, as shown below:
id="tid_3660328">
and append them to the end of TextBox1.
In other words, I want to get the number: 3660328
From between: id="tid_
and: ">
My question is how I could do this in VB.NET. My first thought was "regex", which is a topic I have zero experience on. I appreciate the help.
Note: I was thinking I could use the code here but with my own regex: https://stackoverflow.com/a/9332731
This is a good place for using RegEx.
If you only want to find numbers that are exactly seven digits you could use this RegEx pattern:
id="tid_(\d{7})">
Or, if you don't care how many digits it is, you could use this pattern:
id="tid_(\d+)">
Here's what the pattern means:
id="tid_ - Matching strings must begin with this text
(...) - Creates a group so that we can later access the value of just this part of the match.
\d - Any numeric digit character
{7} - Seven numeric characters in a row
"> - Matching strings must end with this text
In the second pattern, the +, which replaces the {7} just means one-or-more instead of exactly seven.
In VB.NET, you can search an input string using a RegEx pattern, like this:
' Requires Imports System.Text.RegularExpressions at the top of the file
Public Function FindNumbers(input As String) As List(Of String)
Dim numbers As New List(Of String)()
Dim pattern As String = "id=""tid_(\d{7})"">"
For Each i As Match In Regex.Matches(input, pattern)
numbers.Add(i.Groups(1).Value)
Next
Return numbers
End Function
Notice that in the string literal in VB, you have to escape the quotation marks by doubling them. You'll also notice that, instead of using i.Value, we are using i.Groups(1).Value. The reason is that i.Value will equal the entire matched string (e.g. id="tid_3660328">), whereas group 1 will equal just the number part (e.g. 3660328).
Update
To answer your question below, to call this function and output it to a TextBox, you could do something like this:
Dim numbers As List(Of String) = FindNumbers("id=""tid_3660328"">")
TextBox1.Text = String.Join(Environment.NewLine, numbers.ToArray())
Consider the following Regex...
(?<=tid_).*?(?=\"\>)
Explanation:
(?<=tid_) : Match the prefix tid_ but exclude it from the match
.*? : Any character, any number of repetitions, as few as possible
(?=\"\>) : Match the suffix "> but exclude it from the match
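The two approaches in this thread (a capture group versus lookarounds) can be put side by side in a short JavaScript sketch. The function names and the sample markup are assumptions made for this example; the lookbehind needs an engine that supports it, which current JavaScript engines do.

```javascript
// Capture-group approach: match the whole attribute, keep only group 1.
function findIdsWithGroup(html) {
  return Array.from(html.matchAll(/id="tid_(\d+)">/g), (m) => m[1]);
}

// Lookaround approach: the match itself is only the text in between.
function findIdsWithLookaround(html) {
  return html.match(/(?<=tid_).*?(?=">)/g) || [];
}

const html = '<a id="tid_3660328">x</a><a id="tid_123456">y</a>';
console.log(findIdsWithGroup(html));      // [ '3660328', '123456' ]
console.log(findIdsWithLookaround(html)); // [ '3660328', '123456' ]
```

One difference to keep in mind: the lookaround pattern uses .*? and therefore accepts any characters between tid_ and ">, while the capture-group pattern only accepts digits.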

Putting space in camel case string using regular expression

I am deriving my question from add a space between two words.
Requirement: Split a camel-case string by putting a space just before each capital letter that is followed by a lowercase letter or possibly by nothing. No space should be inserted between consecutive capital letters.
e.g.: CSVFilesAreCoolButTXT is a string that I want to turn into CSV Files Are Cool But TXT
I derived a regular expression this way:
"LightPurple".replace(/([a-z])([A-Z])/, '$1 $2')
If you have more than 2 words, then you'll need to use the g flag, to match them all.
"LightPurpleCar".replace(/([a-z])([A-Z])/g, '$1 $2')
If you are trying to split words like CSVFile then you might need to use this regexp instead:
"CSVFilesAreCool".replace(/([a-zA-Z])([A-Z])([a-z])/g, '$1 $2$3')
But it still does not meet the requirements I stated above.
var rex = /([A-Z])([A-Z])([a-z])|([a-z])([A-Z])/g;
"CSVFilesAreCoolButTXT".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT"
And also
"CSVFilesAreCoolButTXTRules".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT Rules"
The text of the subject string that matches the regex pattern will be replaced by the replacement string '$1$4 $2$3$5', where the $1, $2 etc. refer to the substrings matched by the pattern's capture groups ().
$1 refers to the substring matched by the first ([A-Z]) sub-pattern, and $3 refers to the substring matched by the first ([a-z]) sub-pattern etc.
Because of the alternation character |, to make a match the regex will have to match either the ([A-Z])([A-Z])([a-z]) sub-pattern or the ([a-z])([A-Z]) sub-pattern, so whenever a match is made some of the capture groups will remain unmatched. These capture groups can still be referenced in the replacement string, but they have no effect upon it: effectively, they reference an empty string.
The space in the replacement string ensures a space is inserted in the subject string every time a match is made (the trailing g flag means the regular expression engine will look for more than one match).
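To make the point about unmatched groups concrete, the sketch below runs the same regex twice: once with the replacement string from the answer and once with a replacer callback, where a group that did not take part in the match arrives as undefined. The function names splitWithString and splitWithCallback are made up for this illustration.

```javascript
const rex = /([A-Z])([A-Z])([a-z])|([a-z])([A-Z])/g;

// Replacement-string form: unmatched groups expand to empty strings.
function splitWithString(s) {
  return s.replace(rex, "$1$4 $2$3$5");
}

// Callback form: unmatched groups are passed in as undefined,
// so we branch on which alternative actually matched.
function splitWithCallback(s) {
  return s.replace(rex, (match, g1, g2, g3, g4, g5) =>
    g1 !== undefined ? `${g1} ${g2}${g3}` : `${g4} ${g5}`
  );
}

console.log(splitWithString("CSVFilesAreCoolButTXTRules"));
// "CSV Files Are Cool But TXT Rules"
console.log(splitWithCallback("CSVFilesAreCoolButTXTRules"));
// "CSV Files Are Cool But TXT Rules"
```

Both forms give the same result; the callback just spells out which of the two alternatives fired for each match.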
If the first character is always lowercase:
'camelCaseString'.replace(/([A-Z]+)/g, ' $1')
If the first character is uppercase:
'CamelCaseString'.replace(/([A-Z]+)/g, ' $1').replace(/^ /, '')
Splitting CamelCase with regex in .NET :
Regex.Replace(input, "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Example :
Regex.Replace("TheCapitalOfTheUAEIsAbuDhabi", "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Output :
The Capital Of The UAE Is Abu Dhabi
This worked for me
let camelCase = "CSVFilesAreCoolButTXTRules"
let re = /[A-Z-_\&](?=[a-z0-9]+)|[A-Z-_\&]+(?![a-z0-9])/g
let delimited = camelCase.replace(re,' $&').trim()
The above code works for almost all the use cases I had. I had a few peculiarities where '&' and '_' should be treated as equivalent to an uppercase character:
ThisIsASlug ---> This Is A Slug
loremIpsum ---> lorem Ipsum
PAGS_US ---> PAGS_US
TheCapitalOfTheUAEIsAbuDhabi ---> The Capital Of The UAE Is Abu Dhabi
eclipseRCPExt ---> eclipse RCP Ext
VALUE ---> VALUE
SG&A ---> SG&A
A brief explanation
[A-Z-_\&](?=[a-z0-9]+)
//Matches normal words i.e. one uppercase followed by one or more non-uppercase characters
[A-Z-_\&]+(?![a-z0-9])
//Matches acronyms & abbreviations i.e. a sequence of uppercase characters that are not followed by non-uppercase characters
Check out the regexr fiddle here
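The sample cases listed in this answer can be checked with a small table-driven sketch. The regex is copied from the answer unchanged; the function name delimit and the cases object are only scaffolding added for this example.

```javascript
const re = /[A-Z-_\&](?=[a-z0-9]+)|[A-Z-_\&]+(?![a-z0-9])/g;

// Insert a space before every match, then drop the leading space.
function delimit(s) {
  return s.replace(re, " $&").trim();
}

// Input -> expected output, taken from the list in the answer.
const cases = {
  ThisIsASlug: "This Is A Slug",
  loremIpsum: "lorem Ipsum",
  PAGS_US: "PAGS_US",
  TheCapitalOfTheUAEIsAbuDhabi: "The Capital Of The UAE Is Abu Dhabi",
  eclipseRCPExt: "eclipse RCP Ext",
  VALUE: "VALUE",
  "SG&A": "SG&A",
};

for (const [input, expected] of Object.entries(cases)) {
  console.log(`${input} ---> ${delimit(input)}`, delimit(input) === expected);
}
```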
Camel-case replacement for JavaScript using lookaheads:
"TheCapitalOfTheUAEIsAbuDhabi".replace(/([A-Z](?=[a-z]+)|[A-Z]+(?![a-z]))/g, ' $1').trim()
// "The Capital Of The UAE Is Abu Dhabi"