Get last 3 characters of matched pattern RUTA - regex

I am trying to get last 3 characters of a pattern. But I am stuck on how to do it.
Please share your thoughts on this.
PACKAGE uima.ruta.example;
Document{->RETAINTYPE(SPACE)};
DECLARE VarA;
((W|NUM)* (W|NUM)*){REGEXP(".{12}")-> MARK(VarA),MARK(EntityType,1), UNMARK(VarA)};
I/P - AB1234567CAB
O/P - CAB

You can use $ to indicate where the end of the source string should be in the pattern. For your example, you want the last 3 characters, so you could use a pattern like:
.{3}$
To get the last 3 characters. This would get any character (apart from a \n), but you could be more specific, for example if you just want uppercase letters, you could use:
[A-Z]{3}$
or if you could accept uppercase, lowercase or numbers, you could use
\w{3}$
Experiment on regex101.com to see what works for you.

Suppose your data in cell A1
You can Use the second macro of this two ones
Option Explicit
Sub Extract_Laste_3Carachters(st As Range, Patt$, n)
Dim Obj As Object
Set Obj = CreateObject("Vbscript.RegExp")
With Obj
.Pattern = Patt
.Global = True
End With
If Len(st) <= 3 Then st.Offset(, 1) = st: Exit Sub
If Obj.test(st) Then
If n > Obj.Execute(st).Count Then n = Obj.Execute(st).Count
st.Offset(, 1) = _
Obj.Execute(st)(n - 3) _
& Obj.Execute(st)(n - 2) _
& Obj.Execute(st)(n - 1)
End If
End Sub
'+++++++++++++++++++++++++++++++
Sub Test_Me()
Call Extract_Laste_3Carachters(Range("a1"), ("\w"), Len(Range("a1")))
End Sub

I tried below code and its working now!
PACKAGE uima.ruta.example;
Document{->RETAINTYPE(SPACE)};
"(?i)\\b(?=.*\\d)[1]{0,1}[A-Z0-9]{2}[\\s |-]{0,2}[A-Z0-9]{7}[\\s |-]{0,2}([A-Z]{3})\\b" ->1 = EntityType;

Related

Regex and LINQ extract group by group name

I have a SQL sintax in the form of a string in which there are some parameters written in a standard way "<parameter1>, <parameter2>".
Then i have another string with the parameters values written in a standard way as well: "Parameter1=123; Parameter2=aaa".
I need to match the parameters in the first SQL with the values in the second one.
What I have so far:`
Dim BodySQL = "Blablabal WHERE X=<Parameter1> AND Y=<Parameter2>"
Dim vmp As RegularExpressions.MatchCollection = RegularExpressions.Regex.Matches("Parameter1=2555; Parameter2 = 12/02/2021", "([\w ]+)=([\w ]+)")
Dim vmc As RegularExpressions.MatchCollection = RegularExpressions.Regex.Matches(BodySQL, "(?<=\<).+?(?=\>)")
For Each vm As RegularExpressions.Match In vmc
Dim Vl As String = (From m As RegularExpressions.Match In vmp
Where m.Groups(1).Value.Trim = vm.Value.ToString).Select(Of String)(Function(f) f.Groups(2).Value).ElementAt(0).Trim
BodySQL = BodySQL.Replace(vm.Value, Vl)
Next
It works for the first parameter, but then i get
"System.ArgumentOutOfRangeException: 'Specified argument was out of the range of valid values.
Parameter name: index'"
Can I please ask why?
You can extract the keys and values with one regex from the param=value strings, create a dictionary out of them, and then use Regex.Replace to replace the matches with the dictionary values:
Imports System.Text.RegularExpressions
' ...
Dim BodySQL = "Blablabal WHERE X=<Parameter1> AND Y=<Parameter2>"
Dim args As New Dictionary(Of String, String)(StringComparer.InvariantCultureIgnoreCase)
' (StringComparer.InvariantCultureIgnoreCase) makes the dictionary keys case insensitive
For Each match As Match In Regex.Matches("PARAMETER1=2555; Parameter2 = 12/02/2021", "(\S+)\s*=\s*([^;]*[^;\s])")
args.Add(match.Groups(1).Value, match.Groups(2).Value)
Next
Console.WriteLine(Regex.Replace(BodySQL, "<([^<>]*)>",
Function(match)
Return If(args.ContainsKey(match.Groups(1).Value), args(match.Groups(1).Value), match.Value)
End Function))
Output:
Blablabal WHERE X=2555 AND Y=12/02/2021
The (\S+)\s*=\s*([^;]*[^;\s]) pattern matches
(\S+) - captures into Group 1 any one or more non-whitespace chars (the key value)
\s*=\s* - a = char enclosed with zero or more whitespace chars
([^;]*[^;\s]) - Group 2: any zero or more chars other than ; and then one char other than ; and whitespace (the value, it cannot be empty with this pattern. If you want it to be possible to match empty values, you will need to remove [^;\s] and then use Trim() on the match.Groups(2).Value in the code.)
The <([^<>]*)> regex matches
< - a < char (do not escape this char in any regex flavor please, it is never a special regex metachar)
([^<>]*) - Group 1: any zero or more chars other than < and >
> - a literal > char.
Since the key is in Group 1 and < and > on both ends are consumed, the < and > are removed when replacing with the found value.
zero or more chars other than > and < between < and >.
The error is self explanatory. You are trying to access an array or List and specifying an index value that is either negative or larger than the largest index available.
.ElementAt(0) / m.Groups(1) / f.Groups(2)
My guess is that one of them might go out of bounds. Try to debug it with a breakpoint and check the values of your variables.
This is what i did with your code:
Dim vmc = "\<(.*?)\>"
i changed this regex so that it could also give me the "<>"
Dim BodySQL = "Blablabal WHERE X=<Parameter1> AND Y=<Parameter2>"
Dim args As New Dictionary(Of String, String)
For Each match As Match In Regex.Matches("Parameter1=2555; Parameter2 = 12/02/2021", "(\S+)\s*=\s*(\S[^;]+)")
i changed the regex expression to exclude the ";"
args.Add(match.Groups(1).Value, match.Groups(2).Value)
Next
Console.WriteLine(Regex.Replace(BodySQL, vmc,
Function(match As Match)
Return If(args.ContainsKey(match.Groups(1).Value), args(match.Groups(1).Value), match.Value)
End Function))
And now i have what i needed. Thank you a lot :)
Output:
WHERE X = 2555,Y = 12/02/2021

Match n amount of words separated by commas after base text

I would like to match an infinite amount of words separated by commas and whitespaces.
Is there a better solution than just repeating the search parameter?
Sample:
"2_i Art des Problems:\s*(.[^,\s]+)[,]\s*(.[^,\s]+)[,]\s*(.[^,\s]+)"
2_i Art des Problems: Elektrisch, Schweißausrüstung, Burgenland
View on regex101: https://regex101.com/r/yP7PPO/1
Full code for this operation:
With Reg1
.Pattern = "2_i Art des Problems:+\s*([^\r\n]*\S)"
.Global = False
End With
If Reg1.Test(olMail.Body) Then
Set M1 = Reg1.Execute(olMail.Body)
End If
For Each M In M1
With xExcelApp
Select Case M.SubMatches
Case Software
Range("D6").Value = 1
Case Mechanisch
Range("E6").Value = 1
Case Elektrisch
Range("F6").Value = 1
Case Roboter
Range("G6").Value = 1
Case Schweißausrüstung
Range("H6").Value = 1
Case Anwendung
Range("I6").Value = 1
Case Ersatzteil
Range("J6").Value = 1
Case Else
Range("K6").Value = 1
End Select
End With
Next M
Does it really need to be a RegEx?
I think this is over complicating things as this can easily be solved with Split():
Option Explicit
Public Sub Example()
Const TestString As String = "2_i Art des Problems: Elektrisch, Schweißausrüstung, Burgenland"
Const ConstantPart As String = "2_i Art des Problems: "
If Left$(TestString, Len(ConstantPart)) = ConstantPart Then
Dim Parts() As String
Parts = Split(Mid$(TestString, Len(ConstantPart) + 1), ", ")
Dim Part As Variant
For Each Part In Parts
Debug.Print Part
Next Part
End If
End Sub
Output is:
Elektrisch
Schweißausrüstung
Burgenland
If you realy need to use regexp than use global flag and e.g. this regexp
(.[^,\s]+)(,|$)
Explanation here
With regEx
.Global = True
Use .SubMatches to get capturing groups values
EDIT:
according to one of comment "Then you still need to Trim the matches because they will include the spaces. – Pᴇʜ 1 min ago"
you can still use regexp
.([^,\s]+)(,|$)
check

Extract an 8 digits number from a string with additional conditions

I need to extract a number from a string with several conditions.
It has to start with 1-9, not with 0, and it will have 8 digits. Like 23242526 or 65478932
There will be either an empty space or a text variable before it. Like MMX: 23242526 or bgr65478932
It could have come in rare cases: 23,242,526
It ends with an emty space or a text variable.
Here are several examples:
From RE: Markitwire: 120432889: Mx: 24,693,059 i need to get 24693059
From Automatic reply: Auftrag zur Übertragung IRD Ref-Nr. MMX_23497152 need to get 23497152
From FW: CGMSE 2019-2X A1AN XS2022418672 Contract 24663537 need to get 24663537
From RE: BBVA-MAD MMX_24644644 + MMX_24644645 need to get 24644644, 24644645
Right now I'm using the regexextract function(found it on this web-site), which extracts any number with 8 digits starting with 2. However it would also extract a number from, let's say, this expression TGF00023242526, which is incorrect. Moreover, I don't know how to add additional conditions to the code.
=RegexExtract(A11, ""(2\d{7})\b"", ", ")
Thank you in advance.
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional seperator As String = "") As String
Dim i As Long, j As Long
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = extract_what
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.Count - 1
For j = 0 To allMatches.Item(i).SubMatches.Count - 1
result = result & seperator & allMatches.Item(i).SubMatches.Item(j)
Next
Next
If Len(result) <> 0 Then
result = Right(result, Len(result) - Len(seperator))
End If
RegexExtract = result
End Function
You may create a custom boundary using a non-capturing group before the pattern you have:
(?:[\D0]|^)(2\d{7})\b
^^^^^^^^^^^
The (?:[\D0]|^) part matches either a non-digit (\D) or 0 or (|) start of string (^).
As an alternative to also match 8 digits in values like 23,242,526 and start with a digit 1-9 you might use
\b[1-9](?:,?\d){7}\b
\b Word boundary
[1-9] Match the firstdigit 1-9
(?:,?\d){7} Repeat 7 times matching an optional comma and a digit
\b Word boundary
Regex demo
Then you could afterwards replace the comma's with an empty string.

Check string has a date in it and extract part of the string

I have thousands of lines of text that I need to work through and the lines I am interested with lines that look like the following:
01/04/2019 09:35:41 - Test user (Additional Comments)
I am currently using this code to filter out all the other rows:
If InStr(FullCell(i), " - ") <> 0 And InStr(FullCell(i), ":") <> 0 And InStr(FullCell(i), "(") <> 0 Then
FullCell is the array that I am working through.
which I know is not the best way to do it. Is there a way to check that there is a date at the beginning of the string in the format dd/mm/yyyy and then extract the user name inbetween the '-' and the '(' symbol.
I had a play with regex to see if that could help but i'm limited in skills to be able to pull off both VBA and regex in the same code.
Whats the best way to do this.
Assuming Fullcell(i) contains the string,
If Left(Fullcell(i), 10) Like "##/##/####"
Will return True if you have a date (note that it will not differentiate between dd/mm/yyyy and mm/dd/yyyy.
And
Mid(Fullcell(i), InStr(Fullcell(i), " - ") + 2, InStr(Fullcell(i), " (") - InStr(Fullcell(i), " - ") - 2)
Will return the username
I'm sure there is a more efficient way to do this, but I've used the following solution quite a few times:
This will select the date:
x = 1
Do While Mid(FullCell,1,x) <> " "
x = x + 1
Loop
strDate = Left(FullCell,x)
This will find the character number of the hyphen, the username starts 2 characters after.
x = 1
Do While Mid(FullCell,x,1) <> "-"
x = x + 1
Loop
Then we will find the end of the username
y = x + 2
Do While Mid(FullCell,y,1) <> " "
y = y + 1
Loop
The username should now be characters (x+2 to y-1)
strUsername = Mid(FullCell, x + 2, y - (x + 2) - 1)
Here's how I would do it
Dim your variables
Dim ring as Range
Dim dat as variant
Dim FullCell() as string
Dim User as string
Dim I as long
Set your range
Set rng = ` any way you choose
Dat = rng.value2
Loop dat
For i = 1 to UBound(dat, 1)
Split the data
FullCell = Trim(Split(FullCell, "-"))
Test if it split
If UBound(FullCell) > 0 Then
Test if it matches
If IsDate(FullCell(0)) Then
i = Instr(FullCell(1), "(")-1)
If i then
User = left$(FullCell(1), i)
' Found a user
End If
End If
End If
Next
Abstraction is your friend, it's always helpful to break these into their own private functions whenever you can. You could put your code in a function and call it something like ExtractUsername.
Below I did an example of this, and I decided to go with the RegExp approach (late binding), but you could use string functions like the examples above as well.
This function returns the username if it finds the pattern you mentioned above, otherwise, it returns an empty string.
Private Function ExtractUsername(ByVal SourceString As String) As String
Dim RegEx As Object
Set RegEx = CreateObject("vbscript.regexp")
'(FIRST GROUP FINDS THE DATE FORMATTED AS DD/MM/YYY, AS WELL AS THE FORWARD SLASH)
'(SECOND GROUP FINDS THE USERNAME) THIS WILL BE SUBMATCH 1
With RegEx
.Pattern = "(^\d{2}\/\d{2}\/\d{4}.*-)(.+)(\()"
.Global = True
End With
Dim Match As Object
Set Match = RegEx.Execute(SourceString)
'ONLY RETURN IF A MATCH WAS FOUND
If Match.Count > 0 Then
ExtractUsername = Trim(Match(0).SubMatches(1))
End If
Set RegEx = Nothing
End Function
The regex pattern is grouped into three parts, the date (and slash), username, and opening parentheses. What you are interested in is the username, which in the SubMatch would be number 1.
Regexr is a helpful site for practicing regular expressions and can show you a bit more of what the pattern I went with is doing.
Please note that using regular expressions might give you performance issues and you should test it against regular string functions to see what works best for your situation.

Extract largest numeric sequence from string (regex, or?)

I have strings similar to the following:
4123499-TESCO45-123
every99999994_54
And I want to extract the largest numeric sequence in each string, respectively:
4123499
99999994
I have previously tried regex (I am using VB6)
Set rx = New RegExp
rx.Pattern = "[^\d]"
rx.Global = True
StringText = rx.Replace(StringText, "")
Which gets me partway there, but it only removes the non-numeric values, and I end up with the first string looking like:
412349945123
Can I find a regex that will give me what I require, or will I have to try another method? Essentially, my pattern would have to be anything that isn't the longest numeric sequence. But I'm not actually sure if that is even a reasonable pattern. Could anyone with a better handle of regex tell me if I am going down a rabbit hole? I appreciate any help!
You cannot get the result by just a regex. You will have to extract all numeric chunks and get the longest one using other programming means.
Here is an example:
Dim strPattern As String: strPattern = "\d+"
Dim str As String: str = "4123499-TESCO45-123"
Dim regEx As New RegExp
Dim matches As MatchCollection
Dim match As Match
Dim result As String
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = strPattern
End With
Set matches = regEx.Execute(str)
For Each m In matches
If result < Len(m.Value) Then result = m.Value
Next
Debug.Print result
The \d+ with RegExp.Global=True will find all digit chunks and then only the longest will be printed after all matches are processed in a loop.
That's not solvable with an RE on its own.
Instead you can simply walk along the string tracking the longest consecutive digit group:
For i = 1 To Len(StringText)
If IsNumeric(Mid$(StringText, i, 1)) Then
a = a & Mid$(StringText, i, 1)
Else
a = ""
End If
If Len(a) > Len(longest) Then longest = a
Next
MsgBox longest
(first result wins a tie)
If the two examples you gave, are of a standard where:
<long_number>-<some_other_data>-<short_number>
<text><long_number>_<short_number>
Are the two formats that the strings come in, there are some solutions.
However, if you are searching any string in any format for the longest number, these will not work.
Solution 1
([0-9]+)[_-].*
See the demo
In the first capture group, you should have the longest number for those 2 formats.
Note: This assumes that the longest number will be the first number it encounters with an underscore or a hyphen next to it, matching those two examples given.
Solution 2
\d{6,}
See the demo
Note: This assumes that the shortest number will never exceed 5 characters in length, and the longest number will never be shorter than 6 characters in length
Please, try.
Pure VB. No external libs or objects.
No brain-breaking regexp's patterns.
No string manipulations, so - speed. Superspeed. ~30 times faster than regexp :)
Easy transform on variouse needs.
For example, concatenate all digits from the source string to a single string.
Moreover, if target string is only intermediate step,
so it's possible to manipulate with numbers only.
Public Sub sb_BigNmb()
Dim sSrc$, sTgt$
Dim taSrc() As Byte, taTgt() As Byte, tLB As Byte, tUB As Byte
Dim s As Byte, t As Byte, tLenMin As Byte
tLenMin = 4
sSrc = "every99999994_54"
sTgt = vbNullString
taSrc = StrConv(sSrc, vbFromUnicode)
tLB = LBound(taSrc)
tUB = UBound(taSrc)
ReDim taTgt(tLB To tUB)
t = 0
For s = tLB To tUB
Select Case taSrc(s)
Case 48 To 57
taTgt(t) = taSrc(s)
t = t + 1
Case Else
If CBool(t) Then Exit For ' *** EXIT FOR ***
End Select
Next
If (t > tLenMin) Then
ReDim Preserve taTgt(tLB To (t - 1))
sTgt = StrConv(taTgt, vbUnicode)
End If
Debug.Print "'" & sTgt & "'"
Stop
End Sub
How to handle sSrc = "ev_1_ery99999994_54", please, make by yourself :)
.