Extracting Lines of data from a string with RegEx - regex

I have several strings, e.g.
(3)_(9)--(11).(FT-2)
(10)--(20).(10)/test--(99)
I am trying Regex.Match(here I do no know) to get a list like this:
First sample:
3
_
9
--
11
.
FT-1
Second Sample:
10
--
20
.
10
/test--
99
So there are several numbers in brackets and any text between them.
Can anyone help me doing this in vb.net? A given string returns this list?

One option is to use the Split method of [String]
"(3)_(9)--(11).(FT-2)".Split('()')
Another option is to match everything excluding ( and )
As regex, this would do [^()]+
Breakdown
"[^()]" ' Match any single character NOT present in the list “()”
"+" ' Between one and unlimited times, as many times as possible, giving back as needed (greedy)
You can use following block of code to extract all matches
Try
Dim RegexObj As New Regex("[^()]+", RegexOptions.IgnoreCase)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
' matched text: MatchResults.Value
' match start: MatchResults.Index
' match length: MatchResults.Length
MatchResults = MatchResults.NextMatch()
End While
Catch ex As ArgumentException
'Syntax error in the regular expression
End Try

This should work:
Dim input As String = "(3)_(9)--(11).(FT-2)"
Dim searchPattern As String = "\((?<keep>[^)]+)\)|(?<=\))(?<keep>[^()]+)"
Dim replacementPattern As String = "${keep}" + Environment.NewLine
Dim output As String = RegEx.Replace(input, searchPattern, replacementPattern)

The simplest way is to use Regex.Split (formulated as a little console test):
Dim input = {"(3)_(9)--(11).(FT-2)", "(10)--(20).(10)/test--(99)"}
For Each s As String In input
Dim parts = Regex.Split(s, "\(|\)")
Console.WriteLine($"Input = {s}")
For Each p As String In parts
Console.WriteLine(p)
Next
Next
Console.ReadKey()
So basically we have a one-liner for the regex part.
The regular expression \(|\) means: split at ( or ) where the braces are escaped with \ because of their special meaning within regex.
The slightly shorter regex [()] where the desired characters are enclosed in [] would produce the same result.

Related

How to mimic regular Expression negative lookbehind?

What I'm trying to accomplish
I'm trying to create a function to use string interpolation within VBA. The issue I'm having is that I'm not sure how to replace "\n" with a vbNewLine, as long as it does not have the escape character "\" before it?
What I have found and tried
VBScript does not have a negative look behind as far as I can research.
Below has two examples of Patterns that I have already tried:
Private Sub testingInjectFunction()
Dim dict As New Scripting.Dictionary
dict("test") = "Line"
Debug.Print Inject("${test}1\n${test}2 & link: C:\\notes.txt", dict)
End Sub
Public Function Inject(ByVal source As String, dict As Scripting.Dictionary) As String
Inject = source
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
' PATTERN # 1 REPLACES ALL '\n'
'regEx.Pattern = "\\n"
' PATTERN # 2 REPLACES EXTRA CHARACTER AS LONG AS IT IS NOT '\'
regEx.Pattern = "[^\\]\\n"
' REGEX REPLACE
Inject = regEx.Replace(Inject, vbNewLine)
' REPLACE ALL '${dICT.KEYS(index)}' WITH 'dICT.ITEMS(index)' VALUES
Dim index As Integer
For index = 0 To dict.Count - 1
Inject = Replace(Inject, "${" & dict.Keys(index) & "}", dict.Items(index))
Next index
End Function
Desired result
Line1
Line2 & link: C:\notes.txt
Result for Pattern # 1: (Replaces when not wanted)
Line1
Line2 & link: C:\
otes.txt
Result for Pattern # 2: (Replaces the 1 in 'Line1')
Line
Line2 & link: C:\\notes.txt
Summary question
I can easily write code that doesn't use Regular Expressions that can achieve my desired goal but want to see if there is a way with Regular Expressions in VBA.
How can I use Regular Expressions in VBA to Replace "\n" with a vbNewLine, as long as it does not have the escape character "\" before it?
Yes, you may use a regex here. Since the backslash is not used to escape itself in these strings, you may modify your solution like this:
regEx.Pattern = "(^|[^\\])\\n"
S = regEx.Replace(S, "$1" & vbNewLine)
It will match and capture any char but \ before \n and then will put it back with the $1 placeholder. As there is a chance that \n appears at the start of the string, ^ - the start of string anchor - is added as an alternative into the capturing group.
Pattern details
(^|[^\\]) - Capturing group 1: start of string (^) or (|) any char but a backslash ([^\\])
\\ - a backslash
n - a n char.

VB.NET Regex Replacement

I have a list of fields named flavx(other text) that go 1 through 10.
For example, I might have:
flav2PGPct
I need to turn it to
flav12PGPct
I need to replaced 1 through 10 with 11 through 20 using VB.NET's Replace function with Regex, but I can't get it working right.
Can anyone help?
Here's what I've tried:
(\.)flav*[1-9]
I have no idea what to place in the replacement box...
Use this regex for search: (flav)(\d\w*) and this one for replace: ${1}1$2.
I'd use 2 regex runs to obtain the desired result because it is not possible to use a replacement literal with alternatives.
The first regex would replace 10 to 20 and the second will handle 1 to 9 digits:
Dim rx1to9 As Regex = New Regex("(?<=\D|^)[1-9](?=\D|$)") '1 - 9
Dim rx10 As Regex = New Regex("(?<=\D|^)10(?=\D|$)") '10
Dim str As String = "flav2PG10Pct101"
Dim result = rx10.Replace(str, "20")
result = rx1to9.Replace(result, "1$&")
Console.WriteLine(result)
See IDEONE demo (output is flav12PG20Pct101)
Regex explanation:
(?<=\D|^) - A positive look-behind that makes sure there is no digit (\D) or start of string (^) before...
[1-9] - a single digit from 1 to 9 (or, in the second regex, 10 matching literal 10)
(?=\D|$) - A positive look-ahead that makes sure there is no digit (\D) or the end of string ($) after the digit.
If you must check if flav is present in the string, you may use a bit different look-behind: (?<=flav\D*|^), or - if spaces should not occur between flav and the digit: (?<=flav[^\d\s]*|^).
Regexes work best with strings rather than numbers, so an easy way is to use a regex to get the parts of the string you want to adjust and then concatenate the calculated part in a string:
Option Strict On
Option Infer On
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
Dim re As New Regex("^flav([0-9]+)(.*)$")
Dim s = "flav1PGPct"
Dim t = ""
Dim m = re.Match(s)
If m.Success Then
t = CStr(Integer.Parse(m.Groups(1).Value) + 10)
t = "flav" & t & m.Groups(2).Value
End If
Console.WriteLine(t)
Console.ReadLine()
End Sub
End Module

Extract variables from pattern matching

I'm not a match-pattern expert and I've been working on this for a few hours with no chance :/
I have an input string just like this:
Dim text As String = "32 Barcelona {GM C} 2 {*** Some ""cool"" text here}"
And I just want to extract 3 things:
Barcelona
GM C
*** Some "cool" text here
The pattern I'm trying is something like this:
Dim pattern As String = "^32\s(?<city>[^]].*\s)\{(?<titles>.*\})*"
Dim m As Match = Regex.Match(text, pattern)
If (m.Success) Then
Dim group1 As Group = m.Groups.Item("city")
Dim group2 As Group = m.Groups.Item("titles")
If group1.Success Then
MsgBox("City:" + group1.Value + ":", MsgBoxStyle.Information)
End If
If group2.Success Then
MsgBox(group2.Value, MsgBoxStyle.Information)
End If
Else
MsgBox("fail")
End If
But it's not working anyway :(
What should the pattern be to extract these 3 variables ?
^\d*(?<City>[A-Z a-z0-9]*)\s*\{(?<Titles>[A-Z a-z0-9]*)\}.*?\{(?<Cool>.*?)\}$
Seems to match your sample input.
Expresso is a great tool for designing regular expressions.

Split string on several words, and track which word split it where

I am trying to split a long string based on an array of words. For Example:
Words: trying, long, array
Sentence: "I am trying to split a long string based on an array of words."
Resulting string array:
I am
trying
to split a
long
string based on an
array
of words
Multiple instances of the same word is likely, so having two instances of trying cause a split, or of array, will probably happen.
Is there an easy way to do this in .NET?
The easiest way to keep the delimiters in the result is to use the Regex.Split method and construct a pattern using alternation in a group. The group is key to including the delimiters as part of the result, otherwise it will drop them. The pattern would look like (word1|word2|wordN) and the parentheses are for grouping. Also, you should always escape each word, using the Regex.Escape method, to avoid having them incorrectly interpreted as regex metacharacters.
I also recommend reading my answer (and answers of others) to a similar question for further details: How do I split a string by strings and include the delimiters using .NET?
Since I answered that question in C#, here's a VB.NET version:
Dim input As String = "I am trying to split a long string based on an array of words."
Dim words As String() = { "trying", "long", "array" }
If (words.Length > 0)
Dim pattern As String = "(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")"
Dim result As String() = Regex.Split(input, pattern)
For Each s As String in result
Console.WriteLine(s)
Next
Else
' nothing to split '
Console.WriteLine(input)
End If
If you need to trim the spaces around each word being split you can prefix and suffix \s* to the pattern to match surrounding whitespace:
Dim pattern As String = "\s*(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\s*"
If you're using .NET 4.0 you can drop the ToArray() call inside the String.Join method.
EDIT: BTW, you need to decide up front how you want the split to work. Should it match individual words or words that are a substring of other words? For example, if your input had the word "belong" in it, the above solution would split on "long", resulting in {"be", "long"}. Is that desired? If not, then a minor change to the pattern will ensure the split matches complete words. This is accomplished by surrounding the pattern with a word-boundary \b metacharacter:
Dim pattern As String = "\s*\b(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\b\s*"
The \s* is optional per my earlier mention about trimming.
You could use a regular expression.
(.*?)((?:trying)|(?:long)|(?:array))(.*)
will give you three groups if it matches:
1) The bit before the first instance of any of the split words.
2) The split word itself.
3) The rest of the string.
You can keep matching on (3) until you run out of matches.
I've played around with this but I can't get a single regex that will split on all instances of the target words. Maybe someone with more regex-fu can explain how.
I've assumed that VB has regex support. If not, I'd recommend using a different language. Certainly C# has regexes.
You can split with " ",
and than go through the words and see which one is contained in the "splitting words" array
Dim testS As String = "I am trying to split a long string based on an array of words."
Dim splitON() As String = New String() {"trying", "long", "array"}
Dim newA() As String = testS.Split(splitON, StringSplitOptions.RemoveEmptyEntries)
Something like this
Dim testS As String = "I am trying to split a long string based on a long array of words."
Dim splitON() As String = New String() {"long", "trying", "array"}
Dim result As New List(Of String)
result.Add(testS)
For Each spltr As String In splitON
Dim NewResult As New List(Of String)
For Each s As String In result
Dim a() As String = Strings.Split(s, spltr)
If a.Length <> 0 Then
For z As Integer = 0 To a.Length - 1
If a(z).Trim <> "" Then NewResult.Add(a(z).Trim)
NewResult.Add(spltr)
Next
NewResult.RemoveAt(NewResult.Count - 1)
End If
Next
result = New List(Of String)
result.AddRange(NewResult)
Next
Peter, I hope the below would be suitable for Split string by array of words using Regex
// Input
String input = "insert into tbl1 inserttbl2 insert into tbl2 update into tbl3
updatededle into tbl4 update into tbl5";
//Regex Exp
String[] arrResult = Regex.Split(input, #"\s+(?=(?:insert|update|delete)\s+)",
RegexOptions.IgnoreCase);
//Output
[0]: "insert into tbl1 inserttbl2"
[1]: "insert into tbl2"
[2]: "update into tbl3 updatededle into tbl4"
[3]: "update into tbl5"

Parsing Excel reference with regular expression?

Excel returns a reference of the form
=Sheet1!R14C1R22C71junk
("junk" won't normally be there, but I want to be sure that there's no extraneous text.)
I would like to 'split' this into a VB array, where
a(0)="Sheet1"
a(1)="14"
a(2)="1"
a(3)="22"
a(4)="71"
a(5)="junk"
I'm sure it can be done easily with a regular expression, but I just can't get the hang of it.
Is there a kind soul who could help me?
Thanks
=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)
should work.
[^!]+ matches a sequence of non-exclamation-point characters.
\d+ matches a sequence of digits.
.* matches anything.
So, in VB.NET:
Dim a As Match
a = Regex.Match(SubjectString, "=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")
If a.Success Then
' matched text: a.Value
' backreference n text: a.Groups(n).Value
Else
' Match attempt failed
End If
A straightforward String.Split would work, provided the "junk" text wasn't there:
Dim input As String = "=Sheet1!R14C1R22C71"
Dim result = input.Split(New Char() { "="c, "!"c, "R"c, "C"c }, StringSplitOptions.RemoveEmptyEntries)
For Each item As String In result
Console.WriteLine(item)
Next
The regex gets a little tricky since you will need to go through the Groups and Captures of the nested portions to get the proper order.
EDIT: here's my regex solution. It accepts multiple occurrences of R's and C's.
Dim input As String = "=Sheet1!R14C1R22C71junk"
Dim pattern As String = "=(?<Sheet>Sheet\d+)!(?:R(?<R>\d+)C(?<C>\d+))+"
Dim m As Match = Regex.Match(input, pattern)
If m.Success Then
Console.WriteLine(m.Groups("Sheet").Value)
For i = 0 To m.Groups("R").Captures.Count - 1
Console.WriteLine(m.Groups("R").Captures(i).Value)
Console.WriteLine(m.Groups("C").Captures(i).Value)
Next
End If
Pattern explanation:
"=(?Sheet\d+)" : matches an = sign followed by "Sheet" and digits. Uses named group of "Sheet"
"!(?:R(?\d+)C(?\d+))+" : matches the exclamation mark followed by at least one occurrence of the *R*xx*C*xx portion of the text. Named groups of "R" and "C" are used.
"(?:...)+" : this portion from the above portion matches but does not capture the inner pattern (i.e., the R/C part). This is to avoid unnecessarily capturing them while we are actually capturing them with the named groups.
More general regexes for R1C1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:R((?<RAbs>\d+)|(?<RRel>\[-?\d+\]))C((?<CAbs>\d+)|(?<CRel>\[-?\d+\]))){1,2}$
And A1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:(?<Col1>\$?[a-z]+)(?<Row1>\$?\d+))(?:\:(?<Col2>\$?[a-z]+)(?<Row2>\$?\d+))?$
It doesn't match external references like =[Book1]Sheet1!A1 though.