Pretty String Manipulation - regex

I have the following string which I wish to extract parts from:
<FONT COLOR="GREEN">201 KAR 2:340.</FONT>
In this particular case, I wish to extract the numbers 201, 2, and 340, which I will later concatenate to form another string:
http://www.lrc.state.ky.us/kar/201/002/340reg.htm
I have a solution, but it is not easily readable, and it seems rather clunky. It involves using the mid function. Here it is:
'length = the characters between ">" and "KAR", minus the separating space
intTitle = CInt(Mid(strFontTag,
InStr(strFontTag, ">") + 1,
InStr(strFontTag, "KAR") - InStr(strFontTag, ">") - 2))
I would like to know if perhaps there is a better way to approach this task. I realize I could make some descriptive variable names, like intPosOfEndOfOpeningFontTag to describe what the first InStr function does, but it still feels clunky to me.
Should I be using some sort of split function, or regex, or some more elegant way that I have not come across yet? I have been manipulating strings in this fashion for years, and I just feel there must be a better way. Thanks.

<FONT[^>]*>[^\d]*(\d+)[^\d]*(\d+):(\d+)[^\d]*</FONT>
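A minimal sketch of how the pattern above might be applied to build the target URL. The module and function names are hypothetical, and zero-padding the middle number to three digits is an assumption taken from the single example URL in the question (2 becomes 002).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module KarUrlSketch
    ' Hypothetical helper: extracts the three numbers and formats the URL.
    ' Returns Nothing when the tag does not match the pattern.
    Function BuildKarUrl(fontTag As String) As String
        Dim m = Regex.Match(fontTag,
            "<FONT[^>]*>[^\d]*(\d+)[^\d]*(\d+):(\d+)[^\d]*</FONT>",
            RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing

        Dim title = Integer.Parse(m.Groups(1).Value)
        Dim chapter = Integer.Parse(m.Groups(2).Value)
        Dim regulation = Integer.Parse(m.Groups(3).Value)

        ' Assumption: only the chapter is padded to three digits.
        Return String.Format("http://www.lrc.state.ky.us/kar/{0}/{1:000}/{2}reg.htm",
                             title, chapter, regulation)
    End Function

    Sub Main()
        ' Prints http://www.lrc.state.ky.us/kar/201/002/340reg.htm
        Console.WriteLine(BuildKarUrl("<FONT COLOR=""GREEN"">201 KAR 2:340.</FONT>"))
    End Sub
End Module
```

The numbered groups keep the calling code short; named groups such as `(?<title>\d+)` would read better if the pattern grows.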

The class
Imports System
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Xml
Imports System.Xml.Linq
Imports System.Linq
Public Class clsTester
'methods
Public Sub New()
End Sub
Public Function GetTitleUsingRegEx(ByVal fpath$) As XElement
'use this function if your input is not well-formed XML
Dim result As New XElement(<result/>)
Try
Dim q = Regex.Matches(File.ReadAllText(fpath), Me.titPattern1, RegexOptions.None)
For Each mt As Match In q
Dim t As New XElement(<title/>)
t.Add(New XAttribute("name", mt.Groups("name").Value))
t.Add(New XAttribute("num1", mt.Groups("id_1").Value))
t.Add(New XAttribute("num2", mt.Groups("id_2").Value))
t.Add(New XAttribute("num3", mt.Groups("id_3").Value))
t.Add(mt.Value)
result.Add(t)
Next mt
Return result
Catch ex As Exception
result.Add(<error><%= ex.ToString %></error>)
Return result
End Try
End Function
Public Function GetTitleUsingXDocument(ByVal fpath$) As XElement
'use this function if your input is well-formed XML
Dim result As New XElement(<result/>)
Try
Dim q = XElement.Load(fpath).Descendants().Where(Function(c) Regex.IsMatch(c.Name.LocalName, "(?is)^font$")).Where(Function(c) Regex.IsMatch(c.Value, Me.titPattern2, RegexOptions.None))
For Each nd As XElement In q
Dim s = Regex.Match(nd.Value, Me.titPattern2, RegexOptions.None)
Dim t As New XElement(<title/>)
t.Add(New XAttribute("name", s.Groups("name").Value))
t.Add(New XAttribute("num1", s.Groups("id_1").Value))
t.Add(New XAttribute("num2", s.Groups("id_2").Value))
t.Add(New XAttribute("num3", s.Groups("id_3").Value))
t.Add(nd.Value)
result.Add(t)
Next nd
Return result
Catch ex As Exception
result.Add(<error><%= ex.ToString %></error>)
Return result
End Try
End Function
'fields
Private titPattern1$ = "(?is)(?<=<font[^<>]*>)(?<id_1>\d+)\s+(?<name>[a-z]+)\s+(?<id_2>\d+):(?<id_3>\d+)(?=\.?</font>)"
Private titPattern2$ = "(?is)^(?<id_1>\d+)\s+(?<name>[a-z]+)\s+(?<id_2>\d+):(?<id_3>\d+)\.?$"
End Class
The usage
Sub Main()
Dim y = New clsTester().GetTitleUsingRegEx("C:\test.htm")
If y.<error>.Count = 0 Then
Console.WriteLine(String.Format("Result from GetTitleUsingRegEx:{0}{1}", vbCrLf, y.ToString))
Else
Console.WriteLine(y...<error>.First().Value)
End If
Console.WriteLine("")
Dim z = New clsTester().GetTitleUsingXDocument("C:\test.htm")
If z.<error>.Count = 0 Then
Console.WriteLine(String.Format("Result from GetTitleUsingXDocument:{0}{1}", vbCrLf, z.ToString))
Else
Console.WriteLine(z...<error>.First().Value)
End If
Console.ReadLine()
End Sub
Hope this helps.

regex pattern: <FONT[^>]*>.*?(\d+).*?(\d+).*?(\d+).*?<\/FONT>

I think @Jean-François Corbett has it right.
Hide it away in a function and never look back.
Change your code to this:
intTitle = GetCodesFromColorTag("<FONT COLOR=""GREEN"">201 KAR 2:340.</FONT>")
Create a new function:
Public Function GetCodesFromColorTag(FontTag As String) As Integer
'length = the characters between ">" and "KAR", minus the separating space
Return CInt(Mid(FontTag, InStr(FontTag, ">") + 1,
InStr(FontTag, "KAR") - InStr(FontTag, ">") - 2))
End Function

Related

Index out of bounds error using Regex Split

I'm posting another question here because the people who answered my last one were extremely helpful. Bear in mind, I'm relatively new to VB.NET.
So I'm working on a program that pulls the first and third columns out of a text file using Regex.Split to eliminate the multiple spaces between the alphanumeric characters in the file.
A high level example of what the text file looks like is here:
VARIABLE1 MEAS1 STORAGE1
VARIABLE2 MEAS2 STORAGE2
VARIABLE3 MEAS3 STORAGE3
VARIABLE4 MEAS4 STORAGE4
VARIABLE5 MEAS5 STORAGE5
VARIABLE6 MEAS6 STORAGE6
#VARIABLE7 MEAS7 STORAGE7
VARIABLE8 MEAS8 STORAGE8
VARIABLE9 MEAS9 STORAGE9
VARIABLE10 MEAS10 STORAGE10
VARIABLE11 MEAS11 STORAGE11
VARIABLE12 MEAS12 STORAGE12
VARIABLE13 MEAS13 STORAGE13
VARIABLE14 MEAS14 STORAGE14
The file uses "#" to denote comments, so my code skips any line containing that character.
However, when I wrote a test function to try this, I kept getting an index out of bounds error, but only on some files; others in the same format work fine, for some reason.
When looking through the execution output, I am receiving the error after it writes the "STORAGE6" line, so there has to be an error traversing from STORAGE6 to VARIABLE7, and I can't quite figure it out. Any insight on this would be extremely appreciated!
The test function I have written is below:
Public Function Testing()
OpenFileDialog1.ShowDialog()
Dim file = System.IO.File.ReadAllLines(OpenFileDialog1.FileName)
For Each line In file
Dim arrWords() As String = System.Text.RegularExpressions.Regex.Split(line, "\s+")
Dim upBound = arrWords.GetUpperBound(0)
If upBound <> 0 Then
If line.Contains("#") Or line.Length = 0 Then
Else
Console.WriteLine(arrWords(0) + " " + arrWords(2))
End If
End If
Next
End Function
I get the out of bounds error when calling "arrWords(2)," which I'm sure was pretty obvious, but just trying to make the question as detailed as possible.
The simple fix is changing these two lines:
If upBound <> 0 Then
If line.Contains("#") Or line.Length = 0 Then
like this:
'arrWords(2) is read below, so at least three fields are needed
If upBound >= 2 Then
If line.TrimStart().StartsWith("#") OrElse String.IsNullOrWhiteSpace(line) Then
But I'd really do something more like this:
Public Class DataItem
Public Property Variable As String
Public Property Measure As String
Public Property Storage As String
End Class
Public Function ReadDataFile(fileName As String) As IEnumerable(Of DataItem)
Return System.IO.File.ReadLines(fileName).
Where(Function(line) Not line.TrimStart().StartsWith("#") AndAlso Not String.IsNullOrWhiteSpace(line)).
Select(Function(line) System.Text.RegularExpressions.Regex.Split(line, "\s+")).
Where(Function(fields) fields.Length = 3).
Select(Function(fields)
Return New DataItem With {
.Variable = fields(0),
.Measure = fields(1),
.Storage = fields(2)}
End Function)
End Function
Public Sub Testing()
If OpenFileDialog1.ShowDialog() = DialogResult.OK Then
Dim records = ReadDataFile(OpenFileDialog1.FileName)
For Each record In records
Console.WriteLine($"{record.Variable} {record.Storage}")
Next
End If
End Sub
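The failing files are not shown, so the cause can only be guessed at. One plausible explanation is a whitespace-only line or a line with fewer than three columns. A short sketch of what Regex.Split returns in those cases (the module and function names are hypothetical):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitEdgeCases
    ' Hypothetical helper: number of elements Regex.Split produces for a line.
    Function FieldCount(line As String) As Integer
        Return Regex.Split(line, "\s+").Length
    End Function

    Sub Main()
        Console.WriteLine(FieldCount("VARIABLE1 MEAS1 STORAGE1")) ' 3
        Console.WriteLine(FieldCount(""))                         ' 1 (a single empty string)
        Console.WriteLine(FieldCount("   "))                      ' 2 (empty strings on both sides of the match)
        Console.WriteLine(FieldCount("VARIABLE1 MEAS1"))          ' 2
    End Sub
End Module
```

In the last two cases the upper bound is 1, so a check of `upBound <> 0` passes, the line has no "#" and a non-zero length, and reading index 2 throws.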

Multiply only numbers in a mixed string [VB.Net]

Let's say I have a string "N4NSD3MKF34MKMKFM53" and I want to multiply each number in it by 2 to get
N8NSD6MKF68MKMKFM106. How would I go about doing this?
Ok, I might as well give you the Regex solution as long as I'm here. But I caution you not to use it unless you understand what it's doing. It's never a good idea to just copy and paste code that you don't fully understand.
Dim input As String = "N4NSD3MKF34MKMKFM53"
Dim output As String = Regex.Replace(
input,
"\d+",
Function(x) (Integer.Parse(x.Value) * 2).ToString())
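For reference, the same idea wrapped in a function, as a sketch. The name MultiplyNumbers and the factor parameter are additions for illustration; `[0-9]+` is used instead of `\d+` so that only ASCII digits reach the parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MultiplySketch
    ' Hypothetical helper: multiplies every run of ASCII digits by factor.
    Function MultiplyNumbers(input As String, factor As Integer) As String
        ' Each match is a whole number such as "34", so it is multiplied as 34,
        ' not digit by digit.
        Return Regex.Replace(input, "[0-9]+",
            Function(m) (Long.Parse(m.Value) * factor).ToString())
    End Function

    Sub Main()
        Console.WriteLine(MultiplyNumbers("N4NSD3MKF34MKMKFM53", 2)) ' N8NSD6MKF68MKMKFM106
        Console.WriteLine(MultiplyNumbers("A19B", 2))                ' A38B
    End Sub
End Module
```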
You can try the following code:
Public Class Program
Public Shared Sub Main(args As String())
Const expression As String = "N4NSD3MKF34MKMKFM53"
Dim result = MultiplyExpression.Calculate(expression)
Console.WriteLine(result)
End Sub
End Class
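Class MultiplyExpression
Public Shared Function Calculate(expression As String) As String
Dim result = String.Empty
Dim digits = String.Empty
For Each c In expression
If c >= "0"c AndAlso c <= "9"c Then
'collect the whole run of digits so "34" is doubled as 34, not as 3 and 4
digits += c
Else
If digits.Length > 0 Then
result += (Integer.Parse(digits) * 2).ToString()
digits = String.Empty
End If
result += c
End If
Next
'flush a number that ends the string
If digits.Length > 0 Then result += (Integer.Parse(digits) * 2).ToString()
Return result
End Function
End Class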
Output: N8NSD6MKF68MKMKFM106

Linq with HashTable Matching

I need another pair of eyes. I've been playing around with this LINQ syntax for scanning a Hashtable with a regular expression, and I can't seem to get it quite right. The goal is to match all keys against one regular expression, then match the values of those entries against a separate regular expression. In the test case below, I should end up with the first three entries.
Private ReadOnly Property Testhash As Hashtable
Get
Testhash = New Hashtable
Testhash.Add("a1a", "abc")
Testhash.Add("a2a", "aac")
Testhash.Add("a3a", "acc")
Testhash.Add("a4a", "ade")
Testhash.Add("a1b", "abc")
Testhash.Add("a2b", "aac")
Testhash.Add("a3b", "acc")
Testhash.Add("a4b", "ade")
End Get
End Property
Public Sub TestHashSearch()
Dim KeyPattern As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex("a.a")
Dim ValuePattern As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex("a.c")
Try
Dim queryMatchingPairs = (From item In Testhash
Let MatchedKeys = KeyPattern.Matches(item.key)
From key In MatchedKeys
Let MatchedValues = ValuePattern.Matches(key.value)
From val In MatchedValues
Select item).ToList.Distinct
Dim info = queryMatchingPairs
Catch ex As Exception
End Try
End Sub
Can't you match both the key and value at the same time?
Dim queryMatchingPairs = (From item In Testhash
Where KeyPattern.IsMatch(item.Key) AndAlso ValuePattern.IsMatch(item.Value)
Select item).ToList
I should have taken a break sooner, then worked a little more. The correct solution uses the original "from item" and not the lower "from key" in the second regular expression. Also, "distinct" is unnecessary for a hashtable.
Dim queryMatchingPairs = (From item In Testhash
Let MatchedKeys = KeyPattern.Matches(item.key)
From key In MatchedKeys
Let MatchedValues = ValuePattern.Matches(item.value)
From val In MatchedValues
Select item).ToList
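A strongly typed variant of the same filter, as a sketch: casting the Hashtable entries to DictionaryEntry avoids late binding on item.Key and item.Value. The module and function names are hypothetical, and the keys are sorted only because a Hashtable does not guarantee enumeration order.

```vbnet
Imports System
Imports System.Collections
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module HashtableFilterSketch
    ' Hypothetical helper: keys whose key AND value both match their patterns.
    Function MatchingKeys(table As Hashtable, keyPattern As Regex, valuePattern As Regex) As List(Of String)
        Return table.Cast(Of DictionaryEntry)().
            Where(Function(e) keyPattern.IsMatch(CStr(e.Key)) AndAlso valuePattern.IsMatch(CStr(e.Value))).
            Select(Function(e) CStr(e.Key)).
            OrderBy(Function(k) k).
            ToList()
    End Function

    Sub Main()
        Dim table As New Hashtable()
        table.Add("a1a", "abc")
        table.Add("a2a", "aac")
        table.Add("a3a", "acc")
        table.Add("a4a", "ade")
        table.Add("a1b", "abc")

        Dim keys = MatchingKeys(table, New Regex("a.a"), New Regex("a.c"))
        Console.WriteLine(String.Join(",", keys)) ' a1a,a2a,a3a
    End Sub
End Module
```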

Split comma delimited string to array using regex

I have a string as below, which needs to be split to an array, using VB.NET
10,"Test, t1",10.1,,,"123"
The result array must have 6 rows as below
10
Test, t1
10.1
(empty)
(empty)
123
So:
1. quotes around strings must be removed
2. comma can be inside strings, and will remain there (row 2 in result array)
3. can have empty fields (comma after comma in source string, with nothing in between)
Thanks
Don't use String.Split(): it's slow, and doesn't account for a number of possible edge cases.
Don't use RegEx. RegEx can be shoe-horned to do this accurately, but to correctly account for all the cases the expression tends to be very complicated, hard to maintain, and at this point isn't much faster than the .Split() option.
Do use a dedicated CSV parser. Options include the Microsoft.VisualBasic.FileIO.TextFieldParser type, FastCSV, linq-to-csv, and a parser I wrote for another answer.
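As an illustration of the first option, here is a minimal sketch that feeds a single line to TextFieldParser through a StringReader. The module and function names are hypothetical; the parser settings shown are the ones that matter for the sample line in the question.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module CsvLineSketch
    ' Hypothetical helper: parses one comma-delimited line into its fields.
    Function SplitCsvLine(line As String) As String()
        Using parser As New TextFieldParser(New StringReader(line))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            ' Quotes are stripped and commas inside them are kept.
            parser.HasFieldsEnclosedInQuotes = True
            Return parser.ReadFields()
        End Using
    End Function

    Sub Main()
        Dim fields = SplitCsvLine("10,""Test, t1"",10.1,,,""123""")
        For Each field In fields
            Console.WriteLine("[" & field & "]")
        Next
    End Sub
End Module
```

For a whole file, the same parser can be constructed from a path and read in a loop until EndOfData is True.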
You can write a function yourself. This should do the trick:
Dim values As New List(Of String)
Dim currentValueIsString As Boolean = False
Dim valueSeparator As Char = ","c
Dim currentValue As String = String.Empty
For Each c As Char In inputString
If c = """"c Then
'a quote toggles "inside a string" and is not part of the value
currentValueIsString = Not currentValueIsString
ElseIf c = valueSeparator AndAlso Not currentValueIsString Then
'a separator outside quotes ends the current value
If String.IsNullOrEmpty(currentValue) Then currentValue = "(empty)"
values.Add(currentValue)
currentValue = String.Empty
Else
currentValue += c
End If
Next
'the last value has no trailing separator, so add it after the loop
If String.IsNullOrEmpty(currentValue) Then currentValue = "(empty)"
values.Add(currentValue)
Here's another simple way that loops by the delimiter instead of by character:
Public Function Parser(ByVal ParseString As String) As List(Of String)
Dim Trimmer() As Char = {Chr(34), Chr(44)}
Parser = New List(Of String)
While ParseString.Length > 1
Dim TempString As String = ""
If ParseString.StartsWith(Trimmer(0)) Then
ParseString = ParseString.TrimStart(Trimmer)
Parser.Add(ParseString.Substring(0, ParseString.IndexOf(Trimmer(0))))
ParseString = ParseString.Substring(Parser.Last.Length)
ParseString = ParseString.TrimStart(Trimmer)
ElseIf ParseString.StartsWith(Trimmer(1)) Then
Parser.Add("")
ParseString = ParseString.Substring(1)
Else
Parser.Add(ParseString.Substring(0, ParseString.IndexOf(Trimmer(1))))
ParseString = ParseString.Substring(ParseString.IndexOf(Trimmer(1)) + 1)
End If
End While
End Function
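This returns a list. If you must have an array, just use the ToArray method when you call the function. Note that it relies on the last field being quoted, and it skips an empty field that directly follows a quoted one.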
Why not just use the split method?
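Dim s As String = "10,""Test, t1"",10.1,,,""123"""
s = s.Replace("""", "")
Dim arr As String() = s.Split(","c)
My VB is rusty, so consider this pseudo-code. Note that it also splits on the comma inside "Test, t1", so it yields 7 elements rather than the 6 asked for.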

Match "THIS" And Replace with "THAT" RegEx Vb.Net

Trying to find out how to find and replace text with corresponding values.
For Example
1) fedex to FedEx
2) nasa to NASA
3) po box to PO BOX
Public Function FindReplace(ByVal s As String) As String
Dim MatchEval As New MatchEvaluator(AddressOf RegexReplace)
Dim Pattern As String = "(?<f1>fedex|nasa|po box)"
Return Regex.Replace(s, Pattern, MatchEval, RegexOptions.IgnoreCase)
End Function
Public Function RegexReplace(ByVal m As Match) As String
Select Case LCase(m.Groups("f1").Value)
Case "fedex"
Return "FedEx"
Case "nasa"
Return "NASA"
Case "po box"
Return "PO BOX"
End Select
End Function
The above code works fine for fixed values, but I don't know how to use it to match values added at run time, like db to Db.
I'd guess that the only thing you need Regex for here is the IgnoreCase option. If so, I would suggest not using Regex at all. Use String functionality instead:
Dim input As String = "fEDeX"
Dim pattern As String = "fedex"
Dim replacement As String = "FedEx"
Dim result As String
result = input.ToLowerInvariant().Replace(pattern, replacement)
Note that this lowercases the whole input, not only the matched text. If that matters, or if you still need Regex, then this should work:
result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase)
Example:
Sub Main()
Dim replacements As New Dictionary(Of String, String)()
replacements.Add("fedex", "FedEx")
replacements.Add("nasa", "NASA")
replacements.Add("po box", "PO BOX")
Dim result As String = Replace("fedex, nAsA, po box, etc", replacements)
End Sub
Private Function Replace(ByVal input As String, ByVal replacements As Dictionary(Of String, String)) As String
For Each item In replacements
input = Regex.Replace(input, item.Key, item.Value, RegexOptions.IgnoreCase)
Next
Return input
End Function
I found a solution using a List and ran a performance test against the Dictionary approach suggested by Anton Kedrov. Both methods take almost the same time to complete, but I don't know whether the dictionary method will hold up for a longer replacement list, because it loops through the entire list to find the matching entry for replacement.
Thank you all for your suggestions and advice.
'declared at module level so that RegexReplace can read it
Private lst As New List(Of String)
Sub Main()
lst.Add("NASA")
lst.Add("FedEx")
lst.Add("PO BOX")
MsgBox(FindReplace("this is testing fedex naSa PO box"))
End Sub
Public Function FindReplace(ByVal s As String) As String
Dim Pattern As String = "(?<f1>fedex|nasa|po box)"
Dim MatchEval As New MatchEvaluator(AddressOf RegexReplace)
Return Regex.Replace(s, Pattern, MatchEval, RegexOptions.IgnoreCase)
End Function
Public Function RegexReplace(ByVal m As Match) As String
Dim Found As String
Found = lst.Find(Function(value As String) LCase(value) = LCase(m.Groups("f1").Value))
Return Found
End Function
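For replacements that are only known at run time, one option is to build the alternation from the lookup table itself instead of hard-coding it. The sketch below is one way to do that, not the poster's method. The names are hypothetical, and the `\b` word boundaries assume every key starts and ends with a letter or digit.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module ReplacementSketch
    ' Hypothetical helper: replaces every key found in input (ignoring case)
    ' with its mapped value, in a single pass over the input.
    Function ApplyReplacements(input As String, replacements As Dictionary(Of String, String)) As String
        If replacements.Count = 0 Then Return input

        ' A case-insensitive copy so that a match like "naSa" finds the key "nasa".
        Dim lookup As New Dictionary(Of String, String)(replacements, StringComparer.OrdinalIgnoreCase)

        ' Escape each key and try longer keys first in case keys overlap.
        Dim alternatives = lookup.Keys.
            OrderByDescending(Function(k) k.Length).
            Select(Function(k) Regex.Escape(k))
        Dim pattern = "\b(?:" & String.Join("|", alternatives) & ")\b"

        Return Regex.Replace(input, pattern,
            Function(m) lookup(m.Value),
            RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim replacements As New Dictionary(Of String, String)()
        replacements.Add("fedex", "FedEx")
        replacements.Add("nasa", "NASA")
        replacements.Add("po box", "PO BOX")
        replacements.Add("db", "Db") ' added at run time

        ' Prints: this is testing FedEx NASA PO BOX and Db
        Console.WriteLine(ApplyReplacements("this is testing fedex naSa PO box and db", replacements))
    End Sub
End Module
```

Because the dictionary lookup is a hash lookup, the cost per match does not grow with the number of entries; only the pattern itself gets longer.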