Regular Expression - Applied to a Text File - regex

I have a text file with the following structure:
KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02 ... DataKey0N-DataValue0N
KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12 DataKey13-DataValue13
_________DataKey14-DataValue14 DataKey1N-DataValue1N (1)
// It is significant that the additional datakeys are on a new line
(1) the underline is not part of the data. I used it to align the data.
Question: How do I use a regex to convert my data to this format?
<KEYWORD0>
<DataKey00>DataValue00</DataKey00>
<DataKey01>DataValue01</DataKey01>
<DataKey02>DataValue02</DataKey02>
<DataKey0N>DataValue0N</DataKey0N>
</KEYWORD0>
<KEYWORD1>
<DataKey10>DataValue10</DataKey10>
<DataKey11>DataValue11</DataKey11>
<DataKey12>DataValue12</DataKey12>
<DataKey13>DataValue13</DataKey13>
<DataKey14>DataValue14</DataKey14>
<DataKey1N>DataValue1N</DataKey1N>
</KEYWORD1>

Regex is for masochists. This is a very simple text parser in VB.NET (converted from C#, so check for bugs):
Imports System.IO
Imports System.Xml
Public Class MyFileConverter
Public Sub Parse(inputFilename As String, outputFilename As String)
Using reader As New StreamReader(inputFilename)
Using writer As New StreamWriter(outputFilename)
Parse(reader, writer)
End Using
End Using
End Sub
Public Sub Parse(reader As TextReader, writer As TextWriter)
Dim line As String
Dim state As Integer = 0
Dim xmlWriter As New XmlTextWriter(writer)
xmlWriter.WriteStartDocument()
xmlWriter.WriteStartElement("Keywords")
' Root element required for conformance
While (InlineAssignHelper(line, reader.ReadLine())) IsNot Nothing
If line.Length = 0 Then
If state > 0 Then
xmlWriter.WriteEndElement()
End If
state = 0
Continue While
End If
Dim parts As String() = line.Split(Function(c) [Char].IsWhiteSpace(c), StringSplitOptions.RemoveEmptyEntries)
Dim index As Integer = 0
If state = 0 Then
state = 1
' Equivalent of the C# parts[index++]: use the current token, then advance
xmlWriter.WriteStartElement(parts(index))
index += 1
End If
While index < parts.Length
Dim keyvalue As String() = parts(index).Split("-"C)
xmlWriter.WriteStartElement(keyvalue(0))
xmlWriter.WriteString(keyvalue(1))
xmlWriter.WriteEndElement()
index += 1
End While
End While
If state > 0 Then
xmlWriter.WriteEndElement()
End If
xmlWriter.WriteEndElement()
xmlWriter.WriteEndDocument()
End Sub
Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, value As T) As T
target = value
Return value
End Function
End Class
Note that I added a root element to the XML because the .NET XML classes only like reading and writing conformant XML.
Also note that the code uses an extension I wrote for String.Split.

^(\w)\s*((\w)\s*)(\r\n^\s+(\w)\s*)*
This is starting to get in the neighborhood but I think this is just easier to do in a programming language... just process the file line by line...
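The line-by-line approach suggested above could be sketched in Python roughly as follows. This is only an illustration, not the poster's code: the function name `convert` is hypothetical, and the rule that a line starting in column 0 begins a new record (while an indented line continues the previous one) is an assumption based on the sample data in the question.

```python
from xml.sax.saxutils import escape


def convert(text):
    """Turn KEYWORD / DataKey-DataValue lines into nested XML-like tags."""
    out = []
    keyword = None
    for line in text.splitlines():
        if not line.strip():
            continue                       # ignore blank lines
        parts = line.split()               # split on any run of whitespace
        if not line[0].isspace():          # column 0: a new KEYWORD record starts
            if keyword is not None:
                out.append(f"</{keyword}>")
            keyword = parts.pop(0)
            out.append(f"<{keyword}>")
        for part in parts:                 # remaining tokens are key-value pairs
            key, _, value = part.partition("-")
            out.append(f"<{key}>{escape(value)}</{key}>")
    if keyword is not None:
        out.append(f"</{keyword}>")        # close the last record
    return "\n".join(out)
```

Splitting each token on the first `-` with `partition` means a value may itself contain further dashes, which the pure regex approaches do not allow.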

You need to use the Groups and Captures features of Regex in .NET and apply something like:
([A-Z\d]+)(\s([A-Za-z\d]+)\-([A-Za-z\d]+))*
1. Find a Match and select the first Group to find the KEYWORD.
2. Loop through the Captures of Groups 3 and 4 to get the DataKey and DataValue pairs for that KEYWORD.
3. Go to 1.
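For comparison, here is how the same keyword-then-pairs idea might look in Python. Python's `re` module keeps only the last capture of a repeated group, so this sketch swaps the .NET Captures loop for a second `findall` pass over the matched tail. The names `RECORD`, `PAIR` and `parse_records` are made up for the example, and the pattern is slightly adapted (`\s+` instead of `\s`, so that continuation lines are included).

```python
import re

# Group 1: the KEYWORD; group 2: everything that looks like " key-value" after it.
RECORD = re.compile(r"^([A-Z\d]+)((?:\s+[A-Za-z\d]+-[A-Za-z\d]+)*)", re.MULTILINE)
# Applied to group 2 to recover every key/value pair, not just the last one.
PAIR = re.compile(r"([A-Za-z\d]+)-([A-Za-z\d]+)")


def parse_records(text):
    """Return a list of (keyword, [(key, value), ...]) tuples."""
    return [(m.group(1), PAIR.findall(m.group(2))) for m in RECORD.finditer(text)]
```

Like the pattern it is based on, this assumes keywords are upper-case letters and digits only, and that keys and values contain no dashes.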

If the DataValue and DataKey items can't contain <, > or - characters or whitespace, you can do something like this:
Read your file into a string and do a replaceAll with a regex similar to this: ([^-\s]+)-([^-\s]+) and use this as the replacement: <$1>$2</$1>. This will convert something like DataKey01-DataValue01 into <DataKey01>DataValue01</DataKey01>.
After that you need to run another global replace, this time with the regex ^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+), and replace with <$1>$2</$1> again.
This should do the trick.
I don't program in VB.NET, so I have no idea if the actual syntax is correct (you might need to double or quadruple the \ in some cases). You should make sure to enable the Multiline option for the second pass.
To explain:
([^-\s]+)-([^-\s]+)
([^-\s]+) will match any run of characters containing neither - nor whitespace. Excluding line breaks as well as spaces and tabs keeps a value at the end of a line from swallowing the KEYWORD on the next line. This is marked as $1 (notice the parentheses around it)
- will match the - char
([^-\s]+) will again match any run of characters containing neither - nor whitespace. This is marked as $2 (notice the parentheses around it)
The replacement just converts a matched ab-cd string into <ab>cd</ab>
After this step the file looks like:
KEYWORD0 <DataKey00>DataValue00</DataKey00> <DataKey01>DataValue01</DataKey01>
<DataKey02>DataValue02</DataKey02> <DataKey0N>DataValue0N</DataKey0N>
KEYWORD1 <DataKey10>DataValue10</DataKey10> <DataKey11>DataValue11</DataKey11>
<DataKey12>DataValue12</DataKey12> <DataKey13>DataValue13</DataKey13>
<DataKey14>DataValue14</DataKey14> <DataKey1N>DataValue1N</DataKey1N>
^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)
^([^ \t]+) mark and match any run of characters other than space or \t at the beginning of the line (this is $1)
( begin a mark
\s+ white space
(?: non marked group starting here
<[^>]+> match an open xml tag: <ab>
[^<]+ match the inside of a tag bc
</[^>]+> match a closing tag </ab>
[\s\n]* some optional white space or newlines
)+ close the non marked group and repeat at least one time
) close the mark (this is $2)
The replacement is straightforward now.
Hope it helps.
But you should probably try to make a simple parser if this is not a one-off job :)
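The two passes described above can be tried out with a short Python sketch. It is an approximation rather than the VB.NET code the question asks for: `PAIR`, `RECORD` and `to_xml` are hypothetical names, and the first pattern is assumed to exclude all whitespace (`[^-\s]`) so that a value at the end of a line cannot run on into the next line.

```python
import re

# Pass 1: key-value -> <key>value</key>
PAIR = re.compile(r"([^-\s]+)-([^-\s]+)")
# Pass 2: KEYWORD followed by one or more tagged pairs -> <KEYWORD>...</KEYWORD>
RECORD = re.compile(
    r"^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)",
    re.MULTILINE,  # the second pass needs ^ to match at every line start
)


def to_xml(text):
    tagged = PAIR.sub(r"<\1>\2</\1>", text)
    return RECORD.sub(r"<\1>\2</\1>", tagged)
```

Because the trailing line break is captured as part of the second group, each closing keyword tag ends up at the start of a new line, immediately followed by the next record's opening tag. The result is well-formed apart from the missing root element, but not pretty-printed.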

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original pattern, \^S*\$, had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The pattern I used, \S+, matches every run of non-whitespace characters (\S is any character that is NOT whitespace; + means one or more times).
I also set .Global to True so matching continues after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
Miscellaneous Notes
I would have advised just using Split(), but you stated that more than one consecutive space could be an issue. If that weren't the case, you wouldn't need regex at all; something like:
arrStr = Split(line)
would have split on every occurrence of a space.
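The difference between the `+` and `*` quantifiers, and between matching tokens and splitting on single spaces, is easy to see in a few lines of Python. This is only an illustration of the regex behaviour discussed above, not VBA; the sample strings and variable names are made up.

```python
import re

line = "This  is \t a test"   # note the repeated space and the tab

# \S+ collects each run of non-whitespace characters.
words = re.findall(r"\S+", line)

# \S* also succeeds with zero characters, so empty matches show up.
loose = re.findall(r"\S*", "a b")

# Splitting on a single space leaves empty entries between consecutive spaces.
naive = "a  b".split(" ")

print(words)   # ['This', 'is', 'a', 'test']
print(naive)   # ['a', '', 'b']
```

The empty strings in `loose` and `naive` are the reason a "one or more" token pattern is the safer choice when the input may contain repeated whitespace.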

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out, place it in a string, then run other regexes to get the information that I need. Some paragraphs won't contain the data that I need, so the idea was to loop through each individual paragraph and handle errors better if the data I need wasn't in that paragraph (i.e. get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary, a
Dim i As Long
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
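A quick way to experiment with this kind of pattern outside VBA is a small Python sketch. Two things here are my own assumptions rather than part of the answer: the closing dash is checked with a look-ahead, `(?=-)`, so that one dash can both end a paragraph and start the next, and Windows line endings are normalised first because the pattern only looks for `\n`. The names `PARAGRAPH` and `extract_paragraphs` are hypothetical.

```python
import re

# "-" line, paragraph number, then the paragraph body (lazy, may span lines),
# ending just before the next "-" line.
PARAGRAPH = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*\n(?=-)")


def extract_paragraphs(document):
    """Return the text of every numbered paragraph between '-' lines."""
    normalised = document.replace("\r\n", "\n")   # assume CRLF files are possible
    return PARAGRAPH.findall(normalised)
```

If the file is read with `\r\n` line endings and they are not normalised, a pattern that matches only `\n` will not find the delimiters, which is worth ruling out when a regex works in an online tester but not in code.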

Incorrect use of regex wildcards

Is this not the correct use of wildcards? I'm attempting to match a String that contains a date. I don't want to include the date in the returned String, or the text that precedes the matched String.
object FindText extends App{
val toFind = "find1"
val line = "this is find1 the line 1 \n 21/03/2015"
val find = (toFind+".*\\d{2}/\\d{2}/\\d{4}").r
println(find.findFirstIn(line))
}
Output should be: "find1 the line 1 \n "
but the String is not found.
Dot does not match newline characters by default. You can set the DOTALL flag, (?s), to make it happen. I have also added a positive look-ahead (the (?=...) thingy), since you did not want the date to be included in the match:
val find = (toFind+"""(?s).*(?=\d{2}/\d{2}/\d{4})""").r
(Note also that in Scala you do not need to escape special characters in strings enclosed in triple quotes ... pretty neat.)
The problem lies with the newline in the test string: .* does not match newlines. Replacing it with .*\\n?.* should fix it. One could also use the DOTALL flag, (?s), in the regex, such as:
val find = ("(?s)"+toFind+".*\\d{2}/\\d{2}/\\d{4}").r
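The same two ideas (DOTALL so that `.` crosses the line break, and a look-ahead so the date is required but not returned) can be checked in Python, whose `re` module behaves the same way for these features. The variable names below are illustrative only.

```python
import re

to_find = "find1"
line = "this is find1 the line 1 \n 21/03/2015"

# Without DOTALL, '.' stops at the newline, so the date is never reached.
without_dotall = re.search(re.escape(to_find) + r".*\d{2}/\d{2}/\d{4}", line)

# With DOTALL and a look-ahead, the match runs up to (but excludes) the date.
pattern = re.compile(re.escape(to_find) + r".*(?=\d{2}/\d{2}/\d{4})", re.DOTALL)
match = pattern.search(line)

print(without_dotall)          # None
print(repr(match.group(0)))    # 'find1 the line 1 \n '
```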

Visual Studio regex to remove all comments and blank lines in VB.NET code using a macro

I was trying to remove all comments and empty lines in a file with the help of a macro. I came up with this solution, which deletes the comments (there is a bug, described below) but is not able to delete the blank lines in between:
Sub CleanCode()
Dim regexComment As String = "(REM [\d\D]*?[\r\n])|(?<SL>\'[\d\D]*?[\r\n])"
Dim regexBlank As String = "^[\s|\t]*$\n"
Dim replace As String = ""
Dim selection As EnvDTE.TextSelection = DTE.ActiveDocument.Selection
Dim editPoint As EnvDTE.EditPoint
selection.StartOfDocument()
selection.EndOfDocument(True)
DTE.UndoContext.Open("Custom regex replace")
Try
Dim content As String = selection.Text
Dim resultComment As String = System.Text.RegularExpressions.Regex.Replace(content, regexComment, replace)
Dim resultBlank As String = System.Text.RegularExpressions.Regex.Replace(resultComment, regexBlank, replace)
selection.Delete()
selection.Collapse()
Dim ed As EditPoint = selection.TopPoint.CreateEditPoint()
ed.Insert(resultBlank)
Catch ex As Exception
DTE.StatusBar.Text = "Regex Find/Replace could not complete"
Finally
DTE.UndoContext.Close()
DTE.StatusBar.Text = "Regex Find/Replace complete"
End Try
End Sub
So, here is what it should look like before and after running the macro.
BEFORE
Public Class Class1
Public Sub New()
''asdasdas
Dim a As String = "" ''asdasd
''' asd ad asd
End Sub
Public Sub New(ByVal strg As String)
Dim a As String = ""
End Sub
End Class
AFTER
Public Class Class1
Public Sub New()
Dim a As String = ""
End Sub
Public Sub New(ByVal strg As String)
Dim a As String = ""
End Sub
End Class
There are two main problems with the macro:
It cannot delete the blank lines in between.
If there is a piece of code which goes like this
Dim a as String = "Name='Soham'"
Then After running the macro it becomes
Dim a as String = "Name='"
To get rid of a line that contains whitespace or nothing, you can use this regex:
(?m)^[ \t]*[\r\n]+
Your regex, ^[\s|\t]*$\n would work if you specified Multiline mode ((?m)), but it's still incorrect. For one thing, the | matches a literal |; there's no need to specify "or" in a character class. For another, \s matches any whitespace character, including TAB (\t), carriage-return (\r), and linefeed (\n), making it needlessly redundant and inefficient. For example, at the first blank line (after the end of the first Sub), the ^[\s|\t]* will initially try to match everything before the word Public, then it will back off to the end of the previous line, where the $\n can match.
But a blank line, in addition to being empty or containing only horizontal whitespace (spaces or TABs), may also contain a comment. I choose to treat these "comment-only" lines as blank lines because it's relatively easy to do, and it simplifies the task of matching comments in non-blank lines, which is much harder. Here's my regex:
^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+
After consuming any leading horizontal whitespace, if I see a REM or ' signifying a comment, I consume that and everything after it until the next line separator. Notice that the only thing that's required to be present is the line separator itself. Also notice the absence of the end anchor, $. It's never necessary to use that when you're explicitly matching the line separators, and in this case it would break the regex. In Multiline mode, $ matches only before a linefeed (\n), not before a carriage-return (\r). (This behavior of the .NET flavor is incorrect and rather surprising, given Microsoft's longstanding preference for \r\n as a line separator.)
Matching the remaining comments is a fundamentally different task. As you've discovered, simply searching for REM or ' is no good because you might find it in a string literal, where it does not signify the start of a comment. What you have to do is start from the beginning of the line, consuming and capturing anything that's not the beginning of a comment or a string literal. If you find a double-quote, go ahead and consume the string literal. If you find a REM or ', stop capturing and go ahead and consume the rest of the line. Then you replace the whole line with just the captured portion--i.e., everything before the comment. Here's the regex:
(?mn)^(?<line>[^\r\n"R']*(("[^"]*"|(?!REM)R)[^\r\n"R']*)*)(REM|')[^\r\n]*
Or, more readably:
(?mn)                    # Multiline and ExplicitCapture modes
^                        # beginning of line
(?<line>                 # capture in group "line"
  [^\r\n"R']*            # any number of "safe" characters
  (
    (
      "[^"]*"            # a string literal
      |
      (?!REM)R           # 'R' if it's not the beginning of 'REM'
    )
    [^\r\n"R']*          # more "safe" characters
  )*
)                        # stop capturing
(?:REM|')                # a comment sigil
[^\r\n]*                 # consume the rest of the line
The replacement string would be "${line}". Some other notes:
Notice that this regex does not end with [\r\n]+ to consume the line separator, like the "blank lines" regex does.
It doesn't end with $ either, for the same reason as before. The [^\r\n]* will greedily consume everything before the line separator, so the anchor isn't needed.
The only thing that's required to be present is the REM or '; we don't bother matching any line that doesn't contain a comment.
ExplicitCapture mode means I can use (...) instead of (?:...) for all the groups I don't want to capture, but the named group, (?<line>...), still works.
Gnarly as it is, this regex would be a lot worse if VB supported multiline comments, or if its string literals supported backslash escapes.
I don't do VB, but here's a demo in C#.
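For readers who want to experiment without Visual Studio, the two regexes above translate to Python fairly directly. This is a sketch under a few assumptions: Python has no ExplicitCapture mode, so the unnamed groups are written as `(?:...)` and the named group as `(?P<line>...)`; the names `BLANK_OR_COMMENT_LINE`, `TRAILING_COMMENT` and `strip_comments` are made up; and trimming the whitespace left in front of a removed trailing comment is my own addition.

```python
import re

# Blank or comment-only lines, together with their line separator.
BLANK_OR_COMMENT_LINE = re.compile(
    r"^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+", re.MULTILINE
)

# Code followed by a trailing comment; the code part is kept in group "line".
TRAILING_COMMENT = re.compile(
    r"""^(?P<line>[^\r\n"R']*(?:(?:"[^"]*"|(?!REM)R)[^\r\n"R']*)*)(?:REM|')[^\r\n]*""",
    re.MULTILINE,
)


def strip_comments(source):
    """Remove blank lines, comment-only lines and trailing comments."""
    source = BLANK_OR_COMMENT_LINE.sub("", source)
    # Keep only the code before the comment, minus the spaces that preceded it.
    return TRAILING_COMMENT.sub(lambda m: m.group("line").rstrip(), source)
```

Because string literals are consumed as a unit before a comment sigil is looked for, a line such as Dim a as String = "Name='Soham'" is left untouched.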
I've just checked with the two examples from above: '+{.+}$ should do. Optionally, you could go with ('|'')+{.+}$, but note that the first solution also replaces the XML descriptions:
''' <summary>
''' Method Description
''' </summary>
''' <remarks></remarks>
Sub Main()
''first comment
Dim a As String = "" 'second comment
End Sub
Edit: if you use ('+{.+}$|^$\n) it deletes a) all comments and b) all empty lines. However, if you have a comment with an End Sub/Function following it, it takes it up one line, which results in a compiler error.
Before
''' <summary>
'''
''' </summary>
''' <remarks></remarks>
Sub Main()
''first comment
Dim a As String = "" 'second comment
End Sub
''' <summary>
'''
''' </summary>
''' <returns></returns>
''' <remarks></remarks>
Public Function asdf() As String
Return "" ' returns nothing
End Function
After
Sub Main()
Dim a As String = ""
End Sub
Public Function asdf() As String
Return ""
End Function
Edit: To delete any empty lines, search for the regex ^$\n and replace it with an empty string.
Delete the comments first using this regex
'+\s*(\W|\w).+
'+ - one or more ' for the beginning of each comment.
\s* - if there are spaces after the comment.
(\W|\w).+ - anything that follows except for line terminators.
Then remove the blank lines left using the regex Mr. Alan Moore provided.

Parsing Excel reference with regular expression?

Excel returns a reference of the form
=Sheet1!R14C1R22C71junk
("junk" won't normally be there, but I want to be sure that there's no extraneous text.)
I would like to 'split' this into a VB array, where
a(0)="Sheet1"
a(1)="14"
a(2)="1"
a(3)="22"
a(4)="71"
a(5)="junk"
I'm sure it can be done easily with a regular expression, but I just can't get the hang of it.
Is there a kind soul who could help me?
Thanks
=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)
should work.
[^!]+ matches a sequence of non-exclamation-point characters.
\d+ matches a sequence of digits.
.* matches anything.
So, in VB.NET:
Dim a As Match
a = Regex.Match(SubjectString, "=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")
If a.Success Then
' matched text: a.Value
' backreference n text: a.Groups(n).Value
Else
' Match attempt failed
End If
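To see what the capture groups produce for the sample reference, the same pattern can be run in Python. The function name `split_reference` is hypothetical; the pattern is the one given above, unchanged.

```python
import re

REFERENCE = re.compile(r"=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")


def split_reference(ref):
    """Return [sheet, row1, col1, row2, col2, rest], or None if ref doesn't match."""
    m = REFERENCE.match(ref)
    return list(m.groups()) if m else None


print(split_reference("=Sheet1!R14C1R22C71junk"))
# ['Sheet1', '14', '1', '22', '71', 'junk']
```

When there is no extraneous text, the last element is simply an empty string, which makes it easy to test for junk.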
A straightforward String.Split would work, provided the "junk" text wasn't there:
Dim input As String = "=Sheet1!R14C1R22C71"
Dim result = input.Split(New Char() { "="c, "!"c, "R"c, "C"c }, StringSplitOptions.RemoveEmptyEntries)
For Each item As String In result
Console.WriteLine(item)
Next
The regex gets a little tricky since you will need to go through the Groups and Captures of the nested portions to get the proper order.
EDIT: here's my regex solution. It accepts multiple occurrences of R's and C's.
Dim input As String = "=Sheet1!R14C1R22C71junk"
Dim pattern As String = "=(?<Sheet>Sheet\d+)!(?:R(?<R>\d+)C(?<C>\d+))+"
Dim m As Match = Regex.Match(input, pattern)
If m.Success Then
Console.WriteLine(m.Groups("Sheet").Value)
For i = 0 To m.Groups("R").Captures.Count - 1
Console.WriteLine(m.Groups("R").Captures(i).Value)
Console.WriteLine(m.Groups("C").Captures(i).Value)
Next
End If
Pattern explanation:
"=(?<Sheet>Sheet\d+)" : matches an = sign followed by "Sheet" and digits. Uses a named group, "Sheet".
"!(?:R(?<R>\d+)C(?<C>\d+))+" : matches the exclamation mark followed by at least one occurrence of the RxxCxx portion of the text. Named groups "R" and "C" are used.
"(?:...)+" : this part of the pattern above groups but does not capture the inner pattern (i.e., the R/C part). This avoids capturing it unnecessarily, since the named groups already capture the parts we want.
More general regexes for R1C1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:R((?<RAbs>\d+)|(?<RRel>\[-?\d+\]))C((?<CAbs>\d+)|(?<CRel>\[-?\d+\]))){1,2}$
And A1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:(?<Col1>\$?[a-z]+)(?<Row1>\$?\d+))(?:\:(?<Col2>\$?[a-z]+)(?<Row2>\$?\d+))?$
It doesn't match external references like =[Book1]Sheet1!A1 though.
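As a rough check of the A1-style pattern, here is a Python rendering. Two assumptions are involved: named groups are written in Python's `(?P<name>...)` form, and matching is made case-insensitive, since the pattern uses `[a-z]` for column letters while Excel writes them in upper case. `A1_REFERENCE` and `parse_a1` are illustrative names.

```python
import re

A1_REFERENCE = re.compile(
    r"^=(?:(?P<Sheet>[^!]+)!)?"
    r"(?P<Col1>\$?[a-z]+)(?P<Row1>\$?\d+)"
    r"(?::(?P<Col2>\$?[a-z]+)(?P<Row2>\$?\d+))?$",
    re.IGNORECASE,  # assumption: column letters may be upper case
)


def parse_a1(ref):
    """Return a dict of the named parts, or None if ref is not an A1 reference."""
    m = A1_REFERENCE.match(ref)
    return m.groupdict() if m else None
```

Parts that are absent, such as the sheet name in `=C5` or the second corner of a single-cell reference, come back as `None` in the dictionary.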