Using Regex to parse ASCII protocol - regex

I'm working on a simple application that interacts with a device via a Telnet session using an ASCII-based protocol.
There will be a lot of interaction with the device, so I'm looking for a fast way to parse the incoming string. The manufacturer was kind enough to release their regex scheme, but since regex is very new to me, I don't understand how to retrieve the values. I know how to test for a match, but when there is a match I want to get the value out of it.
Regex scheme
NameAndValue := [A-Z_]+:("(\\.|[^"\\])*"|(\\.|[^\s"\\])*)
Value := ("(\\.|[^"\\])*"|(\\.|[^\s"\\])*)
ValueUnquoted := (\\.|[^\s"\\])*
ValueQuoted := "(\\.|[^"\\])*"
CharQuoted := (\\.|[^"\\])
CharUnquoted := (\\.|[^\s"\\])
EscapedChar := \\.
CharCommon := [^\s"\\]
CharEscape := \\
CharQuote := "
CharSpace := \s
Example of a response
CMD1:"string value" CMD2:1 CMD3:"string value again" <LF> or <CR>+<LF>
I've read a lot of documentation and tried lots of approaches; could someone point me in the right direction?
I did write a simple parser that finds the index positions of the commands and their values and then uses a substring to retrieve only the value. It works, but I would prefer a nicer way that uses the power of regex.
--------- EDIT 18-10-2017 ---------
At the request of #VBobCat, here is a more detailed parsing requirement.
Let's say I have an object with the properties Foo and Bar, and a second object with the properties Cat and Dog.
When I receive the string via Telnet, I have to parse it into one of those objects. Luckily, the string always begins with the name of what it holds: say X for the object with Foo and Bar, and Animal for the object with Cat and Dog.
With the provided regex I want to parse the values in the string into the properties of the object. Something like:
X CMD1_Foo:1 CMD2_Bar:"string value" <LF> or <CR>+<LF>
Object X.Foo = CMD1_Foo.value
Object X.Bar = CMD2_Bar.value
OR
Animal CMD1_Cat:"Miauw" CMD2_Dog:"woef" <LF> or <CR>+<LF>
Object Animal.Cat = CMD1_Cat.value
Object Animal.Dog = CMD2_Dog.value

If all your samples are consistent with your example, this could work:
Function ParseTelnet(input As String) As DataTable
    Dim retTable As New DataTable
    retTable.Columns.Add("command", GetType(String))
    retTable.Columns.Add("value", GetType(String))
    ' Split on whitespace that is directly followed by "NAME:".
    Dim entries = System.Text.RegularExpressions.Regex.Split(input, "\s+(?=\w+:)")
    ' Trim each entry and split it into (command, value) at the first colon.
    Dim pairs = entries.Select(
        Function(entry) If(entry, "").Trim(Chr(9), Chr(10), Chr(13), Chr(32)).Split({":"c}, 2)).Where(
        Function(pair) pair.Length = 2)
    For Each pair In pairs
        ' The length check keeps a lone quote character from causing a negative Substring length.
        If pair(1).Length >= 2 AndAlso pair(1).StartsWith("""") AndAlso pair(1).EndsWith("""") Then
            ' Quoted value: strip the surrounding quotes.
            retTable.Rows.Add(pair(0), pair(1).Substring(1, pair(1).Length - 2))
        Else
            retTable.Rows.Add(pair(0), pair(1))
        End If
    Next
    Return retTable
End Function
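As an alternative sketch (in Python rather than the VB.NET used above), the manufacturer's NameAndValue expression can be used almost as published by adding one capture group around the name and one around the value. The function name parse_response is hypothetical. The name class is widened from [A-Z_] to [A-Za-z0-9_] as an assumption, because the placeholder names in the question (CMD1, CMD2_Bar) contain digits and lowercase letters that the published scheme would not match.

```python
import re

# Manufacturer's NameAndValue scheme with two capture groups added.
# Assumption: the name class is widened to [A-Za-z0-9_] so that the
# placeholder names from the question (CMD1, CMD2_Bar) also match.
PAIR = re.compile(r'([A-Za-z0-9_]+):("(?:\\.|[^"\\])*"|(?:\\.|[^\s"\\])*)')


def parse_response(line):
    """Return a dict mapping each command name to its unquoted, unescaped value."""
    result = {}
    for name, raw in PAIR.findall(line):
        if raw.startswith('"'):
            # Only the ValueQuoted alternative can start with a quote,
            # so the match also ends with one: drop both.
            raw = raw[1:-1]
        # Replace every escape sequence \x with the bare character x.
        result[name] = re.sub(r'\\(.)', r'\1', raw)
    return result
```

A leading type word such as X or Animal has no colon after it, so the pattern skips it; it can be read separately (for example with line.split(None, 1)[0]) to decide which object the returned values belong to.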

Related

golang extract unique key, value from a key=value pair string using regex

I have the following go string:
dbConnStr := "user=someone password=something host=superduperhost sslmode=something"
but the k=v pair code may be in any order, for example:
dbConnStr := "host=superduperhost user=someone password=something"
Notice the difference in the key order and also the missing "sslmode" key in the str.
Also, it is possible that instead of whitespace, the individual k,v pairs may be separated by newline too.
Now I want to extract the unique keys and their corresponding values from the given string, using regexp. If it will help, I can give a list of all the possible keys that may come (username, password, host, sslmode), but I would ideally like a regex solution that works with any list of keys and values.
How can I do this? I understand that it may be possible with regexp.FindStringSubmatch, but I am not able to wrap my head around writing the regexp.
I got an answer to this from the golang-nuts group.
var rex = regexp.MustCompile("(\\w+)=(\\w+)")
conn := `user=someone password=something host=superduperhost
sslmode=something`
data := rex.FindAllStringSubmatch(conn, -1)
res := make(map[string]string)
for _, kv := range data {
	k := kv[1]
	v := kv[2]
	res[k] = v
}
fmt.Println(res)
Golang Playground url: https://play.golang.org/p/xSEX1CAcQE
Personally I would look at something like:
(user|password|host)=([\w]+)
giving you the key in \1 and the value in \2.
Example on playground:
https://play.golang.org/p/6-Ler6-MrY
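Both ideas above (any key, or only a known list of keys) can be combined in one helper. The following is a minimal sketch in Python rather than Go; the name parse_pairs and its keys parameter are hypothetical, and values are assumed to consist of word characters only, as in the examples above.

```python
import re


def parse_pairs(text, keys=None):
    """Extract key=value pairs; if keys is given, only those keys are matched."""
    # Build an alternation from the allowed keys, or accept any word as a key.
    key_pattern = "|".join(map(re.escape, keys)) if keys else r"\w+"
    # \b stops a key such as "user" from matching inside "superuser".
    pattern = re.compile(rf"\b({key_pattern})=(\w+)")
    # findall returns (key, value) tuples; a later duplicate key overwrites an earlier one.
    return dict(pattern.findall(text))
```

Because the pattern only looks for key=value tokens, the order of the pairs and the separator between them (spaces or newlines) do not matter.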

How to insert a new line after each occurrence of a particular format in a text field

I have a system that I can output a spreadsheet from. I then take this outputted spreadsheet and import it into MS Access. There, I run some basic update queries before merging the final result into a SharePoint 2013 Linked List.
The spreadsheet I output has an unfortunate Long Text field which has some comments in it, which are vital. On the system that hosts the spreadsheet, these comments are nicely formatted. When the spreadsheet is output, though, the field turns into a long, very unpretty string like so:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request for more information. 15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request. 17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
What I would like to do is run a query (either in MS Access or MS Excel) which can scan this field, detect occurrences of "##:## on ##/##/####, Firstname Surname. :-" and then automatically insert a line break before them, so this text is more neatly formatted. It would obviously skip the first occurrence of this format, as otherwise it would enter a new line at the start of the field. Ideal end result would be:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request
for more information.
15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request.
17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
To be honest, I haven't tried much myself so far, as I really don't know where to start. I don't know if this can be done without regular expressions, or within a simple query versus VBA code.
I did start building a regular expression, like so:
[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,\s
But this looks a little ridiculous and I'm fairly certain I'm going about it in a very unnecessary way. From what I can see from the text, detecting the next occurrence of "##:## on ##/##/####" should be enough. If I take a new line after this, that will suffice.
You have your RegExp pattern; now you need a function that inserts your extra delimiter in front of each match it finds.
Look at the function below. It takes your long string, finds the date stamps using your pattern, and prepends your delimiter to every match after the first.
Ideally, I would run each line twice and add delimiters after each column so that you have a string like
datestamp;firstname lastname;comment
You can then use arr = VBA.Split(text, ";") to get your data into an array and use it as
date-stamp = arr(0)
name = arr(1)
comment = arr(2)
Public Function FN_REGEX_REPLACE(iText As String, iPattern As String, iDelimiter As String) As String
    Dim objRegex As Object
    Dim allmatches As Variant
    Dim I As Long
    On Error GoTo FN_REGEX_REPLACE_Error
    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
        .Multiline = True
        .Global = True
        .IgnoreCase = True
        .Pattern = iPattern
        If .test(iText) Then
            Set allmatches = .Execute(iText)
            If allmatches.count > 0 Then
                For I = 1 To allmatches.count - 1 ' for i = 0 to count will start from first match
                    iText = VBA.Replace(iText, allmatches.item(I), iDelimiter & allmatches.item(I))
                Next I
            End If
        End If
    End With
    FN_REGEX_REPLACE = Trim(iText)
    Set objRegex = Nothing
    On Error GoTo 0
    Exit Function
FN_REGEX_REPLACE_Error:
    MsgBox Err.description
End Function
Use the above function as follows:
mPattern = "[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,"
replacedText = FN_REGEX_REPLACE(originalText,mPattern,vbnewline)
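The same idea can also be expressed as a single substitution. The sketch below is in Python rather than VBA and swaps in a different technique: instead of looping over the matches, it replaces the whitespace in front of each time stamp, using a lookahead so the stamp itself is left untouched. The name break_before_stamps is hypothetical, and the input is assumed to have at least one whitespace character before every stamp except the first.

```python
import re

# One or more whitespace characters that are directly followed by
# "##:## on ##/##/####,". The lookahead does not consume the stamp.
BEFORE_STAMP = re.compile(r"\s+(?=\d{2}:\d{2} on \d{2}/\d{2}/\d{4},)")


def break_before_stamps(text, delimiter="\n"):
    """Put each time-stamped comment on its own line."""
    # strip() removes leading whitespace, so the first stamp is never
    # preceded by whitespace and therefore never gets a delimiter.
    # A function is used as the replacement so the delimiter is taken literally.
    return BEFORE_STAMP.sub(lambda match: delimiter, text.strip())
```

Since each position in the text is rewritten at most once, repeated identical time stamps do not receive more than one delimiter.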
Excel uses LF for linebreaks, Access uses CRLF.
So it should suffice to run a simple replacement query:
UPDATE myTable
SET LongTextField = Replace([LongTextField], Chr(10), Chr(13) & Chr(10))
WHERE <...>
You need to make sure that this runs only once on newly imported records, not repeatedly on all records.
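The "only once" caveat exists because a plain replacement of Chr(10) would turn an existing CR+LF into CR+CR+LF on a second run. The following Python sketch (not an Access query; the name lf_to_crlf is hypothetical) only illustrates the logic of a conversion that is safe to repeat, by touching a line feed only when no carriage return precedes it.

```python
import re


def lf_to_crlf(text):
    """Convert bare LF line breaks to CRLF, leaving existing CRLF pairs alone."""
    # (?<!\r) is a negative lookbehind: the LF must not follow a CR.
    return re.sub(r"(?<!\r)\n", "\r\n", text)
```

Running it twice gives the same result as running it once, so it would not matter whether a record had already been converted.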

Search for an item in a text file using UIMA Ruta

I have been trying to search for an item which is there in a text file.
The text file is like
Eg: `
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ`
So I did a dictionary search for XYZ initially and found the positions, but I want only the first XYZ and not the rest. XYZ has one useful property: it will always be between the 5-digit code and the text MethodName.
I am unable to do that.
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also how do we use REGEX in UIMA RUTA?
There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # #type{-> UNMARK(type)};
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} #Method;
There are two options to use regex in UIMA Ruta:
(find) simple regex rules like "[A-Za-z]+" -> Type;
(matches) REGEXP conditions for validating the match of a rule element like
ANY{REGEXP("[A-Za-z]+")-> Type};
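The difference between the two styles can be illustrated outside Ruta. The sketch below is plain Python, not Ruta syntax, and is only an analogy: finditer stands in for the "find" style, fullmatch for the "matches" style, and the hypothetical helper item_between_code_and_method applies them to the sample text from the question, assuming the tokens are separated by whitespace.

```python
import re

SAMPLE = """>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ"""


def find_words(text):
    """'find' style: return every span of the text that the pattern covers."""
    return [m.group() for m in re.finditer(r"[A-Za-z]+", text)]


def is_five_digit_code(token):
    """'matches' style: accept a token only if the pattern covers all of it."""
    return re.fullmatch(r"\d{5}", token) is not None


def item_between_code_and_method(text):
    """Return the token that sits between a 5-digit code and 'MethodName'."""
    tokens = text.split()
    # Slide a window of three neighbouring tokens over the text.
    for prev, item, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if is_five_digit_code(prev) and nxt == "MethodName":
            return item
    return None
```

For the sample, item_between_code_and_method(SAMPLE) picks the XYZ on the third line and ignores the one at the end.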
Let me know if something is not clear. I will extend the description then.
DISCLAIMER: I am a developer of UIMA Ruta

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids not the actual handles so this regex seems to return a mention (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
// filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
// try to pull out the mentions.
d = foreach c generate id,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3;
But instead of returning each user mentioned, this returned the same mention three times. I had expected that by copying and pasting the expression again I'd get the second match, and that pasting it a third time would get the third match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL UDF, keep in mind that it is just a plain regex handled by Java.
It is easier to test the regex with a local Java test. Here is the regex I found to be acceptable:
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
// Each lazy .*? sits inside its optional group, so the engine has to search
// for the next '#' before it is allowed to skip the group.
Pattern p = Pattern.compile("#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
With this regex, if there is at least one mention, it returns three fields, the second and/or third being null if a second or third mention is not found.
Therefore, you may use the following Pig code:
d = foreach c generate id, REGEX_EXTRACT_ALL(
tweet, '#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.
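The "array of matches" behaviour the question wishes for can be sketched in a few lines. This is Python rather than Pig or Java, and it swaps in a simpler technique: collect every mention with findall, keep the first three, and pad with None. The name first_mentions and its limit parameter are hypothetical; mentions are assumed to be written as # followed by non-whitespace characters, as in the examples above.

```python
import re


def first_mentions(tweet, limit=3):
    """Return exactly `limit` entries: the first mentions found, padded with None."""
    # findall returns the captured name of every mention, in order of appearance.
    found = re.findall(r"#(\S+)", tweet)[:limit]
    return found + [None] * (limit - len(found))
```

For example, first_mentions("#tim bla #sam #joe #bill something") keeps tim, sam and joe and drops the fourth mention, while a tweet with a single mention yields that name followed by two None entries.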

From MatchCollection to Array or Pair of Values

Is there a way to put regex captures directly into an array without the intervening MatchCollection?
I want something like: Set myArray = myRegEx.Execute(myString)(0).SubMatches
Or at a minimum, if I knew the number of captures, something that lets me "tie" the return values to variables: (myFirst, mySecond) = myRegEx.Execute(myString)(0).SubMatches
I know that this use of SubMatches is made up; I'm just trying to find a way to bypass the intervening MatchCollection.
OK, maybe this will get you started.
From Outlook macro to Excel macro is not something I would do. Instead, I would recommend binding one application to the other and doing whatever you need to do with both object model references exposed to the VBProject.
But in any case you can do it the way you describe, and this should be an example.
This example assumes that Excel is already open, and further that the Workbook which contains the macro is also open in that instance of Excel. In Excel I create a simple procedure which accepts a generic Object argument. I do this to avoid needing an explicit reference to the Microsoft VBScript Regular Expressions library.
This way, you have a macro in Excel which accepts (requires, actually) an object variable. In this case it is going to be a SubMatches object. (Make sure to change "Book9" to the name of your workbook, or modify as needed to allow user to select/open a workbook, etc.)
Sub excelmacro(SM As Object)
    MsgBox SM.Count & " submatches"
End Sub
Now, I have a very simple Outlook procedure to test this and verify that it works. In this case there will be no submatches, so the Excel procedure above will display a message box reading "0 submatches".
Sub test_to_Excel()
    '### Requires reference to Microsoft VBScript Regular Expressions 5.0 ###
    Dim re As New RegExp
    Dim mySubmatches As SubMatches
    Dim xl As Object 'Excel.Application
    Dim wb As Object 'Excel.Workbook
    With re
        .Global = True
        .Pattern = "asd"
        '## Now get a handle on the particular indexed match.submatches()
        Set mySubmatches = .Execute("asdfkjasdfj; asdl asdfklwedrewn adg")(1).SubMatches
    End With
    '## Now we can send to Excel procedure:
    '## Assumes Excel is already running and the file which contains the macro
    '   is already open
    Set xl = GetObject(, "Excel.Application")
    Set wb = xl.Workbooks("Book9")
    '## This tells the Excel application to run a named procedure
    '   and passes the variable argument(s) to that procedure
    xl.Application.Run "excelmacro", mySubmatches
End Sub
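For comparison, the "tie the captures to variables" idea from the question looks like this in a language whose regex API exposes the capture groups as a tuple. This is a Python sketch, not VBA; the helper name captures is hypothetical.

```python
import re


def captures(pattern, text):
    """Return the capture groups of the first match as a tuple, or None if there is no match."""
    match = re.search(pattern, text)
    return match.groups() if match else None
```

When the number of groups is known in advance, the tuple can be unpacked directly, for example first, second = captures(r"(\w+)=(\w+)", "key=value"); the None result for a failed match has to be checked before unpacking.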