I am trying to write Python code that extracts information from strings such as the one below, using regular expressions.
date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic" subtype="forward" level="notice" vd="root" eventtime=1572127141 srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan" dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan" poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6 action="accept" policyid=4 policytype="policy" service="HTTPS" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=168.168.140.247 transport=294 appid=537 app="Google.Ads" appcat="General.Interest" apprisk="elevated" applist="Seniors" appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19 rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux" devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d" srcserver=0
I found someone's code on GitHub, and he uses the lines below to extract the information. However, his code doesn't extract all of the fields I require, most notably srcip=192.168.1.105.
I don't want to post the guy's entire code as it's not mine. However, if it is required I can.
I am hoping all the fields will be extracted from the jumble of information so I can save them as a .csv file.
The regex \w+=([^\s"]+|"[^"]*") matches
The field name (at least one word character), then
An = sign, then
Either:
An unquoted field value (at least one character, excluding whitespace and quotes), or
A quoted field value (", then any number of non-quotes, then ").
By adding parentheses around the parts of the regex that match the field name and the unquoted and quoted values, we can extract the relevant parts with the findall method and put them into a dictionary using a comprehension:
import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }
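To get from there to the .csv file mentioned in the question, here is a minimal sketch (the log_lines list and the output filename are assumptions, not part of the original code):

import csv

# log_lines is assumed to be a list of raw log strings like the one above.
rows = [parse_fields(line) for line in log_lines]

with open('logs.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)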
Same as kaya3, but I don't keep the quotes
s = '''date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic"
subtype="forward" level="notice" vd="root" eventtime=1572127141
srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan"
dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan"
poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6
action="accept" policyid=4 policytype="policy" service="HTTPS"
dstcountry="United States" srccountry="Reserved" trandisp="snat"
transip=168.168.140.247 transport=294 appid=537 app="Google.Ads"
appcat="General.Interest" apprisk="elevated" applist="Seniors"
appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19
rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux"
devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d"
srcserver=0'''
import re
matches = re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]+)"|(\S+))', s)
d = {
    name: quoted or unquoted
    for name, quoted, unquoted in matches
}
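For example, to confirm that the field the question singles out is captured (a quick check against the dictionary built above; the value comes from the sample string s):

print(d['srcip'])  # 192.168.6.15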
I am using the Aspose library for Word mail merge.
In the previous version of Aspose, if we supplied whitespace for a field, the merge did not consider it empty. After upgrading to the latest version, whitespace is treated as blank, and those fields are removed if the cleanup setting is on.
In my case, I want to keep a few fields even when their values are whitespace or empty, but remove the rest.
I tried to find a setting that can be applied at the field level to keep or remove empty fields, but haven't found one.
Is there any way I can achieve this?
If a paragraph contains only whitespace, it is considered empty and is removed. So, for example, if you use code like the following:
string[] fieldNames = new string[] { "FirstName", "MidName", "LastName" };
string[] fieldValues = new string[] { "Alexey", " ", "Noskov" };
Document doc = new Document(@"C:\Temp\in.docx");
doc.MailMerge.CleanupOptions = MailMergeCleanupOptions.RemoveEmptyParagraphs;
doc.MailMerge.Execute(fieldNames, fieldValues);
doc.Save(@"C:\Temp\out.docx");
where the MidName merge field is placed in a separate paragraph, that paragraph will be removed as empty.
However, you can work around this behavior using IFieldMergingCallback. For example, you can put hidden text at the merge field so the paragraph is not considered empty. See the following code:
string[] fieldNames = new string[] { "FirstName", "MidName", "LastName" };
string[] fieldValues = new string[] { "Alexey", " ", "Noskov" };
Document doc = new Document(@"C:\Temp\in.docx");
doc.MailMerge.FieldMergingCallback = new MergeWhitespaceCallback("MidName");
doc.MailMerge.CleanupOptions = MailMergeCleanupOptions.RemoveEmptyParagraphs;
doc.MailMerge.Execute(fieldNames, fieldValues);
doc.Save(@"C:\Temp\out.docx");
private class MergeWhitespaceCallback : IFieldMergingCallback
{
    private readonly string[] mRetainParagraphsWithFields;

    public MergeWhitespaceCallback(params string[] retainParagraphsWithFields)
    {
        mRetainParagraphsWithFields = retainParagraphsWithFields;
    }

    public void FieldMerging(FieldMergingArgs args)
    {
        // Ignore fields whose values are not empty or whitespace-only.
        if (!string.IsNullOrEmpty(args.FieldValue.ToString().Trim()))
            return;

        // Only act on the fields whose paragraphs should be retained.
        if (!mRetainParagraphsWithFields.Contains(args.FieldName))
            return;

        // Insert hidden text so the paragraph is no longer considered empty.
        DocumentBuilder builder = new DocumentBuilder(args.Document);
        builder.MoveTo(args.Field.Start);
        builder.Font.Hidden = true;
        builder.Write("<empty paragraph>");
    }

    public void ImageFieldMerging(ImageFieldMergingArgs args)
    {
        // Do nothing.
    }
}
Later, you can remove the hidden text if required.
Assuming you're actually executing a mailmerge (not just overwriting mergefields), you should be able to control most, if not all, of that via mailmerge field coding in the mailmerge main document.
On PCs, you can use the mergefield \b and/or \f switches to suppress a space before or after an empty mergefield. For example, suppose you have:
«Title» «FirstName» «SecondName» «LastName»
but «SecondName» is sometimes empty and you don’t want that to leave two spaces in the output. To deal with that:
select the «SecondName» field and press Shift-F9 so that you get
{MERGEFIELD SecondName};
edit the field code so that you end up with:
{MERGEFIELD SecondName \b " "} or
{MERGEFIELD SecondName \f " "}
depending on whether the space to be suppressed is following or before the mergefield;
delete, as appropriate, the corresponding space following or before
the mergefield;
position the cursor anywhere in this field and press F9 to update it.
Note 1: the \b and \f switches don't work on Macs or in conjunction with other switches. In such cases you need to use an IF test instead, coded along the lines of:
{IF{MERGEFIELD SecondName}<> "" " {MERGEFIELD SecondName}"} or
{IF{MERGEFIELD SecondName}<> "" "{MERGEFIELD SecondName} "}
Even so, you can use the \b and \f switches to express other mergefields that do have switches of their own. For example, suppose you have four fields ‘Product’, ‘Supplier’, ‘Quantity’ and ‘UnitPrice’, and you don’t want to output the ‘Product’, ‘Quantity’ or ‘UnitPrice’ fields if the ‘Supplier’ field is empty. In that case, you might use a field coded along the lines of:
{MERGEFIELD "Supplier" \b "{MERGEFIELD Product}→" \f "→{MERGEFIELD Quantity \# 0}→{MERGEFIELD UnitPrice \# "$0.00"}¶
"}
Note 2: The field brace pairs (i.e. '{ }') for the above example are all created in the document itself, via Ctrl-F9 (Cmd-F9 on a Mac or, if you’re using a laptop, you might need to use Ctrl-Fn-F9); you can't simply type them or copy & paste them from this message. Nor is it practical to add them via any of the standard Word dialogues. Likewise, the chevrons (i.e. '« »') are part of the actual mergefields - which you can insert from the 'Insert Merge Field' dropdown (i.e. you can't type or copy & paste them from this message, either). The spaces represented in the field constructions are all required. Instead of the →, ↵ and ¶ symbols, you should use real tabs and line/paragraph breaks, respectively.
For more Mailmerge Tips & Tricks, see: https://www.msofficeforums.com/mail-merge/21803-mailmerge-tips-tricks.html
Hi, I'm trying to parse a single-line log using fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
This should parse into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and makes me write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT1: Heya! I'll make this more detailed. I'm currently tailing logs with fluentd and sending them to Elasticsearch+Kibana. Here is an unparsed example log that fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch received message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only a message key, so I can't index it or create a dashboard using the whole message field alone. What I'm trying to achieve is to capture only the useful fields, add a key to any value that has no key, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse it, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS",
"status": "0",
"dummyfield": "13995",
"UID": "#HTFAA",
"d1": "156",
"d2": "156",
"msgcode": "msg0210",
"dummyfield1": "0000000000000000",
"dummyfield2": "002000",
"dummyfield3": "2000000",
"dummyfield4": "00",
"dummyfield5": "2000000",
"dummyfield6": "867202"
This only applies to the example log and has useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showcased in the expected output. The non-useful fields are not static (they can be absent, or have fewer or more digits and characters), so they trigger the pattern-not-matched error.
So the questions are:
How do I capture the useful fields I mentioned using regex?
How do I capture the F1, F2, F3, ... fields that have different value patterns, like mixed characters and strings?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capturing fields don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after; even if it's F105, it still works. The whole 'F105' will be stored as the first group of each regex match.
The right part of the pattern will capture the digits following ':' up until any character that is not a digit (i.e. ',', 'F', etc.), and will store them as the second group of each match.
Use
Depending on your coding language, you will have to access your regex matches with an iterator and extract group 1 and group 2 respectively.
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = '(F[\d]+):([\d]+)'
matches = re.finditer(pattern,log)
log_dict = {}
for match in matches:
log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate will be static (pattern-wise), you can skip the useless values with ".+" and collect the useful values by their patterns. So the regex will be like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm still working on getting the F2, F3, ..., FN keys and values.
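In the meantime, here is a minimal Python sketch of the idea, outside fluentd, purely to illustrate the combined extraction (the line variable, the (?P<...>) group syntax Python requires, and the lookahead-based value handling are assumptions, not fluentd configuration):

import re

line = '21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings'

# Header fields, reusing the named-group pattern above (Python spells groups (?P<name>...)).
header = re.search(
    r'(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})'
    r'.+?(?P<source>(?:fr|to)[A-Z]{3,4})'
    r'.+?(?P<UID>#[A-Z0-9]{5})'
    r'.+?(?P<statuscode>msg\d{4})', line)
record = header.groupdict() if header else {}

# F-fields: allow any value (digits or text) up to the next ",F<digits>:" key or end of line.
for key, value in re.findall(r'(F\d+):(.*?)(?=,F\d+:|$)', line):
    record[key] = value

print(record)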
I am trying to add entities defined by regular expressions to SpaCy's NER pipeline. Ideally, I should be able to use any regular expression loaded from a json file with a defined entity type. As an example, I am trying to execute the code below.
The code below shows what I am trying to do, following an example given on Spacy's discussion about custom attributes using regular expressions. I have tried calling the 'set_extension' method in various ways (to Doc, Span, Token), but to no avail. I'm not even sure what I should be setting them to.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
pattern = [{"_": {"country": {"REGEX": "^[Uu](\.?|nited) ?[Ss](\.|tates)$"}}}]
matcher.add("US", None, pattern)

doc = nlp(u"I'm from the United States.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
I expect match_id, string_id 3 4 United States to be printed out.
Instead, I am getting AttributeError: [E046] Can't retrieve unregistered extension attribute 'country'. Did you forget to call the 'set_extension' method?
There's documentation around the extension attributes here: https://spacy.io/usage/processing-pipelines#custom-components-attributes
Basically you'll have to define this country variable as an extension attribute, something like this:
Token.set_extension("country", default="")
However, in the code you cited you're never actually setting the _.country attribute to any token (or span), so they're all still at default value, and the matcher will never be able to get a match on them. The line you cited:
pattern = [{"_": {"country": {"REGEX": "^[Uu](\.?|nited) ?[Ss](\.?|tates)$"}}}]
Tries to match the United States regex on the custom attribute values, instead of on the doc text, as you expect (I think).
One solution is just to run the reg-exps on the texts directly:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}}]
matcher.add("US", None, pattern)

doc = nlp(u"I'm from the United States.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
Which outputs
15397641858402276818 US 4 6 United States
Then you can use those matches to, e.g., set a custom attribute on the Spans or Tokens (in this case Span, because your match potentially involves multiple tokens).
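A minimal sketch of that last step, reusing the "country" attribute name from the question (this assumes the matches produced by the code above):

from spacy.tokens import Span

# Register the extension attribute once, before assigning to it.
Span.set_extension("country", default=None)

for match_id, start, end in matches:
    span = doc[start:end]
    span._.country = span.text  # e.g. "United States"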
We have a file that contains data that we want to match to a case class. I know enough to brute-force it, but I'm looking for an idiomatic way in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type such as Stream:
val records: Stream[Record]
The couple of ideas I'm working with but so far haven't implemented are:
Remove all newlines and treat the whole file as one long string. Then match on the string with "((?!name).)+((?!age).)+age:([\s\d]+)" and create a new object of my case class for each match, but so far my regex-fu is low and I can't match around comments.
Recursive idea: iterate through each line to find the first line that matches record, then recursively call the function to match name, then age. Tail-recursively return Some(new Record(cumulativeMap.get(name), cumulativeMap.get(age))) or None when hitting the next record after name (i.e. age was never encountered).
?? Better Idea?
Thanks for reading! The file is more complicated than the above, but all rules are equal. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._
case class Record(name:String, age:Int)
implicit val decoder = MatchDecoder.ordered(Record.apply _)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write the decoder, but one can often be derived automatically. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._
case class Record(name:String, age:Int)
input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
Disclaimer: I'm the library's author.
You could use Parser Combinators.
If you have the file format specification in BNF or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex based parsers. It's certainly more "Scala".
I don't have much experience in Scala, but could these regexes work:
You could use (?<=name:).* to match the name value, and (?<=age:).* to match the age value. If you use these, trim the spaces in the found matches; otherwise name: bob will match bob with a leading space, which you might not want.
If name: or any other tag is in a comment, or a comment comes after the value, something unwanted will be matched. Please leave a comment if you want to avoid that.
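For example, a quick check of those patterns in Python, just to illustrate the regex behavior (the sample text and the strip() cleanup are assumptions):

import re

text = "name: Smith Holy\nage: 33"

# The lookbehind keeps only what follows the label; strip() removes the leading space.
name = re.search(r"(?<=name:).*", text).group().strip()
age = int(re.search(r"(?<=age:).*", text).group().strip())

print(name, age)  # Smith Holy 33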
You could try this:
import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val file = Paths.get("file.txt")
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala.toList

val records = lines.filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case List(a, b) => Record(a.replaceAll("name:", "").trim,
                              b.replaceAll("age:", "").trim.toInt)
  }
I have been trying to search for an item which is there in a text file.
The text file is like
Eg:
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ
So I did a dictionary search for XYZ initially and found the positions, but I want only the 1st XYZ and not the rest. There is a property of XYZ: it will always be between the 5-digit code and the text MethodName.
I am unable to do that.
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also how do we use REGEX in UIMA RUTA?
There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # #type{-> UNMARK(type)}
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} #Method;
There are two options to use regex in UIMA Ruta:
(find) simple regex rules like "[A-Za-z]+" -> Type;
(matches) REGEXP conditions for validating the match of a rule element like
ANY{REGEXP("[A-Za-z]+")-> Type};
Let me know if something is not clear. I will extend the description then.
DISCLAIMER: I am a developer of UIMA Ruta