Adding REGEX entities to SpaCy's Matcher - regex

I am trying to add entities defined by regular expressions to spaCy's NER pipeline. Ideally, I should be able to use any regular expression loaded from a JSON file, together with a defined entity type. The code below shows what I am trying to do, following an example from spaCy's discussion of custom attributes using regular expressions. I have tried calling set_extension in various ways (on Doc, Span, and Token), but to no avail; I'm not even sure what I should be setting it to.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
pattern = [{"_": {"country": {"REGEX": "^[Uu](\.?|nited) ?[Ss](\.?|tates)$"}}}]
matcher.add("US", None, pattern)
doc = nlp(u"I'm from the United States.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
I expect match_id, string_id 3 4 United States to be printed out.
Instead, I am getting AttributeError: [E046] Can't retrieve unregistered extension attribute 'country'. Did you forget to call the 'set_extension' method?

There's documentation around the extension attributes here: https://spacy.io/usage/processing-pipelines#custom-components-attributes
Basically you'll have to define this country variable as an extension attribute, something like this:
Token.set_extension("country", default="")
However, in the code you cited you never actually set the _.country attribute on any token (or span), so they are all still at the default value, and the matcher will never find a match on them. The line you cited:
pattern = [{"_": {"country": {"REGEX": "^[Uu](\.?|nited) ?[Ss](\.?|tates)$"}}}]
tries to match the United States regex against the custom attribute values, instead of against the doc text, as you (I think) expect.
One solution is just to run the regexes on the text directly:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}}]
matcher.add("US", None, pattern)
doc = nlp(u"I'm from the United States.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
Which outputs
15397641858402276818 US 4 6 United States
Then you can use those matches to, e.g., set a custom attribute on the Spans or Tokens (in this case Spans, because your match potentially involves multiple tokens).
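To make that last step concrete, here is a minimal sketch that registers a Span extension and flags the matched spans. It uses a blank pipeline (no model download needed, since only the text is matched), and note the matcher.add signature here is the spaCy v3 one, whereas the code above uses the older v2 signature:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # a blank pipeline is enough for text-level regex matching
Span.set_extension("is_country", default=False)

matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": r"^[Uu](\.?|nited)$"}},
           {"TEXT": {"REGEX": r"^[Ss](\.?|tates)$"}}]
matcher.add("US", [pattern])  # spaCy v3 signature

doc = nlp("I'm from the United States.")
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in spans:
    span._.is_country = True  # now the attribute is actually set on a Span
    print(span.text, span._.is_country)
```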

Related

How to generate regex patterns in python using re.compile

I am trying to create a python code that will be able to extract the information from strings such as the one below, using regular expressions.
date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic" subtype="forward" level="notice" vd="root" eventtime=1572127141 srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan" dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan" poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6 action="accept" policyid=4 policytype="policy" service="HTTPS" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=168.168.140.247 transport=294 appid=537 app="Google.Ads" appcat="General.Interest" apprisk="elevated" applist="Seniors" appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19 rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux" devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d" srcserver=0
I found someone's code on GitHub and he uses the lines below to extract the information; however, his code doesn't extract all of the fields I require, most notably srcip=192.168.1.105.
I don't want to post the guy's entire code as it's not mine. However, if it is required I can.
I am hoping all the fields will be extracted from the jumble of information so I can save them as a .csv file.
The regex \w+=([^\s"]+|"[^"]*") matches:
- the field name (at least one word character), then
- an = sign, then
- either:
  - an unquoted field value (at least one character, excluding whitespace and quotes), or
  - a quoted field value (", then any number of non-quotes, then ").
By adding parentheses around the parts of the regex which match the field name and the unquoted and quoted values, we can extract the relevant parts and put them into a dictionary with a comprehension, using the findall method:
import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }
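For example (re-stating the pattern and function so the snippet is self-contained), applied to a made-up fragment of such a log line:

```python
import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }

# A short fragment in the same key=value format as the log line above.
fields = parse_fields('srcport=522 srcintf="port1" dstcountry="United States"')
print(fields)  # {'srcport': '522', 'srcintf': 'port1', 'dstcountry': 'United States'}
```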
Same as kaya3, but I don't keep the quotes
s = '''date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic"
subtype="forward" level="notice" vd="root" eventtime=1572127141
srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan"
dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan"
poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6
action="accept" policyid=4 policytype="policy" service="HTTPS"
dstcountry="United States" srccountry="Reserved" trandisp="snat"
transip=168.168.140.247 transport=294 appid=537 app="Google.Ads"
appcat="General.Interest" apprisk="elevated" applist="Seniors"
appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19
rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux"
devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d"
srcserver=0'''
import re

matches = re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]+)"|(\S+))', s)
d = {
    name: quoted or unquoted
    for name, quoted, unquoted in matches
}

Matching PoS tags with specific text with `textacy.extract.pos_regex_matches(...)`

I'm using textacy's pos_regex_matches method to find certain chunks of text in sentences.
For instance, assuming I have the text: Huey, Dewey, and Louie are triplet cartoon characters., I'd like to detect that Huey, Dewey, and Louie is an enumeration.
To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):
import textacy

sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for list in lists:
    print(list.text)
which prints:
Huey, Dewey, and Louie
However, if I have something like the following:
sentence = 'Donald Duck - Disney'
then the - (dash) is recognised as <PUNCT> and the whole sentence is recognised as a list -- which it isn't.
Is there a way to specify that only , and ; are valid <PUNCT> for lists?
I've looked for some reference about this regex language for matching PoS tags with no luck, can anybody help? Thanks in advance!
PS: I tried to replace <PUNCT|CCONJ> with <[;,]|CCONJ>, <;,|CCONJ>, <[;,]|CCONJ>, <PUNCT[;,]|CCONJ>, <;|,|CCONJ> and <';'|','|CCONJ> as suggested in the comments, but it didn't work...
In short, it is not possible: see this official page.
However, the merge request contains the code of the modified version described on that page, so one can recreate the functionality, although it performs worse than spaCy's Matcher (see code and example; though I have no idea how to reimplement my problem using a Matcher).
If you want to go down this lane anyway, you have to change the line:
words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))
with the following:
words.extend(keyword_map[w])
otherwise every symbol (like , and ; in my case) will be stripped off.
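For what it's worth, a rough spaCy Matcher equivalent can restrict the punctuation with the IN operator. This is only a sketch: the POS tags are assigned by hand below so the example runs without a model (with en_core_web_lg the tagger would supply them), and since the Matcher cannot repeat a whole group, it matches short runs of proper nouns rather than arbitrarily long enumerations:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Huey , Dewey , and Louie are characters .")
# Hand-assigned tags, standing in for a real tagger.
for tok, pos in zip(doc, ["PROPN", "PUNCT", "PROPN", "PUNCT",
                          "CCONJ", "PROPN", "VERB", "NOUN", "PUNCT"]):
    tok.pos_ = pos

matcher = Matcher(nlp.vocab)
# A PROPN, optionally followed by "," or ";" (but no other PUNCT),
# optionally a coordinating conjunction, then another PROPN.
pattern = [{"POS": "PROPN"},
           {"TEXT": {"IN": [",", ";"]}, "OP": "?"},
           {"POS": "CCONJ", "OP": "?"},
           {"POS": "PROPN"}]
matcher.add("ENUM_PAIR", [pattern])  # "ENUM_PAIR" is just an illustrative key

for _, start, end in matcher(doc):
    print(doc[start:end].text)
```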

Search for an item in a text file using UIMA Ruta

I have been trying to search for an item which is there in a text file.
The text file is like
Eg: `
>HEADING
00345
XYZ
MethodName : fdsafk
Date: 23-4-2012
More text and some part containing instances of XYZ`
So I did a dictionary search for XYZ initially and found the positions, but I want only the first XYZ, not the rest. One property of XYZ is that it will always be between the 5-digit code and the text MethodName.
I am unable to do that.
WORDLIST ZipList = 'Zipcode.txt';
DECLARE Zip;
Document{-> MARKFAST(Zip, ZipList)};
DECLARE Method;
"MethodName" -> Method;
WORDLIST typelist = 'typelist.txt';
DECLARE type;
Document{-> MARKFAST(type, typelist)};
Also how do we use REGEX in UIMA RUTA?
There are many ways to specify this. Here are some examples (not tested):
// just remove the other annotations (assuming type is the one you want)
type{-> UNMARK(type)} ANY{-STARTSWITH(Method)};
// only keep the first one: remove any annotation if there is one somewhere in front of it
// you can also specify this with POSITION or CURRENTCOUNT, but both are slow
type # #type{-> UNMARK(type)}
// just create a new annotation in between
NUM{REGEXP(".....")} #{-> type} #Method;
There are two options to use regex in UIMA Ruta:
(find) simple regex rules like "[A-Za-z]+" -> Type;
(matches) REGEXP conditions for validating the match of a rule element like
ANY{REGEXP("[A-Za-z]+")-> Type};
Let me know if something is not clear. I will extend the description then.
DISCLAIMER: I am a developer of UIMA Ruta

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. I'm not sure whether the regex is language-agnostic, but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has Twitter IDs, not the actual handles, so this regex seems to return a mention: (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
-- filter data so only tweets with mentions remain
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
-- try to pull out the mentions
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}', 3) as mention1,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}', 3) as mention2,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}', 3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)', 2) as mention1,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)', 5) as mention2,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)', 8) as mention3;
But, instead of returning each user mentioned, this returned the same mention three times. I had expected that by cutting and pasting the expression again I'd get the second match, and that pasting it a third time would get the third match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL UDF, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable :
Pattern p = Pattern.compile("#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
With this regex, if there is at least one mention, it will return three groups, the second and/or third being null if a second/third mention is not found.
Therefore, you may use the following PIG code :
d = foreach c generate id, REGEX_EXTRACT_ALL(
    tweet, '#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.
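Outside Pig, the "array of matched terms" idea from the question maps directly onto findall-style extraction. A quick Python sanity check with the question's #user_ pattern (the sample tweet text below is made up):

```python
import re

# Made-up tweet text in the question's "#user_XXXXXXXX" format.
tweet = "bla #user_abcd1234: bla #user_efgh5678 more #user_ijkl9012 text"

# findall returns every non-overlapping match, not just the first one;
# slice to keep at most three mentions per tweet.
mentions = re.findall(r'#user_(\S{8})', tweet)[:3]
print(mentions)  # ['abcd1234', 'efgh5678', 'ijkl9012']
```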

Setting a regex optional group to None in Scala

I have the following regular expression pattern that matches fully qualified Microsoft SQL Server table names ([dbName].[schemaName].[tableName]), where the schema name is optional:
val tableNamePattern = """\[(\w+)\](?:\.\[(\w+)\])?\.\[(\w+)\]""".r
I am using it like this:
val tableNamePattern(database, schema, tableName) = fullyQualifiedTableName
When the schema name is missing (e.g.: [dbName].[tableName]), the schema value gets set to null.
Is there a Scala idiomatic way to set it to None instead, and to Some(schema) when the schemaName is provided?
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
-- Jamie Zawinski
I'm going to copy the code from the accepted answer on the linked question, and without giving credit, too. Here it is:
object Optional {
  def unapply[T](a: T) = if (null == a) Some(None) else Some(Some(a))
}

val tableNamePattern(database, Optional(schema), tableName) = fullyQualifiedTableName
PS: I just today wondered on twitter whether creating special-case extractors was as common as they were suggested. :)