Dynamic context-free grammar with NLTK - python-2.7

I'm trying to generate sentences with an NLTK CFG and would like to know if it is possible to connect an SQL database to feed the nouns and verbs in the program below.
In the example below, door, window, open and close are hardcoded. How can I dynamically ask NLTK to look at, say, an Excel or database column to feed the nouns and verbs in this particular context?
import nltk
from nltk.parse.generate import generate, demo_grammar
from nltk import CFG

grammar = CFG.fromstring("""
S -> VP NP
NP -> Det N
VP -> V
Det -> 'the '
N -> 'door' | 'window'
V -> 'Open' | 'Close'
""")
print(grammar)

for sentence in generate(grammar, n=100):
    print(' '.join(sentence))

It seems that you can't dynamically change an NLTK CFG – once it is instantiated, it stays put. You need to define all of the vocabulary immediately when constructing the CFG.
As far as I can see, you have two options to include comprehensive vocabulary from an external resource:
1. Build up a grammar string as in the example you posted, and use CFG.fromstring() to parse it. You might have to take care of some escaping issues (e.g. quotes/apostrophes in the terminal symbols).
2. Use the CFG constructor directly, providing it with a list of productions, e.g.:
from nltk import CFG, Production, Nonterminal

prods = [Production(Nonterminal('S'), (Nonterminal('PN'), Nonterminal('V'))),
         Production(Nonterminal('PN'), ('Sam',)),
         Production(Nonterminal('PN'), ('Fred',)),
         Production(Nonterminal('V'), ('sleeps',))]
g = CFG(Nonterminal('S'), prods)
This looks somewhat verbose, but it's probably easier and faster to construct this nested structure of Python datatypes rather than writing a bug-free serialiser for the (more concise) grammar string format.
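To tie this back to the original question, here is a minimal sketch of option 2 fed from a database. It assumes a SQLite file vocab.db with a table vocabulary(word, pos) where pos is either 'N' or 'V'; the file name and the table/column names are hypothetical, so adapt the query to whatever database you actually use.

import sqlite3
from nltk import CFG, Production, Nonterminal
from nltk.parse.generate import generate

# Hypothetical schema: vocabulary(word TEXT, pos TEXT), with pos in ('N', 'V')
conn = sqlite3.connect('vocab.db')
rows = conn.execute("SELECT word, pos FROM vocabulary").fetchall()
conn.close()

S, NP, VP, Det, N, V = [Nonterminal(x) for x in ('S', 'NP', 'VP', 'Det', 'N', 'V')]
prods = [Production(S, (VP, NP)),
         Production(NP, (Det, N)),
         Production(VP, (V,)),
         Production(Det, ('the',))]
# One lexical production per database row
prods += [Production(Nonterminal(pos), (word,)) for word, pos in rows]

grammar = CFG(S, prods)
for sentence in generate(grammar, n=100):
    print(' '.join(sentence))

For an Excel sheet, the same approach works if you produce the (word, pos) pairs with a reader such as openpyxl or pandas.read_excel() instead of the SQL query.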

Why is this ATL helper wrong?

I'm new to ATL and OCL and I'm trying to transform this metamodel:
[source metamodel diagram]
into this one:
[target metamodel diagram]
The helper is meant to take all the tests created by the user admin and then sum the ids of the Actions of those tests.
I've written this helper:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act.id.toInteger())->sum();
But when I run the transformation I get this error:
org.eclipse.m2m.atl.engine.emfvm.VMException: Collections do not have properties, use ->collect()
This error is in the collect(n | n.act.id.toInteger()) part of the helper.
The rest of my code is this:
rule Testset2Testcase {
    from s: Test!Test
    to r: Testcase!Testcase(
        ident <- thisModule.actionId.toString(),
        date <- s.md.date,
        act <- thisModule.resolveTemp(s.act, 'a')
    )
    do {
        'Bukatuta'.println();
    }
}

rule Action2Activity {
    from s: Test!Action
    to a: Testcase!Activity(
        ident <- s.id
    )
}
Sorry for my bad English.
My teacher helped me with this.
The problem was in the helper. Doing this:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act.id.toInteger())->sum();
I was trying to take the id of a collection of collections of type Action, instead of taking the id of each object.
With that helper I was getting a collection of collections, so by using flatten() that collection of collections becomes a collection of Actions.
The helper written correctly looks like this:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act)->flatten()->collect(x | x.id.toInteger())->sum();
Your expression looks plausible, but without your metamodel it is difficult to see where ATL is unhappy about use of a Collection property. If Test::md is a collection, the expression would just be stupid, though not for the reason given.
If ATL's hovertext doesn't help you understand your types, you might enter the same expression into the OCL Xtext Console and carefully hover over "." and "md" to get its accurate type analysis.
But beware, ATL has an independently developed embedded OCL that is not as rich as Eclipse OCL. Perhaps your expression is too complex for ATL; try breaking it up with let expressions.

Python NLTK Chunking

Using NLTK, I would like to write a tag pattern to handle something like noun phrases with gerunds and/or coordinated nouns. After importing the essential libraries, I tokenize my candidate text as follows:
sentences=nltk.word_tokenize('......')
It contains several sentences.
Then I tag it by:
sentences=nltk.pos_tag(sentences)
I also defined my proposed grammar as:
grammar = r"""
  Gerunds: {<DT>?<NN>?<VBG><NN>}
  Coordinated noun: {<NNP><CC><NNP>|<DT><PRP\$><NNS><CC><NNS>|<NN><NNS><CC><NNS>}
"""
Then, I employ:
cp = nltk.RegexpParser(grammar)
for sent in sentences:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'Gerunds':
            print(subtree)
print(cp.parse(sentences))
It says: ValueError: chunk structures must contain tagged tokens or trees
How should I tackle this problem?
I did:
from nltk import word_tokenize, pos_tag
Then, instead of using tree = cp.parse(sent) and print(cp.parse(sentences)), I used:
tree = cp.parse(pos_tag(word_tokenize(sentences)))
and
print(cp.parse(pos_tag(word_tokenize(sentences))))
It worked like a charm! :-)
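For completeness, here is a minimal end-to-end sketch of that corrected flow. The sample sentence is made up, and whether the Gerunds chunk actually fires depends on how pos_tag labels the gerund:

import nltk
from nltk import word_tokenize, pos_tag

grammar = r"""
  Gerunds: {<DT>?<NN>?<VBG><NN>}
"""
cp = nltk.RegexpParser(grammar)

text = "The working group is closing the meeting."   # made-up example text
tree = cp.parse(pos_tag(word_tokenize(text)))        # parse tagged tokens, not raw strings
for subtree in tree.subtrees():
    if subtree.label() == 'Gerunds':
        print(subtree)

The key point is that RegexpParser.parse() expects a list of (word, tag) tuples, which is why feeding it untagged strings raises the ValueError above.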

Regular Expressions to Update a Text File in Python

I'm trying to write a script to update a text file by replacing instances of certain characters (e.g. 'a', 'w') with a word (e.g. 'airplane', 'worm').
If a single line of the text was something like this:
a.function(); a.CallMethod(w); E.aa(w);
I'd want it to become this:
airplane.function(); airplane.CallMethod(worm); E.aa(worm);
The difference is subtle but important: I'm only changing 'a' and 'w' where they're used as variables, not as just another character in some other word. And there are many lines like this in the file. Here's what I've done so far:
import re

original = open('original.js', 'r')
modified = open('modified.js', 'w')

# iterate through each line of the file
for line in original:
    # Search for the character 'a' when not part of a word of some sort
    line = re.sub(r'\W(a)\W', 'airplane', line)
    modified.write(line)

original.close()
modified.close()
I think my RE pattern is wrong, and I think I'm using the re.sub() method incorrectly as well. Any help would be greatly appreciated.
If you're concerned about the semantic meaning of the text you're changing with a regular expression, then you'd likely be better served by parsing it instead. Luckily, Python has two good modules to help you with parsing Python: look at the Abstract Syntax Tree (ast) and Parser modules. There are probably others for JavaScript if that's what you're doing, like slimit.
For future reference on regular expression questions, there's a lot of helpful information here:
https://stackoverflow.com/tags/regex/info
Reference - What does this regex mean?
And it took me 30 minutes from never having used this JavaScript parser in Python (replete with installation issues: please note the right ply version) to writing a basic solution given your example. You can too.
# Note: sudo pip3 install ply==3.4 && sudo pip3 install slimit
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = 'a.funktion(); a.CallMethod(w); E.aa(w);'
tree = Parser().parse(data)

for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Identifier):
        if node.value == 'a':
            node.value = 'airplaine'
        elif node.value == 'w':
            node.value = 'worm'

print(tree.to_ecma())
It runs to give this output:
$ python3 src/python_renames_js_test.py
airplaine.funktion();
airplaine.CallMethod(worm);
E.aa(worm);
Caveats:
function is a reserved word, I used funktion
the to_ecma method pretty prints; there is likely another way to output it closer to the original input.
line = re.sub(r'\ba\b', 'airplane', line)
should get you closer. However, note that you would also turn a.CallMethod("That is a house") into airplane.CallMethod("That is airplane house"), and open("file.txt", "a") into open("file.txt", "airplane"). Getting it right in a complex syntax environment using regular expressions is hard to impossible.
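As a minimal sketch of that word-boundary approach applied to the original script (with the string-literal caveat above still applying):

import re

# Hypothetical mapping of variable names to replacement words
replacements = {'a': 'airplane', 'w': 'worm'}

with open('original.js') as original, open('modified.js', 'w') as modified:
    for line in original:
        for short, full in replacements.items():
            # \b only matches at word boundaries, so 'a' inside 'aa' or 'airplane' is untouched
            line = re.sub(r'\b' + re.escape(short) + r'\b', full, line)
        modified.write(line)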

Using Regex in Pig in hadoop

I have a CSV file containing user tweets (tweetid, tweet, userid):
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression here work to split the line in the desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet id, I do not get the desired output:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
I can't comment, but from looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV:
" in the CSV
” in the regex code
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*?),"',1)) AS (tweetid:long);
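To see what the capture groups do, it can help to try the same idea outside Pig. Here is a quick Python check using a simplified pattern with the straight quote that actually appears in the CSV; this is only an illustration of the splitting, not the Pig code itself:

import re

line = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'
# Group 1: everything before the first ,"   Group 2: the quoted tweet   Group 3: everything after ",
m = re.match(r'(.*?),"(.*)",(.*)', line)
if m:
    tweetid, msg, userid = m.groups()
    print(tweetid, '|', msg, '|', userid)

The lazy (.*?) stops at the first ," so group 1 is the tweet id; the same idea carries over to the REGEX_EXTRACT call above.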

Is there a way to filter a django queryset based on string similarity (a la python difflib)?

I have a need to match cold leads against a database of our clients.
The leads come from a third-party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client.
Obviously, there are misspellings in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't really just do a filter comparing lead_first_name to client_first_name, etc.
I need to use some sort of string similarity mechanism.
Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.
I know that most sql databases have soundex and difference functions. See my test of it in the update below - it doesn't work as well as difflib.
Is there another solution? Is there a better solution?
Edit:
Soundex, at least in my db, doesn't behave as well as difflib.
Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":
with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
from temp
where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
order by difference(first_name+' '+last_name, 'Joe Lopes')
The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.
But difflib does a much better job:
difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']
Edit after gruszczy's response:
Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.
In testing it, it still won't do a better matching job than difflib.
Which led me to research what algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.
Unhappily I can't seem to find some other kind soul who has already created a T-SQL implementation based on difflib's... I'll try my hand at it when I can.
If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.
soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.
You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe it's available in your database too; if not, you should be able to write a stored procedure that calculates the distance between two strings and use it in your computation.
It's possible with trigram_similar lookups since Django 1.10; see the docs for PostgreSQL-specific lookups and full-text search.
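A minimal sketch of that lookup, assuming PostgreSQL with the pg_trgm extension enabled, 'django.contrib.postgres' in INSTALLED_APPS, and a Client model with a first_name field (the field name and import path are guesses based on the question):

# One-time setup in the database: CREATE EXTENSION pg_trgm;
from myapp.models import Client  # hypothetical import path

Client.objects.filter(first_name__trigram_similar='Joe')

This filters in the database rather than pulling every row into Python; whether 'Joseph' matches 'Joe' depends on the similarity threshold configured in Postgres.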
As per andilabs' answer, you can use the Levenshtein function to create your own custom function. The Postgres docs indicate that the levenshtein function has the following signatures:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
levenshtein(text source, text target) returns int
andilabs' answer uses only the second form. If you want a more advanced search with insertion/deletion/substitution costs, you can rewrite the function like this:
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )
And call the function:
from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
    lev_dist__lte=2
)
If you need to get there with Django and Postgres and don't want to use the trigram similarity introduced in 1.10 (https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity), you can implement it using Levenshtein like this:
You need the fuzzystrmatch extension; add it to your database in psql:
CREATE EXTENSION fuzzystrmatch;
Let's define a custom function with which we can annotate the queryset. It takes just one argument, the search_term, and uses the Postgres levenshtein function (see docs):
from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )
Then, anywhere else in the project, we just import the defined Levenshtein and F to pass the Django field:
from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
    lev_dist__lte=2
)