Python NLTK Chunking - python-2.7

Using NLTK, I would like to write a tag pattern to handle noun phrases with gerunds and/or coordinated nouns. After importing the essential libraries, I tokenize my candidate text as follows:
sentences=nltk.word_tokenize('......')
It contains several sentences.
Then I tag it with:
sentences=nltk.pos_tag(sentences)
I also defined my proposed grammar as:
grammar = r"""
  Gerunds: {<DT>?<NN>?<VBG><NN>}
  Coordinated noun: {<NNP><CC><NNP>|<DT><PRP\$><NNS><CC><NNS>|<NN><NNS><CC><NNS>}
"""
Then, I employ:
cp = nltk.RegexpParser(grammar)
for sent in sentences:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'Gerunds':
            print(subtree)
print(cp.parse(sentences))
This raises: ValueError: chunk structures must contain tagged tokens or trees
How should I tackle this problem?

I did:
from nltk import word_tokenize, pos_tag
The loop failed because sentences was already a flat list of (word, tag) tuples, so each sent was a single tuple rather than a tagged sentence. So, instead of using tree = cp.parse(sent) and print(cp.parse(sentences)), I utilized:
tree = cp.parse(pos_tag(word_tokenize(sentences)))
and
print(cp.parse(pos_tag(word_tokenize(sentences))))
It worked like a charm! :-)
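Putting the fix together, a minimal end-to-end sketch (the sample text here is my own, and the grammar is trimmed to the Gerunds rule):
import nltk
from nltk import word_tokenize, pos_tag

# Hypothetical sample text.
text = 'Opening the door takes time.'

grammar = r"""
  Gerunds: {<DT>?<NN>?<VBG><NN>}
"""
cp = nltk.RegexpParser(grammar)

# Parse one tagged token list, not individual (word, tag) tuples.
tree = cp.parse(pos_tag(word_tokenize(text)))
for subtree in tree.subtrees():
    if subtree.label() == 'Gerunds':
        print(subtree)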


Regular expression search and substitute values in CSV file

I want to find and replace all of the managerial positions in a CSV file with the number 3. The list contains different positions, from a simple ",Manager," to ",Construction Project Manager and Project Superintendent," but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The problem is that sometimes a comma is shared between two adjacent managerial positions, so I need to include the comma when finding a position but exclude it when replacing the position with 3. How can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in CSV module instead. Let's not reinvent the wheel here and consider handling CSV as a solved problem.
Here is some sample code that demonstrates how it can be done. The csv module is responsible for reading and writing the file with the correct delimiter and quotation character.
re.search is used to search individual cells/columns for your keyword. If manager is found, put a 3; otherwise, keep the original content and write the row back when done.
import csv, sys, re

infile = r'in.csv'
outfile = r'out.csv'

o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

with open(infile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    try:
        for row in reader:
            newrow = []
            for col in row:
                # Replace any cell that mentions "manager" (case-insensitive).
                if re.search("manager", col, re.I):
                    newrow.append("3")
                else:
                    newrow.append(col)
            csvwri.writerow(newrow)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))

o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)
Replace with 3, as shown in the demo.
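Since the demo itself isn't reproduced here, a small hypothetical check of the lookahead behaviour: the trailing comma is not consumed, which leaves it available for the next adjacent match.
import re

row = ',Manager,Construction Project Manager and Project Superintendent,'
pattern = r"""[,\s]?([A-Za-z. '\s/()"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)"""
# Each match stops just before the comma, so adjacent titles both match.
for m in re.finditer(pattern, row):
    print(m.group(0))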
However, I believe you are still better off with the csv lib approach.

Matching PoS tags with specific text with `textacy.extract.pos_regex_matches(...)`

I'm using textacy's pos_regex_matches method to find certain chunks of text in sentences.
For instance, assuming I have the text: Huey, Dewey, and Louie are triplet cartoon characters., I'd like to detect that Huey, Dewey, and Louie is an enumeration.
To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):
import textacy

sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for list in lists:
    print(list.text)
which prints:
Huey, Dewey, and Louie
However, if I have something like the following:
sentence = 'Donald Duck - Disney'
then the - (dash) is recognised as <PUNCT> and the whole sentence is recognised as a list -- which it isn't.
Is there a way to specify that only , and ; are valid <PUNCT> for lists?
I've looked for some reference on this regex language for matching PoS tags, with no luck; can anybody help? Thanks in advance!
PS: I tried to replace <PUNCT|CCONJ> with <[;,]|CCONJ>, <;,|CCONJ>, <[;,]|CCONJ>, <PUNCT[;,]|CCONJ>, <;|,|CCONJ> and <';'|','|CCONJ> as suggested in the comments, but it didn't work...
In short, it is not possible: see this official page.
However, the merge request contains the code of the modified version described on that page, so one can recreate the functionality, although it is less performant than using spaCy's Matcher (see code and example -- though I have no idea how to reimplement my problem using a Matcher).
If you want to go down this lane anyway, you have to replace the line:
words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))
with the following:
words.extend(keyword_map[w])
otherwise every symbol (like , and ; in my case) will be stripped off.
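For completeness, here is a rough sketch (my own, not from the answer) of how the enumeration pattern might look with spaCy's Matcher, restricting list punctuation to , and ;. It assumes a recent spaCy (v3) and the en_core_web_sm model rather than the 2017-era versions in the question. Note that the Matcher cannot repeat a sub-group, so this finds adjacent pairs rather than the whole enumeration; overlapping spans would need merging afterwards.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# One proper-noun run, a comma or semicolon, an optional conjunction,
# then another proper-noun run.
pattern = [
    {'POS': 'PROPN', 'OP': '+'},
    {'TEXT': {'IN': [',', ';']}},
    {'POS': 'CCONJ', 'OP': '?'},
    {'POS': 'PROPN', 'OP': '+'},
]
matcher.add('ENUMERATION', [pattern])

doc = nlp('Huey, Dewey, and Louie are triplet cartoon characters.')
for _, start, end in matcher(doc):
    print(doc[start:end].text)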

What's the best way to match strings in a file to case class in Scala?

We have a file that contains data that we want to match to a case class. I know enough to brute-force it, but I'm looking for an idiomatic way to do it in Scala.
Given File:
#record
name:John Doe
age: 34
#record
name: Smith Holy
age: 33
# some comment
#record
# another comment
name: Martin Fowler
age: 99
(field values on two lines are INVALID, e.g. name:John\n Smith should error)
And the case class
case class Record(name:String, age:Int)
I want to return a Seq type, such as a Stream:
val records: Stream[Record]
A couple of ideas I have been working with but haven't implemented so far:
Remove all newlines and treat the whole file as one long string. Then match on the string with "((?!name).)+((?!age).)+age:([\s\d]+)" and create a new instance of my case class for each match, but so far my regex foo is low and I can't match around the comments.
Recursive idea: iterate through each line to find the first line that matches record, then recursively call the function to match name, then age. Tail-recursively return Some(Record(cumulativeMap.get(name), cumulativeMap.get(age))), or None when hitting the next record after name (i.e. age was never encountered).
Any better ideas?
Thanks for reading! The file is more complicated than the above, but all the rules are the same. For the curious: I'm trying to parse a custom M3U playlist file format.
I'd use kantan.regex for a fairly trivial regex-based solution.
Without fancy shapeless derivation, you can write the following:
import kantan.regex._
import kantan.regex.implicits._

case class Record(name: String, age: Int)

implicit val decoder = MatchDecoder.ordered(Record.apply _)

input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
This yields:
List(Success(Record(John Doe,34)), Success(Record(Smith Holy,33)), Success(Record(Martin Fowler,99)))
Note that this solution requires you to hand-write decoder, but it can often be derived automatically. If you don't mind a shapeless dependency, you could simply write:
import kantan.regex._
import kantan.regex.implicits._
import kantan.regex.generic._

case class Record(name: String, age: Int)

input.evalRegex[Record](rx"(?:name:\s*([^\n]+))\n(?:age:\s*([0-9]+))").toList
And get the exact same result.
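In both snippets, input is assumed to be the whole file read into a single String; for example (my own sketch, file name hypothetical):
val input: String = scala.io.Source.fromFile("records.txt").mkString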
Disclaimer: I'm the library's author.
You could use Parser Combinators.
If you have the file format specification in BNF, or can write one, then Scala can create a parser for you from those rules. This may be more robust than hand-made regex-based parsers, and it's certainly more "Scala".
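As a rough illustration (my own sketch, not from the original answer; it assumes the scala-parser-combinators module and ignores some edge cases), the record format might be expressed as:
import scala.util.parsing.combinator.RegexParsers

case class Record(name: String, age: Int)

object RecordParser extends RegexParsers {
  // Only spaces/tabs are skipped automatically; newlines stay significant,
  // so a field value spilling onto a second line is rejected.
  override val whiteSpace = """[ \t]+""".r

  def eol = """\r?\n""".r
  def comment = """#(?!record)[^\r\n]*""".r ~ eol
  def name = "name:" ~> """[^\r\n]+""".r <~ eol
  def age = "age:" ~> """\d+""".r <~ opt(eol) ^^ (_.toInt)
  def record = "#record" ~ eol ~ rep(comment) ~> name ~ rep(comment) ~ age ^^ {
    case n ~ _ ~ a => Record(n.trim, a)
  }
  def file = rep(comment) ~> rep(record <~ rep(comment))

  def apply(s: String) = parseAll(file, s)
}
RecordParser(fileContents) then yields a ParseResult[List[Record]], with Failure pointing at the offending line when the input is malformed.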
I don't have much experience in Scala, but could these regexes work:
You could use (?<=name:).* to match the name value, and (?<=age:).* to match the age value. If you use this, trim the spaces in the found matches; otherwise name: bob will match bob with a leading space, which you might not want.
If name: or any other tag is in a comment, or a comment comes after a value, something will still be matched. Please leave a comment if you want to avoid that.
You could try this:
import java.nio.file.{Files, Paths}
import java.nio.charset.Charset
import scala.collection.JavaConverters._

val file = Paths.get("file.txt")
// Files.readAllLines returns a java.util.List, so convert it to a Scala collection.
val lines = Files.readAllLines(file, Charset.defaultCharset()).asScala

val records = lines.filter(s => s.startsWith("age:") || s.startsWith("name:"))
  .grouped(2).toList.map {
    case Seq(a, b) => Record(a.replaceAll("name:", "").trim,
                             b.replaceAll("age:", "").trim.toInt)
  }

Regular Expressions to Update a Text File in Python

I'm trying to write a script to update a text file by replacing instances of certain characters (e.g. 'a', 'w') with a word (e.g. 'airplane', 'worm').
If a single line of the text was something like this:
a.function(); a.CallMethod(w); E.aa(w);
I'd want it to become this:
airplane.function(); airplane.CallMethod(worm); E.aa(worm);
The difference is subtle but important: I'm only changing 'a' and 'w' where they're used as variables, not where they're just another character in some other word. And there are many lines like this in the file. Here's what I've done so far:
import re

original = open('original.js', 'r')
modified = open('modified.js', 'w')

# Iterate through each line of the file
for line in original:
    # Search for the character 'a' when not part of a word of some sort
    line = re.sub(r'\W(a)\W', 'airplane', line)
    modified.write(line)

original.close()
modified.close()
I think my RE pattern is wrong, and I think I'm using the re.sub() method incorrectly as well. Any help would be greatly appreciated.
If you're concerned about the semantic meaning of the text you're changing with a regular expression, then you'd likely be better served by parsing it instead. Luckily, Python has two good modules to help you with parsing: look at the Abstract Syntax Tree and the Parser modules. There are probably others for JavaScript, if that's what you're doing, like slimit.
For future reference on regular expression questions, there's a lot of helpful information here:
https://stackoverflow.com/tags/regex/info
Reference - What does this regex mean?
And it took me 30 minutes, from never having used this JavaScript parser in Python (replete with installation issues: please note the right ply version), to writing a basic solution given your example. You can too.
# Note: sudo pip3 install ply==3.4 && sudo pip3 install slimit
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = 'a.funktion(); a.CallMethod(w); E.aa(w);'
tree = Parser().parse(data)

# Walk the AST and rename only standalone identifiers.
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Identifier):
        if node.value == 'a':
            node.value = 'airplane'
        elif node.value == 'w':
            node.value = 'worm'

print(tree.to_ecma())
It runs to give this output:
$ python3 src/python_renames_js_test.py
airplane.funktion();
airplane.CallMethod(worm);
E.aa(worm);
Caveats:
function is a reserved word, so I used funktion
the to_ecma method pretty-prints; there is likely another way to output it closer to the original input.
line = re.sub(r'\ba\b', 'airplane', line)
should get you closer. However, note that you will also be turning a.CallMethod("That is a house") into airplane.CallMethod("That is airplane house"), and open("file.txt", "a") into open("file.txt", "airplane"). Getting it right in a complex syntax environment using RegExp is hard-to-impossible.
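For the simple line from the question, though, a quick sketch (my own) of the word-boundary approach:
import re

line = 'a.function(); a.CallMethod(w); E.aa(w);'
# \b only matches at word boundaries, so 'a' inside 'aa' or 'CallMethod' is untouched.
line = re.sub(r'\ba\b', 'airplane', line)
line = re.sub(r'\bw\b', 'worm', line)
print(line)  # airplane.function(); airplane.CallMethod(worm); E.aa(worm);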

Dynamic Context free grammar NLTK

Trying to generate sentences with an NLTK CFG. I would like to know if it is possible to connect a SQL database to feed the nouns and verbs into the program below.
In the example below, door, window, open, and close are hardcoded. How can I dynamically ask NLTK to read the nouns and verbs from, say, an Excel or database column in this particular context?
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
S -> VP NP
NP -> Det N
VP -> V
Det -> 'the '
N -> 'door' | 'window'
V -> 'Open' | 'Close'
""")
print(grammar)

for sentence in generate(grammar, n=100):
    print(' '.join(sentence))
It seems that you can't dynamically change an NLTK CFG – once it is instantiated, it stays put. You need to define all of the vocabulary immediately when constructing the CFG.
As far as I can see, you have two options to include comprehensive vocabulary from an external resource:
Build up a grammar string as in the example you posted, and use CFG.fromstring() to parse it. You might have to take care of some escaping issues (e.g. quotes/apostrophes in the terminal symbols).
Use the CFG constructor directly, providing it with a list of productions, e.g.:
from nltk import CFG, Production, Nonterminal

prods = [Production(Nonterminal('S'), (Nonterminal('PN'), Nonterminal('V'))),
         Production(Nonterminal('PN'), ('Sam',)),
         Production(Nonterminal('PN'), ('Fred',)),
         Production(Nonterminal('V'), ('sleeps',))]
g = CFG(Nonterminal('S'), prods)
This looks somewhat verbose, but it's probably easier and faster to construct this nested structure of Python datatypes rather than writing a bug-free serialiser for the (more concise) grammar string format.
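To connect this back to the question, here is a hypothetical sketch of the second option with the vocabulary pulled from a SQLite database. The database, table, and column names are made up; any source of word lists (e.g. an Excel export) would slot in the same way.
import sqlite3
from nltk import CFG, Production, Nonterminal
from nltk.parse.generate import generate

# Hypothetical schema: tables `nouns` and `verbs`, each with a `word` column.
conn = sqlite3.connect('vocab.db')
nouns = [row[0] for row in conn.execute('SELECT word FROM nouns')]
verbs = [row[0] for row in conn.execute('SELECT word FROM verbs')]

# Fixed structural rules, as in the question's grammar.
prods = [Production(Nonterminal('S'), (Nonterminal('VP'), Nonterminal('NP'))),
         Production(Nonterminal('NP'), (Nonterminal('Det'), Nonterminal('N'))),
         Production(Nonterminal('VP'), (Nonterminal('V'),)),
         Production(Nonterminal('Det'), ('the',))]

# Terminal rules generated from the database rows.
prods += [Production(Nonterminal('N'), (n,)) for n in nouns]
prods += [Production(Nonterminal('V'), (v,)) for v in verbs]

grammar = CFG(Nonterminal('S'), prods)
for sentence in generate(grammar, n=100):
    print(' '.join(sentence))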