Generating sentences with NLTK and Python - python-2.7

I'm having trouble using NLTK to generate random sentences from a custom corpus.
Before I start, I'd like to mention that I'm using NLTK version 2.x, so the "generate" function still exists.
Here is my current code:
import nltk

file = open('corpus/romeo and juliet.txt', 'r')
words = file.read()
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(length=10)
This runs, but does not create random sentences (I'm going for a horse_ebooks vibe). Instead, it returns the first 10 words of my corpus every time.
However, if I use NLTK's brown corpus, I get the desired random effect.
text = nltk.Text(nltk.corpus.brown.words())
print text.generate(length=10)
Looking at the Brown corpus files, it seems as though every word is separated and tagged with its part of speech (verb, adjective, etc.) - something I thought would be handled by the word_tokenize function in my first block of code.
Is there a way to generate a corpus like the Brown example - even if it means converting my txt documents into that fashion instead of reading them directly?
Any help would be appreciated - the documentation on this is either horribly outdated or just says to use Markov chains (which I have tried, but I want to figure this out!). I understand generate() was removed as of NLTK 3.0 because of bugs.
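One way to package plain .txt files like the built-in corpora - a sketch only, with the folder name and file pattern taken from the path above - is NLTK's PlaintextCorpusReader, whose words() method returns a flat word list just like nltk.corpus.brown.words():
import nltk
from nltk.corpus import PlaintextCorpusReader

# 'corpus' is the folder holding the .txt files
reader = PlaintextCorpusReader('corpus', r'.*\.txt')
text = nltk.Text(reader.words('romeo and juliet.txt'))
print text.generate(length=10)
This gives you a corpus object shaped like the Brown example; I can't promise it changes generate()'s output, since that depends on the language model it builds internally.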

Related

Regular Expressions - Read through Text Doc and Extract Sentences with a Specific Word

I have a series of large text documents. I need to read through them and - if a particular word appears - extract the entire sentence.
So, if I'm searching for the word wobble and a sentence in the document is Weebles wobble but they don't fall down, I want to extract that sentence.
What is the most efficient way to do this?
I can think of two approaches to this:
Search the document for the word, then extract the particular sentence; or
Iterate through each sentence in the document, check each sentence for the word, and if the sentence has the word, extract it.
I would think option 1 is more computationally efficient than option 2, but I'm not sure what the syntax would be.
Is there another approach I'm not considering?
Any help on efficiency and syntax appreciated.
You first need to split the text document into proper sentences. The best way to do that is with the nltk.data tokenizer; first make sure you have the Python NLTK library (and its punkt tokenizer data) installed properly.
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
txt = open("txt_file.txt")
data = txt.read()
all_sentences = tokenizer.tokenize(data)
required_sentences = []
for each_sentence in all_sentences:
    if 'wobble' in each_sentence:
        required_sentences.append(each_sentence)
print(required_sentences)
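If you wanted to try approach 1 from the question instead (find the word first and pull out the surrounding sentence), a crude regex version - only a sketch, and rougher than the punkt tokenizer above because it treats '.', '!' and '?' as sentence boundaries - might look like:
import re

txt = open("txt_file.txt")
data = txt.read()
# every run of non-terminator characters containing 'wobble', up to the next . ! or ?
required_sentences = re.findall(r"[^.!?]*\bwobble\b[^.!?]*[.!?]?", data)
print(required_sentences)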

SpaCy questions about lists of lists in Python 2.7

I think part of my issue has to do with spaCy and part has to do with not understanding the most elegant way to work within python itself.
I am loading a txt file in Python, tokenizing it into sentences, and then tokenizing those sentences into words with NLTK:
sent_text = nltk.sent_tokenize(text)
tokenized_text = [nltk.word_tokenize(x) for x in sent_text]
That gives me a list of lists, where each list within the main list is a sentence of tokenized words. So far so good.
I then run it through SpaCy:
text = nlp(unicode(tokenized_text))
Still a list of lists, same thing, but it has all the SpaCy info.
This is where I'm hitting a block. Basically what I want to do is, for each sentence, only retain the nouns, verbs, and adjectives, and within those, also get rid of auxiliaries and conjunctions. I was able to do this earlier by creating a new empty list and appending only what I want:
sent11 = []
for token in sent1:
    if (token.pos_ == 'NOUN' or token.pos_ == 'VERB' or token.pos_ == 'ADJ') and (token.dep_ != 'aux') and (token.dep_ != 'conj'):
        sent11.append(token)
This works fine for a single sentence, but I don't want to be doing it for every single sentence in a book-length text.
Then, once I have these new lists (or whatever the best way to do it is) containing only the pieces I want, I want to use SpaCy's "similarity" function to determine which sentence is closest semantically to some other, much shorter text that I've stripped down in the same way (keeping only nouns, adjectives, and verbs).
I've got it working when comparing one single sentence to another by using:
sent1.similarity(sent2)
So I guess my questions are
1) What is the best way to turn a list of lists into a list of lists that only contain the pieces I want?
and
2) How do I cycle through this new list of lists and compare each one to a separate sentence and return the sentence that is most semantically similar (using the vectors that SpaCy comes with)?
You're asking a bunch of questions here so I'm going to try to break them down.
Is nearly duplicating a book-length amount of text by appending each word to a list bad?
How can one eliminate or remove elements of a list efficiently?
How can one compare a sentence to each sentence in the book, where each sentence is a list and the book is a list of sentences?
Answers:
Generally yes, but on a modern system it isn't a big deal. Books are plain text, and English text stored as UTF-8 takes roughly one byte per character; even a long book such as War and Peace comes out to under 3.3 MB. If you are using Chrome, Firefox, or IE to view this page, your computer has more than enough memory to fit a few copies of it into RAM.
In python you can't really.
You can do removal using:
l = [1,2,3,4]
del l[-2]
print(l)
[1,2,4]
but in the background Python shifts every subsequent element of that list over by one, so it is not recommended for large lists. A collections.deque (implemented as a doubly linked list of blocks) has a bit of extra overhead but gives efficient appends and pops at either end; removing from the middle is still O(n).
If memory is an issue then you can also use generators wherever possible. For example you could probably change:
tokenized_text = [nltk.word_tokenize(x) for x in sent_text]
which creates a list that contains tokens of the entire book, with
tokenized_text = (nltk.word_tokenize(x) for x in sent_text)
which creates a generator that yields tokens of the entire book. Generators have almost no memory overhead and instead compute the next element as they go.
I'm not familiar with SpaCy, and while the question fits on SO you're unlikely to get good answers about specific libraries here.
From the looks of it you can just do something like:
best_match = None
best_similarity_value = 0
for token in parsed_tokenized_text:
    similarity = token.similarity(sent2)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = token
And if you wanted to check against multiple sentences (non-consecutive) then you could put an outer loop that goes through those:
for sent2 in other_token_list:
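To sketch how the two pieces could fit together - with the caveat that this assumes a spaCy model that ships with word vectors loaded as nlp, and that the raw text is passed to spaCy directly instead of the pre-tokenized list of lists - the filtering and the similarity search might look like this:
import spacy

nlp = spacy.load('en')  # assumption: an English model with word vectors

KEEP_POS = {'NOUN', 'VERB', 'ADJ'}
DROP_DEP = {'aux', 'conj'}

def strip_down(span):
    # keep only nouns/verbs/adjectives that are not auxiliaries or conjunctions
    kept = [tok.text for tok in span
            if tok.pos_ in KEEP_POS and tok.dep_ not in DROP_DEP]
    return nlp(u' '.join(kept)) if kept else None

book_text = open('book.txt').read().decode('utf-8')   # placeholder path
short_text = u'the shorter text to compare against'   # placeholder

doc = nlp(book_text)
query = strip_down(nlp(short_text))

best_sentence, best_score = None, 0.0
for sentence in doc.sents:
    stripped = strip_down(sentence)
    if stripped is None:
        continue
    score = stripped.similarity(query)
    if score > best_score:
        best_sentence, best_score = sentence, score

print(u'%s (similarity %.3f)' % (best_sentence.text, best_score))
Letting spaCy split the sentences itself (doc.sents) also avoids juggling the nested lists entirely.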

filename sequence extraction python

I have to identify and isolate a number sequence from the file names in a folder of files, and optionally, identify non-continuous sequences. The files are .dpx files. There is almost no file naming structure except that somewhere in the filename is a sequence number and an extension of '.dpx'. There is a wonderful module called PySeq that can do all of the hard work, except it just bombs with a directory of thousands and sometimes hundreds of thousands of files: "Argument list too large". Has anyone had experience working with sequence number isolation and dpx files in particular? Each file can be up to 100 MB in size. I am working on a CentOS box using Python 2.7.
File names might be something like:
test00_take1_00001.dpx
test00_take1_00002.dpx
another_take_ver1-0001_3.dpx
another_take_ver1-0002_3.dpx
(Two continuous sequences)
This should do exactly what you're looking for. It builds a dict of dicts keyed on the text before and after the number, and puts each full filename into a list.
It then gathers all of those lists into a single list of lists (you could skip this step and yield the lists from a generator instead, for better memory efficiency).
import re
from collections import defaultdict

input_list = [
    "test00_take1_00001.dpx",
    "test00_take1_00002.dpx",
    "another_take_ver1-0001_3.dpx",
    "another_take_ver1-0002_3.dpx"]
results_dict = defaultdict(lambda: defaultdict(list))
matches = (re.match(r"(.*?[\W_])\d+([\W_].*)", item) for item in input_list)
for match in matches:
    results_dict[match.group(1)][match.group(2)].append(match.group(0))
results_list = [d2 for d1 in results_dict.values() for d2 in d1.values()]
>>> results_list
[['another_take_ver1-0001_3.dpx', 'another_take_ver1-0002_3.dpx'], ['test00_take1_00001.dpx', 'test00_take1_00002.dpx']]
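If the "Argument list too large" error is the shell choking on thousands of names passed as command-line arguments, building input_list from inside Python sidesteps that limit entirely - a sketch, with the directory path as a placeholder:
import os

input_list = [name for name in os.listdir('/path/to/dpx/folder') if name.endswith('.dpx')]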

Pygments syntax highlighter in python tkinter text widget

I have been trying to incorporate syntax highlighting with the tkinter text widget. However, using the code found on this post, I cannot get it to work. There are no errors, but the text is not highlighted and a line is skipped after each character. If there is a better way to incorporate syntax highlighting with the tkinter text widget, I would be happy to hear it. Here is the smallest code I could find that replicates the issue:
import Tkinter
import ScrolledText
from pygments import lex
from pygments.lexers import PythonLexer
root = Tkinter.Tk(className=" How do I put an end to this behavior?")
textPad = ScrolledText.ScrolledText(root, width=100, height=80)
textPad.tag_configure("Token.Comment", foreground="#b21111")
code = textPad.get("1.0", "end-1c")
# Parse the code and insert into the widget
def syn(event=None):
    for token, content in lex(code, PythonLexer()):
        textPad.insert("end", content, str(token))
textPad.pack()
root.bind("<Key>", syn)
root.mainloop()
So far, I have not found a solution to this problem (otherwise I would not be posting here). Any help regarding syntax highlighting a tkinter text widget would be appreciated.
Note: This is on python 2.7 with Windows 7.
The code in the question you linked to was designed more for highlighting already existing text, whereas it looks like you're trying to highlight it as you type.
I can give some suggestions to get you started, though I've never done this and don't know what the most efficient solution is. The solution in this answer is only a starting point; there's no guarantee it is actually suited to your problem.
The short synopsis is this: don't set up a binding that inserts anything. Instead, just highlight what was inserted by the default bindings.
To do this, the first step is to bind on <KeyRelease> rather than <Key>. The difference is that <KeyRelease> will happen after a character has been inserted whereas <Key> happens before a character is inserted.
Second, you need to get tokens from the lexer and apply tags to the text for each token. To do that you need to keep track of where in the document the lexer is, and then use the length of the token to determine the end of the token.
In the following solution I create a mark ("range_start") to designate the current location in the file where the pygments lexer is, and then compute the mark "range_end" based on the start and the length of the token returned by pygments. I don't know how robust this is in the face of multi-byte characters; for now, let's assume single-byte characters.
def syn(event=None):
    textPad.mark_set("range_start", "1.0")
    data = textPad.get("1.0", "end-1c")
    for token, content in lex(data, PythonLexer()):
        textPad.mark_set("range_end", "range_start + %dc" % len(content))
        textPad.tag_add(str(token), "range_start", "range_end")
        textPad.mark_set("range_start", "range_end")
This is crazy inefficient since it re-applies the highlighting to the whole document on every keypress. There are ways to minimize that, such as only highlighting after each word, or when the GUI goes idle, or some other sort of trigger.
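One way to throttle it - a sketch that plugs into the snippet above and assumes the textPad, syn and root names from it - is to debounce the re-highlight with Tkinter's after()/after_cancel(), so the lexer only runs once typing has paused for a moment:
pending = [None]  # id of the currently scheduled re-highlight, if any

def schedule_syn(event=None):
    # cancel any pending run and schedule a new one 300 ms after the last keystroke
    if pending[0] is not None:
        textPad.after_cancel(pending[0])
    pending[0] = textPad.after(300, syn)

root.bind("<KeyRelease>", schedule_syn)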
To highlight certain words you can do this:
textarea.tag_remove("tagname", "1.0", tkinter.END)
first = "1.0"
while True:
    first = textarea.search("word_you_are_looking_for", first, nocase=False, stopindex=tkinter.END)
    if not first:
        break
    last = first + "+" + str(len("word_you_are_looking_for")) + "c"
    textarea.tag_add("tagname", first, last)
    first = last
textarea.tag_config("tagname", foreground="#00FF00")

Automatically finding numbering patterns in filenames

Intro
I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z positions, then wait a certain amount of time (next timepoint) and take 10 slices again.
They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point.
Question
Once you have all these file names, you can use a regex pattern to extract the Z and T value, if you know the backbone pattern of the file name. This I know how to do.
The question I have is: do you know a way to automatically generate a regex pattern from the file name list? For instance, there is an awesome tool on the net that does a similar thing: txt2re.
What algorithm would you use to parse all the file name list and generate a most likely regex pattern?
There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is
my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";
outputs:
this\ is\ (?:Perl|Ruby)
Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't give you capturing of numbers etc. so it wouldn't be completely automatic. After getting the diff you would have to hand-edit or do some kind of substitution to get a working final regex.
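If you wanted to stay in Python, difflib can be used for a rough equivalent - a sketch only, and like the Perl module it needs hand-editing before the result is a useful final regex:
import difflib
import re

def diff_regex(a, b):
    # shared runs become literals; differing runs become an alternation,
    # or \d+ when both sides are purely digits
    parts = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == 'equal':
            parts.append(re.escape(a[i1:i2]))
        elif a[i1:i2].isdigit() and b[j1:j2].isdigit():
            parts.append(r'\d+')
        else:
            parts.append('(?:%s|%s)' % (re.escape(a[i1:i2]), re.escape(b[j1:j2])))
    return ''.join(parts)

print(diff_regex('2009-11-03-experiment1-Z07-T42.tif',
                 '2009-11-03-experiment1-Z07-T43.tif'))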
First of all, you are trying to do this the hard way. I suspect that this is not impossible, but you would have to apply some artificial intelligence techniques and it would be far more complicated than it is worth. Either neural networks or a genetic algorithm system could be trained to recognize the Z numbers and T numbers, assuming that the format Z[0-9]+ and T[0-9]+ is always used somewhere in the filename.
What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.
I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would represent the count of filenames with 1 match, and the ones with multiple matches. And I would count the total number of filenames processed.
At the end, I would report as follows:
nnnnnnnnnn filenames processed
Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.
T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
If you are lucky, there will be no multiple matches at all, and you could use the regexes above to extract your numbers. However, if there are any significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether or not a simple adjustment to the regex might work.
For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.
Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ -/.=]T[0-9]+[ -/.=].
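A minimal sketch of that counting script, assuming filenames is a list of the file names to inspect (the report format follows the one above):
import re

z_pat = re.compile(r'Z[0-9]+')
t_pat = re.compile(r'T[0-9]+')

total = z_once = z_multi = t_once = t_multi = 0
for name in filenames:  # filenames: list of file names to inspect (assumed)
    total += 1
    z_hits = z_pat.findall(name)
    t_hits = t_pat.findall(name)
    if len(z_hits) == 1:
        z_once += 1
    elif len(z_hits) > 1:
        z_multi += 1
    if len(t_hits) == 1:
        t_once += 1
    elif len(t_hits) > 1:
        t_multi += 1

print('%d filenames processed' % total)
print('Z-numbers matched only once in %d filenames.' % z_once)
print('Z-numbers matched multiple times in %d filenames.' % z_multi)
print('T-numbers matched only once in %d filenames.' % t_once)
print('T-numbers matched multiple times in %d filenames.' % t_multi)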
For Python, see this question about TemplateMaker.