Lemmatizing Italian sentences for frequency counting - python-2.7

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatizing to stemming because I can infer a word's meaning from its context in the sentence (e.g. distinguish between a verb and a noun) and obtain words that actually exist in the language, rather than roots that usually have no meaning on their own.
I found a library called pattern (pip2 install pattern) that is supposed to complement nltk in order to perform lemmatization of Italian; however, I am not sure the approach below is correct, because each word is lemmatized by itself rather than in the context of a sentence.
Probably I should let pattern tokenize the sentence (and thus annotate each word with metadata indicating whether it is a verb/noun/adjective etc.) and then retrieve the lemma of each word, but I have not been able to do this and I am not even sure it is possible at the moment.
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually two words: "lo" and "appartamento". Right now I cannot find a way to split these two words with a combination of nltk and pattern, so I cannot count word frequencies correctly.
import nltk
import string
import pattern.it
# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# the word is coming from, i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
How can I effectively lemmatize sentences with pattern using its own tokenizer (so that lemmas are assigned according to whether the word is a noun/verb/adjective etc.)?
Is there a Python alternative to pattern that can be used together with nltk for Italian lemmatization?
How can I split articles that are bound to the next word by an apostrophe?

I'll try to answer your question, knowing that I don't know a lot about Italian!
1) As far as I know, the main responsibility for handling the apostrophe lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.
3) A simple thing you can do about it is to split on the apostrophe (although you will probably have to use the re package for more complicated patterns), for example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
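If the truncated article fragments (such as 'all') should not end up in the frequency counts, one rough option, sketched below assuming the lists from the question are still in scope, is to re-apply the stop-word filter after the split; the extra_articles set is my own addition, since I am not sure every truncated form is present in NLTK's Italian stop-word list.
extra_articles = {"l", "un", "all", "dell", "nell", "sull", "dall"}  # truncated forms, my own guess
word_tokenized_cleaned = [
    w for w in word_tokenized_no_punct_no_sw_no_apostrophe
    if w not in it_stop_words and w not in extra_articles
]
print(word_tokenized_cleaned)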
2) An alternative to pattern would be treetagger; granted, it is not the easiest install of all (you need the Python package and the tool itself), but after that part it works on Windows and Linux.
A simple example with your sentence above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also nicely tokenized all'ippodromo into all' and ippodromo (lemmatizing the former to al, which is hopefully correct) under the hood before lemmatizing. Now we just need to remove the stop words and punctuation and we are done.
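As a rough sketch of that last step (assuming the it_string, tagger and tags from above, and that the Italian parameter files for TreeTagger are installed), the lemmas could be filtered against NLTK's Italian stop-word list and counted with collections.Counter:
import string
from collections import Counter

import nltk

it_stop_words = set(nltk.corpus.stopwords.words('italian'))

lemma_counts = Counter(
    tag.lemma.lower()
    for tag in treetaggerwrapper.make_tags(tags)
    if tag.word.lower() not in it_stop_words    # drop stop words by surface form
    and tag.lemma not in string.punctuation     # drop '.', ',', ...
)
print(lemma_counts.most_common())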
The doc for installing the TreeTaggerWrapper library for python

I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 when it comes to parsing words like all'ippodromo or dall'Italia, so I want to share my experience and give a partial, although late, answer.
The first action/rule that an NLP pipeline must take into account is preparing the corpus. I discovered that by replacing the ' character with a proper typographic apostrophe ’ (using an accurate regex replacement during text parsing, or simply a preliminary replace-all in a basic text editor), the tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text).
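A minimal sketch of that preprocessing step; the particular regex and the language argument to word_tokenize are my own choices, not necessarily what was used above:
import re
import nltk

text = "Ieri sono tornato dall'Italia. Oggi volevo andare all'ippodromo."
# swap the straight apostrophe between two letters for the typographic one
prepared = re.sub(r"(\w)'(\w)", r"\1’\2", text)
tokens = nltk.tokenize.word_tokenize(prepared, language="italian")
print(tokens)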

Related

Fastest way to replace phrases from sentences with Python?

I have a list of 3800 names I want to remove from 750K sentences.
The names can contain multiple words such as "The White Stripes".
Some names might also look like a subset of a larger name, e.g. 'Ame' may be one name and 'Amelie' another.
This is what my current implementation looks like:
import re

def find_whole_word(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']  # 3800+ names

def strip_names(sentence: str):
    token = sentence.lower()
    has_name = False
    matches = []
    for name in names_lowercase:
        match = find_whole_word(name)(token)
        if match:
            matches.append(match)

    def get_match(match):
        return match.group(1)

    matched_strings = list(map(get_match, matches))
    matched_strings.sort(key=len, reverse=True)
    for matched_string in matched_strings:
        # strip names at the start, end and when they occur in the middle of text (with whitespace around)
        token = re.sub(rf"(?<!\S){matched_string}(?!\S)", "", token)
    return token

sentences = [
    "how now brown cow",
    "die hard fan of slayer",
    "the white stripes kill",
    "besides slayer I believe the white stripes are the best",
    "who let ame out",
    "amelie has got to go"
]  # 750K+ sentences
filtered_list = [strip_names(sentence) for sentence in sentences]
# Expected: filtered_list = ["how now brown cow", "die hard fan of ", " kill", "besides I believe are the best", "who let out", " has got to go"]
My current implementation takes several hours. I don't care about readability as this code won't be used for long.
Any suggestions on how I can reduce the run time?
My previous solution was overkill.
All I really had to do was use the word boundary \b as described in the documentation.
Usage example: https://regex101.com/r/2CZ8el/1
import re

names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)

def strip_names(text: str):
    return re.sub(names_whole_words_filter_expression, "", text).strip()
Now it takes a few minutes instead of a few hours 🙌
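One optional follow-up, as a sketch of my own rather than part of the answer above: removing a name from the middle of a sentence leaves a double space behind, which can be collapsed in the same pass.
import re

def strip_names_clean(text: str):
    stripped = re.sub(names_whole_words_filter_expression, "", text)
    # collapse the double spaces left where a name was removed mid-sentence
    return re.sub(r"\s{2,}", " ", stripped).strip()

filtered_list = [strip_names_clean(sentence) for sentence in sentences]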

Spacy to Conll format without using Spacy's sentence splitter

This post shows how to get dependencies of a block of text in Conll format with Spacy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in Conll format with Spacy without having to use Spacy's sentence splitter ?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your split sentences, and then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # each doc is a single sentence here, so offset from the doc's first token
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
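For example, assuming nlp has been loaded as in the question and my_sentence_splitter is a stand-in name for your own splitter returning a list of sentence strings, the driver loop could look like this:
import spacy

nlp = spacy.load('en')

def my_sentence_splitter(text):
    # placeholder: plug in your own sentence splitter here
    return [s for s in text.split('. ') if s]

for sentence in my_sentence_splitter(u'Bob bought the pizza to Alice. Alice ate it.'):
    printConll(sentence)
    print()  # blank line between sentences, as the CoNLL format expects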
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (in that post) is to add the flexibility to plug in one's own sentence boundary detection rules. The problem is that sentence boundary detection is solved in conjunction with dependency parsing by spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are indeed tightly coupled by design in spaCy. However, there is a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect way), but since such a sentencizer exists, you can create a custom one using your own rules, and it will definitely affect the sentence boundaries.
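As a sketch (assuming a spaCy v2-style pipeline, where a plain function can be added as a pipeline component), you can set token.is_sent_start yourself before the parser runs; the semicolon rule below is only a placeholder for your own boundary logic:
import spacy

def custom_boundaries(doc):
    # placeholder rule: start a new sentence after every semicolon
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_boundaries, before="parser")

doc = nlp(u"Bob bought the pizza; Alice ate it.")
print([sent.text for sent in doc.sents])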

Finding Alliterative Word Sequences with Python

I am working in Python 3.6 with NLTK 3.2.
I am trying to write a program which takes raw text as input and outputs any (maximum) series of consecutive words beginning with the same letter (i.e. alliterative sequences).
When searching for sequences, I want to ignore certain words and punctuation (for instance, 'it', 'that', 'into', ''s', ',', and '.'), but to include them in the output.
For example, inputting
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should yield
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming and the best I could come up with is:
import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw)  # word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]  # make it lowercase
for w in tokened_text:  # for each word of the text
    letter = w[0]  # consider its first letter
    allit_str = []
    allit_str.append(w)  # add that word to a list
    pos = tokened_text.index(w)  # let "pos" be the position of the word being considered
    for i in range(1, len(tokened_text) - pos):  # consider the next word
        if tokened_text[pos+i] in {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}:  # if it's one of these
            allit_str.append(tokened_text[pos+i])  # add it to the list
            i=+1  # and move on to the next word
        elif tokened_text[pos+i][0] == letter:  # or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i])  # add the word to the list
            i=+1  # and move on to the next word
        else:  # or else, if the letter is different
            break  # break the for loop
    if len(allit_str) >= 2:  # if the list has two or more members
        print(allit_str)  # print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
This is close to what I want, except that I don't know how to restrict the program to only print the maximum sequences.
So my questions are:
How can I modify this code to only print the maximum sequence
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
Is there an easier way to do this in Python, maybe with regular expression or more elegant code?
Here are similar questions asked elsewhere, but which have not helped me modify my code:
How do you effectively use regular expressions to find alliterative expressions?
A reddit challenge asking for a similar program
4chan question regarding counting instances of alliteration
Blog about finding most common alliterative strings in a corpus
(I also think it would be nice to have this question answered on this site.)
Interesting task. Personally, I'd loop through without the use of indices, keeping track of the previous word to compare it with the current word.
Additionally, it's not enough to compare letters; you have to take into account that 's' and 'sh' etc don't alliterate. Here's my attempt:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."

# Get the English alphabet as a list of letters
letters = [letter for letter in string.ascii_lowercase]

# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters

# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"]  # add extra stopwords here
stopwords = set(stopwords)  # sets are MUCH faster to process

sents = sent_tokenize(raw)

alliterating_sents = defaultdict(list)
for sent in sents:
    tokenized_sent = word_tokenize(sent)

    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound):  # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]:  # prevents duplication in chains of more than 2 alliterations, but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)
                break  # Allows us to treat sh/s distinctly

        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word not in stopwords:  # ignores stopwords for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word

    alliterating_sents[len(alliterating_words)].append(sent)

sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))

# OUTPUT
print("A sorted ordered dict of sentences by number of alliterations:")
print(sorted_alliterating_sents)
print("-" * 15)
max_key = max([k for k in sorted_alliterating_sents])  # to get sent with max alliteration
print("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])
This produces a sorted ordered dictionary of sentences with their alliteration counts as its keys. The max_key variable contains the count for the highest alliterating sentence or sentences, and can be used to access the sentences themselves.
The accepted answer is very comprehensive, but I would suggest using Carnegie Mellon's pronouncing dictionary (CMUdict). This is partly because it makes life easier, and partly because syllables that sound identical without being identical letter-for-letter also count as alliteration. An example I found online (https://examples.yourdictionary.com/alliteration-examples.html) is "Finn fell for Phoebe".
import re
import nltk

# nltk.download('cmudict')  ## download CMUdict for phoneme set
# The phoneme dictionary consists of ARPABET entries which encode
# vowels, consonants, and a representative stress level (wiki/ARPABET)
phoneme_dictionary = nltk.corpus.cmudict.dict()
stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
                  '#', ':', ':1', '.', ':2', '?', ':3']

# nltk.download('stopwords')  ## download stopwords (the, a, of, ...)
# Get stopwords that will be discarded in comparison
stopwords = nltk.corpus.stopwords.words("english")

# Function for removing all punctuation marks (. , ! * etc.)
no_punct = lambda x: re.sub(r'[^\w\s]', '', x)

def get_phonemes(word):
    if word in phoneme_dictionary:
        return phoneme_dictionary[word][0]  # return first entry by convention
    else:
        return ["NONE"]  # no entries found for input word

def get_alliteration_level(text):  # alliteration based on sound, not only letter!
    count, total_words = 0, 0
    proximity = 2  # max phonemes to compare to for consideration of alliteration
    i = 0  # index for placing phonemes into current_phonemes
    lines = text.split(sep="\n")
    for line in lines:
        current_phonemes = [None] * proximity
        for word in line.split(sep=" "):
            word = no_punct(word)  # remove punctuation marks for correct identification
            total_words += 1
            if word not in stopwords:
                if get_phonemes(word)[0] in current_phonemes:  # alliteration occurred
                    count += 1
                current_phonemes[i] = get_phonemes(word)[0]  # update new comparison phoneme
                i = 0 if i == 1 else 1  # update storage index

    alliteration_score = count / total_words
    return alliteration_score
Above is the proposed script. The proximity variable is introduced so that alliterating syllables are still considered even when they are separated by several words. The stress_symbols variable reflects the stress levels indicated in the CMU dictionary, and it could easily be incorporated into the function.
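A small usage sketch of my own, assuming the cmudict and stopwords corpora have already been downloaded:
sample = "The door was ajar. So it seems that Sam snuck into Sally's subaru.\nFinn fell for Phoebe."
print(get_alliteration_level(sample))  # fraction of words that alliterate with a nearby word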

Find a value based on a matching word and store it

I am new to Python, and I'm just trying to get a feel for the language.
I have a file called lion.txt that has this text:
The lion (Panthera leo) is one of the big cats in the genus Panthera and a member of the family Felidae. The commonly used term African lion collectively denotes the several subspecies in Africa. With some males exceeding=250/12 kg (550 lb) in weight,[4].
What I want my program to do is search for the keyword exceeding and write only the value 250 to another file called searched.txt. Even better, is it possible to store it in a variable and then print it to another text file?
This is what I have so Far:
import os
import re

os.chdir("C:\Python 2016 Training\lionfolder")
f = open("lion.txt", "r")
w = open("searched.txt", "w")
k = []  # Figured a dictionary would be the best way to deal with this?
for line in f:
    if re.match('(.*)exceeding(.*)', line):
        w.write(k[1] = "line")
Is what I'm asking to do even possible with Python?
Thank you in advance
Regards,
Kevin.
Not bad for a first attempt. You're close to a working solution, but missing some critical parts. Try this:
import os
import re

f = open('lion.txt', 'r')
w = open('searched.txt', 'w')
for line in f:
    match = re.search('exceeding\=(\d+)', line)
    if match:
        w.write(match.group(1))
w.close()
f.close()
There are better ways of doing this, but I have tried to stay as close to your original code as possible, so you don't get lost.
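For instance, a slightly more idiomatic sketch using with blocks (so the files are closed even if an error occurs) and a raw string for the pattern:
import re

with open('lion.txt', 'r') as f, open('searched.txt', 'w') as w:
    for line in f:
        match = re.search(r'exceeding=(\d+)', line)
        if match:
            w.write(match.group(1))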

How to iterate a python list and compare items in a string or another list

Following my earlier question, I have tried to write code that returns a sentence if a search term from a certain list appears in it, as follows.
import re
from nltk import tokenize
from nltk.tokenize import sent_tokenize
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\
Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\
at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \
the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \
estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \
conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\
Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \
examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \
(half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \
(95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users."
    words = txt
    corpus = " ".join(words).lower()
    sentences1 = sent_tokenize(corpus)
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if [item in List1] in word_tokenize(j)]
    for i in a:
        print i,'\n','\n'

foo()
The problem is that Python IDLE does not print anything. What could I have done wrong? The code runs, and all I get is this:
>
>
Your question isn't very clear to me, so please correct me if I'm getting this wrong. Are you trying to match the list of keywords (in List1) against the text (in txt)? That is:
For each keyword in List1
Do a match against every sentence in txt.
Print the sentence if it matches?
Instead of writing a complicated regular expression to solve your problem, I have broken it down into 2 parts.
First I break the whole text into a list of sentences. Then I write a simple regular expression to go through every sentence. The trouble with this approach is that it is not very efficient, but hey, it solves your problem.
Hope this small chunk of code can help guide you to the real solution.
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "blah blah blah - truncated"
    words = txt
    matches = []
    sentences = re.split(r'\.', txt)
    keyword = List1[0]
    pattern = keyword
    re.compile(pattern)
    for sentence in sentences:
        if re.search(pattern, sentence):
            matches.append(sentence)
    print("Sentence matching the word (" + keyword + "):")
    for match in matches:
        print(match)
--------- Generate random number -----
from random import randint

List1 = ['risk','cancer','ocp','hormone','OCP',]
print(randint(0, len(List1) - 1))  # gives a random index - use the index to access List1
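If, instead of a single (or random) keyword, you want to check every entry of List1, a sketch along the same lines could be:
import re

List1 = ['risk', 'cancer', 'ocp', 'hormone', 'OCP']
txt = "blah blah blah - truncated"  # use the full text from the question here

sentences = re.split(r'\.', txt)
for sentence in sentences:
    for keyword in List1:
        if re.search(r'\b' + re.escape(keyword) + r'\b', sentence):
            print("Sentence matching the word (" + keyword + "):")
            print(sentence)
            break  # stop after the first matching keyword for this sentence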