NLTK package to estimate the (unigram) perplexity - python-2.7

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation on my data and I have generated the unigrams and their respective probabilities (they are normalized, since the total probability of the data sums to 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The probabilities in the second column sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused about how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please advise on what I can do. Thanks in advance!

Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
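(Written out, this is just the standard definition: for a test set W = w_1 w_2 ... w_N,
PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = ( prod_{i=1}^{N} 1/P(w_i) )^(1/N),
which is what the snippet below computes.)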
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
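If your unigrams currently live in a text file like the fragment you posted (word and probability on each line), a minimal sketch for loading it into such a dictionary might look like this (the filename 'unigrams.txt' is a placeholder, not from your question):
unigram = {}
with open('unigrams.txt') as f:       # placeholder filename
    for line in f:
        word, prob = line.split()     # each line is "word probability"
        unigram[word] = float(prob)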
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is more desirable than one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective perplexity is also smaller.
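To see where the second number comes from: none of the three words in testset2 is in the vocabulary, so each falls back to the smoothing default of 0.01, and the perplexity works out to ((1/0.01)^3)^(1/3) = 100, up to floating-point error:
print model['abracadabra']         # 0.01, the defaultdict fallback for unseen words
print pow(pow(1/0.01, 3), 1/3.0)   # effectively 100, matching the second result above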

Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see it was already answered ...


Time variable units "day as %Y%m%d.%f" in python iris

I am hoping someone can help. I am running a few climate models (NetCDF files) in Python using iris. All was working well until I added my last model, which is formatted differently. The unit used for the time variable in the new model is day as %Y%m%d.%f, but in the other models it is days since …. This means that when I try to constrain the time variable I get the following error: AttributeError: 'numpy.float64' object has no attribute 'year'.
I tried adding a year variable using iriscc.add_year(EARTH3, 'time') but that just brings up the error 'Unit has undefined calendar'.
I'm wondering if you know how I might fix this? Do I need to convert the calendar type? Or is there a way around that? Not sure how to do that anyway!
Thank you!
Erika
EDIT: here is the full code for my file. The model CanESM2 is working, but the model EARTH3 is not - it is the one with the funny time units.
import matplotlib.pyplot as plt
import iris
import iris.coord_categorisation as iriscc
import iris.plot as iplt
import iris.quickplot as qplt
import iris.analysis.cartography
import cf_units
from cf_units import Unit
import datetime
import numpy as np
def main():
    #-------------------------------------------------------------------------
    #bring in all the GCM models we need and give them a name
    CanESM2 = '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tasmin_Amon_CanESM2_historical_r1i1p1_185001-200512.nc'
    EARTH3 = '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tas_Amon_EC-EARTH_historical_r3i1p1_1850-2009.nc'

    #Load exactly one cube from given file
    CanESM2 = iris.load_cube(CanESM2)
    EARTH3 = iris.load_cube(EARTH3)

    print "CanESM2 time"
    print (CanESM2.coord('time'))
    print "EARTH3 time"
    print (EARTH3.coord('time'))

    #fix EARTH3 time units as they differ from all other models
    t_coord = EARTH3.coord('time')
    t_unit = t_coord.attributes['invalid_units']
    timestep, _, t_fmt_str = t_unit.split(' ')
    new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
    new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
    new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
    new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
    new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)

    print "EARTH3 new time"
    print (EARTH3.coord('time'))

    #regrid all models to have same latitude and longitude system, all regridded to model with lowest resolution
    CanESM2 = CanESM2.regrid(CanESM2, iris.analysis.Linear())
    EARTH3 = EARTH3.regrid(CanESM2, iris.analysis.Linear())

    #we are only interested in the latitude and longitude relevant to Malawi (has to be slightly larger than country boundary to take into account resolution of GCMs)
    Malawi = iris.Constraint(longitude=lambda v: 32.0 <= v <= 36., latitude=lambda v: -17. <= v <= -8.)
    CanESM2 = CanESM2.extract(Malawi)
    EARTH3 = EARTH3.extract(Malawi)

    #time constraint to make all series the same, for ERAINT this is 1990-2008 and for RCMs and GCMs this is 1961-2005
    iris.FUTURE.cell_datetime_objects = True
    t_constraint = iris.Constraint(time=lambda cell: 1961 <= cell.point.year <= 2005)
    CanESM2 = CanESM2.extract(t_constraint)
    EARTH3 = EARTH3.extract(t_constraint)

    #Convert units to match, CORDEX data is in Kelvin but Observed data in Celsius, we would like to show all data in Celsius
    CanESM2.convert_units('Celsius')
    EARTH3.units = Unit('Celsius') #this fixes EARTH3 which has no units defined
    EARTH3 = EARTH3 - 273 #this converts the data manually from Kelvin to Celsius

    #add year data to files
    iriscc.add_year(CanESM2, 'time')
    iriscc.add_year(EARTH3, 'time')

    #We are interested in plotting the data by year, so we need to take a mean of all the data by year
    CanESM2YR = CanESM2.aggregated_by('year', iris.analysis.MEAN)
    EARTH3YR = EARTH3.aggregated_by('year', iris.analysis.MEAN)

    #Returns an array of area weights, with the same dimensions as the cube
    CanESM2YR_grid_areas = iris.analysis.cartography.area_weights(CanESM2YR)
    EARTH3YR_grid_areas = iris.analysis.cartography.area_weights(EARTH3YR)

    #We want to plot the mean for the whole region so we need a mean of all the lats and lons
    CanESM2YR_mean = CanESM2YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=CanESM2YR_grid_areas)
    EARTH3YR_mean = EARTH3YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=EARTH3YR_grid_areas)

    #-------------------------------------------------------------------------
    #PART 4: PLOT LINE GRAPH
    #limit x axis
    plt.xlim((1961, 2005))

    #assign the line colours and set x axis to 'year' rather than 'time'
    qplt.plot(CanESM2YR_mean.coord('year'), CanESM2YR_mean, label='CanESM2', lw=1.5, color='blue')
    qplt.plot(EARTH3YR_mean.coord('year'), EARTH3YR_mean, label='EC-EARTH (r3i1p1)', lw=1.5, color='magenta')

    #set a title for the y axis
    plt.ylabel('Near-Surface Temperature (degrees Celsius)')

    #create a legend and set its location to under the graph
    plt.legend(loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)

    #create a title
    plt.title('Tas for Malawi 1961-2005', fontsize=11)

    #add grid lines
    plt.grid()

    #show the graph in the console
    iplt.show()

if __name__ == '__main__':
    main()
In Iris, unit strings for time coordinates must be specified in the format <time-period> since <epoch>, where <time-period> is a unit of measure of time, such as 'days', or 'years'. This format is specified by udunits2, the library Iris uses to supply valid units and perform unit conversions.
The time coordinate in this case does not have a unit that follows this format, meaning the time coordinate will not have full time coordinate functionality (this partly explains the Attribute Error in the question). To fix this we will need to construct a new time coordinate based on the values and metadata of the existing time coordinate and then replace the cube's existing time coordinate with the new one.
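For illustration, a valid unit string of this form can be built and used directly with cf_units, the same calls the fix below relies on (the epoch here is just an example, not something from your data):
import datetime
import cf_units

unit = cf_units.Unit('days since 1850-01-01 00:00:00', calendar=cf_units.CALENDAR_STANDARD)
print(unit.date2num(datetime.datetime(1850, 1, 2)))   # 1.0, i.e. one day since the epoch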
To do this we'll need to:
1. construct a new time unit based on the metadata contained in the existing time unit
2. take the existing time coordinate's point values and format them as datetime objects, using the format string specified in the existing time unit
3. convert the datetime objects from (2.) to an array of floating-point numbers using the new time unit constructed in (1.)
4. create a new time coordinate from the array constructed in (3.) and the new time unit produced in (1.)
5. remove the old time coordinate from the cube and add the new one.
Here's the code to do this...
import datetime
import cf_units
import iris
import numpy as np
t_coord = EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)
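# 'cube' below is the cube whose time coordinate is being replaced (EARTH3 in the question's code)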
t_coord_dim = cube.coord_dims('time')
cube.remove_coord('time')
cube.add_dim_coord(new_t_coord, t_coord_dim)
I've made an assumption about the best epoch for your time data. I've also made an assumption about the calendar that best describes your data, but you should be able to replace (when constructing new_t_unit) the standard calendar I've chosen with any other valid cf_units calendar without difficulty.
As a final note, it is effectively impossible to change calendar types. This is because different calendar types include and exclude different days. For example, a 360-day calendar has a Feb 30 but no May 31 (as it assumes 12 idealised 30-day months). If you try to convert from a 360-day calendar to a standard calendar, the problems you hit include what to do with the data from 29 and 30 Feb, and how to fill the five missing days that don't exist in a 360-day calendar. For such reasons it's generally impossible to convert calendars (and Iris doesn't allow such operations).
Hope this helps!
Maybe this answer is not as useful, but here is the function I wrote to convert data from %Y%m%d.%f into a datetime array.
The function creates a complete datetime array without missing values; it can be modified to take missing times into account, although a climate model should not have missing data.
import numpy as np
import pandas as pd

def fromEARTHtime2Datetime(dt, timeVecEARTH):
    """
    This function takes the EARTH %Y%m%d.%f time format and converts it to a
    more useful time, such as a Python datetime array, WITHOUT any missing data!

    Parameters
    ----------
    dt : string
        This is the time discretization; it can be 1h or 6h, but it always
        needs to be in hours, for example dt = '6h'.
    timeVecEARTH : array of float
        Vector of the times to be converted. For example the time of the
        EARTH is day as %Y%m%d.%f.
        Only this format can be converted to datetime, for example:
        20490128.0, 20490128.25, 20490128.5, 20490128.75 will be converted
        to datetime: '2049-01-28 00:00:00', '2049-01-28 06:00:00',
        '2049-01-28 12:00:00', '2049-01-28 18:00:00'

    Returns
    -------
    timeArrNew : datetime
        This is the complete datetime array WITHOUT any missing data,
        for example: DatetimeIndex(['2049-01-28 00:00:00', '2049-01-28 06:00:00',
                       ...
                       '2049-02-28 18:00:00', '2049-03-01 00:00:00'],
                      dtype='datetime64[ns]', length=129, freq='6H')
    """
    # number of timesteps per day and the corresponding fractions of a day
    dtDay = 24/np.float(dt[:-1])
    partOfDay = np.arange(0, 1, 1/dtDay)
    hDay = []
    for ip in partOfDay:
        hDay.append('%02.f:00:00' % (24*ip))
    dictHours = dict(zip(partOfDay, hDay))

    # build the start timestamp from the first %Y%m%d.%f value
    t0Str = str(timeVecEARTH[0])
    timeAux0 = t0Str.split('.')
    timeAux0 = timeAux0[0][0:4] + '-' + timeAux0[0][4:6] + '-' + timeAux0[0][6:] + ' ' + dictHours[float(timeAux0[1])]

    # build the end timestamp from the last %Y%m%d.%f value
    tendStr = str(timeVecEARTH[-1])
    timeAuxEnd = tendStr.split('.')
    timeAuxEnd = timeAuxEnd[0][0:4] + '-' + timeAuxEnd[0][4:6] + '-' + timeAuxEnd[0][6:] + ' ' + dictHours[float(timeAuxEnd[1])]

    # a regular datetime index covering the whole period at the requested frequency
    timeArrNew = pd.date_range(timeAux0, timeAuxEnd, freq=dt)
    return timeArrNew
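A hedged usage sketch (the inputs here are illustrative whole-day values in the same %Y%m%d.%f format, not data from the question; it assumes the numpy/pandas imports above):
timeVec = np.array([20490128.0, 20490129.0, 20490130.0])
print(fromEARTHtime2Datetime('6h', timeVec))
# DatetimeIndex from 2049-01-28 00:00:00 to 2049-01-30 00:00:00 at a 6-hourly frequency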

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and devels.
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib since we are trying to simulate the workings of a small solar panel used for IoT applications; in particular, the panel specs are the following:
12.8% max efficiency, Vmp = 5.82V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatt. I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation for Madrid has been obtained with PVWatt, and this is what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and the curve of production during a day.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location

def irradiance(day, m):
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0,
                   7090.0, 7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 00, 00),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0)*(DIrradiance[m-1])

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 8, 15, 00, 00),
                      end=dt.datetime(2015, 8, 15, 23, 59),
                      freq='60min')
old = irradiance(15, 8)           # old model
new = madrid.get_clearsky(times)  # pvlib irradiance
plt.plot(old, 'r-')               # compare them.
plt.plot(old/6.0, 'y-')           # old seems 6 times more.. I do not know why
plt.plot(new['ghi'].values, 'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using get_clearsky. I do not understand whether the values in ghi must be multiplied by the cosine of the zenith angle too, or not. Anyway, they are smaller by a factor of 6. What I'd like to have at the end is the power and current output from the panel (DC) without any inverter. We are not really interested in modelling it exactly, but we would at least like a reasonable curve. We are able to capture the amperes produced by the panel, and we want to compare the measurements taken with the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks
Sorry Will, I do not care much about my previous model since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 1, 1, 00, 00),
                      end=dt.datetime(2015, 1, 1, 23, 59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now that I know the irradiance on the POA, I want to compute the output in amperes. Is it just
(poa*PANEL_EFFICIENCY*AREA) / VOLT_OUTPUT ?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing that it's some kind of monthly data since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
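If the simple efficiency-based estimate from your question is accurate enough for your purposes, a minimal sketch of that calculation on the POA series would be the following (PANEL_EFFICIENCY, AREA and VMP are constants taken from the panel spec in the question; this is not a full pvlib module model):
PANEL_EFFICIENCY = 0.128     # 12.8% max efficiency, from the question
AREA = 0.225 * 0.155         # module area in m^2 (225 mm x 155 mm)
VMP = 5.82                   # voltage at the maximum power point, in V

# poa is the 'poa_global' series (W/m^2) from irradiance.total_irrad above
dc_power = poa * PANEL_EFFICIENCY * AREA   # rough DC power in W
dc_current = dc_power / VMP                # rough DC current in A, assuming the panel operates at Vmp
For anything more accurate you'd want pvlib's module-level modelling instead of a constant efficiency, but this reproduces the back-of-the-envelope model you described.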

Cosine similarity between any two sentences is giving 0.99 always

I downloaded the stackoverflow dump (which is a 10GB file) and ran word2vec on the dump in order to get vector representations for programming terms (I require it for a project that I'm doing). Following is the code:
from gensim.models import Word2Vec
from xml.dom.minidom import parse, parseString

titles, bodies = [], []
xmldoc = parse('test.xml')  # this is the dump
reflist = xmldoc.getElementsByTagName('row')
for i in range(len(reflist)):
    bitref = reflist[i]
    if 'Title' in bitref.attributes.keys():
        title = bitref.attributes['Title'].value
        titles.append([i for i in title.split()])
    if 'Body' in bitref.attributes.keys():
        body = bitref.attributes['Body'].value
        bodies.append([i for i in body.split()])

dimension = 8
sentences = titles + bodies
model = Word2Vec(sentences, size=dimension, iter=100)
model.save('snippet_1.model')
Now, in order to calculate the cosine similarity between a pair of sentences, I do the following:
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model = Word2Vec.load('snippet_1.model')
dimension = 8

snippet = 'some text'
snippet_vector = np.zeros((1, dimension))
for word in snippet:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        snippet_vector = np.add(snippet_vector, vecvalue)

link_text = 'some other text'
link_vector = np.zeros((1, dimension))
for word in link_text:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        link_vector = np.add(link_vector, vecvalue)

print(cosine_similarity(snippet_vector, link_vector))
I am summing the word embeddings of each word in a sentence to get a representation for the sentence as a whole. I do this for both sentences and then calculate the cosine similarity between them.
Now, the problem is I'm getting cosine similarity around 0.99 for any pair of sentences that I give. Is there anything that I'm doing wrong? Any suggestions for a better approach?
Are you checking that your snippet_vector and link_vector are meaningful vectors before calculating their cosine-similarity?
I suspect they're just zero-vectors, or similarly non-diverse, since your for word in snippet: and for word in link_text: loops aren't tokenizing the text. So they'll just loop over the characters in each string, which either won't be present in your model as words, or the few available may match exactly between your texts. (Even with tokenization, the texts' summed vectors would only differ by the value of a vector for the one different word, 'other'.)
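If that is the problem, a minimal sketch of the fix is to split each string into word tokens before summing vectors (text_vector is a helper name introduced here, not something from your code; it uses the same model.wv.vocab lookups you already have):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def text_vector(text, model, dimension=8):
    # sum the vectors of whitespace-separated tokens that are in the vocabulary
    vec = np.zeros((1, dimension))
    for word in text.split():
        if word in model.wv.vocab:
            vec = np.add(vec, model[word].reshape(1, dimension))
    return vec

print(cosine_similarity(text_vector(snippet, model), text_vector(link_text, model)))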

How is TF calculated in Sklearn

I have been experimenting with sklearn's Tfidfvectorizer.
I am only concerned with TF, and not IDF, so my settings have use_idf=False.
Complete settings are:
vectorizer = TfidfVectorizer(max_df=0.5, max_features= n_features,
ngram_range=(1,3), use_idf=False)
I have been trying to replicate the output of .fit_transform but haven't managed to do it so far and was hoping someone could explain the calculations for me.
My toy example is:
document = ["one two three one four five",
"two six eight ten two"]
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
n_features = 5
vectorizer = TfidfVectorizer(max_df=0.5, max_features= n_features,
ngram_range=(1,3), use_idf=False)
X = vectorizer.fit_transform(document)
count = CountVectorizer(max_df=0.5, max_features= n_features,
ngram_range=(1,3))
countMat = count.fit_transform(document)
I have assumed the counts from the CountVectorizer will be the same as the counts used in the TfidfVectorizer, so I am trying to change the countMat object to match X.
I had missed a line from the documentation which says
Each row is normalized to have unit euclidean norm
So to answer my own question - the answer is:
import numpy as np

for i in xrange(len(countMat.toarray())):
    row = countMat.toarray()[i]
    print row / np.sqrt(np.sum(row**2))   # each row divided by its euclidean (l2) norm
Although I am sure there is a more elegant way to code the result.
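For what it's worth, a more compact sketch that should give the same result (assuming the l2 row normalization described above, which is TfidfVectorizer's default norm) uses sklearn's own normalize:
import numpy as np
from sklearn.preprocessing import normalize

normalized = normalize(countMat.astype(float), norm='l2')
print np.allclose(normalized.toarray(), X.toarray())   # expected: True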

How to iterate a python list and compare items in a string or another list

Following my earlier question, I have tried to write code that returns a sentence if a search term from a certain list occurs in it, as follows.
import re
from nltk import tokenize
from nltk.tokenize import sent_tokenize
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\
Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\
at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \
the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \
estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \
conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\
Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \
examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \
(half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \
(95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users."
    words = txt
    corpus = " ".join(words).lower()
    sentences1 = sent_tokenize(corpus)
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if [item in List1] in word_tokenize(j)]
    for i in a:
        print i,'\n','\n'

foo()
The problem is that Python IDLE does not print anything. What could I have done wrong? It just runs the code and I get this:
>
>
Your question isn't very clear to me, so please correct me if I'm getting this wrong. Are you trying to match the list of keywords (in List1) against the text (in txt)? That is,
for each keyword in List1, do a match against every sentence in txt, and print the sentence if it matches?
Instead of writing a complicated regular expression to solve your problem, I have broken it down into 2 parts.
First I break the whole lot of text into a list of sentences. Then I write a simple regular expression to go through every sentence. The trouble with this approach is that it is not very efficient, but hey, it solves your problem.
Hope this small chunk of code can help guide you to the real solution.
import re

def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "blah blah blah - truncated"

    matches = []
    sentences = re.split(r'\.', txt)
    keyword = List1[0]
    pattern = re.compile(keyword)
    for sentence in sentences:
        if re.search(pattern, sentence):
            matches.append(sentence)

    print("Sentence matching the word (" + keyword + "):")
    for match in matches:
        print (match)
--------- Generate a random number ---------
from random import randint
List1 = ['risk','cancer','ocp','hormone','OCP',]
print(randint(0, len(List1) - 1)) # gives a random index - use it to index into List1