Use Python to extract three sentences based on word finding - regex

I'm working on a text-mining use case in Python. These are the sentences of interest:
As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased. Stores are primarily located in shopping malls and other shopping centers.
How can I extract the sentence with the keyword "China"? I also need the surrounding sentences, ideally at least two sentences before and after.
I've tried the below, as was answered here:
import nltk
from nltk.tokenize import word_tokenize
sents = nltk.sent_tokenize(text)
my_sentences = [sent for sent in sents if 'China' in word_tokenize(sent)]
Please help!

TL;DR
Use sent_tokenize, keep track of the index of the sentence containing the focus word, and take a window of sentences around that index to get the desired result.
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

text = """As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty and caused volatility in foreign currency exchange rates. Stores are primarily located in shopping malls and other shopping centers, certain of which have been experiencing declines in customer traffic."""

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text)
                       if 'China' in sent or 'china' in sent]

window = 2 # If you want 2 sentences before and after.
for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window, len(tokenized_text))
    # Note: the slice end is exclusive, so use idx + window + 1 if you
    # also want `window` full sentences after the matching sentence.
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
Another example, pip install wikipedia first:
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

import wikipedia

text = wikipedia.page("Winnie The Pooh").content

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text)
                       if 'China' in sent or 'china' in sent]

window = 2 # If you want 2 sentences before and after.
for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window, len(tokenized_text))  # exclusive slice end, as noted above
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
    print()
[out]:
Ashdown Forest in England where the Pooh stories are set is a popular
tourist attraction, and includes the wooden Pooh Bridge where Pooh and
Piglet invented Poohsticks. The Oxford University Winnie the Pooh
Society was founded by undergraduates in 1982. == Censorship in China
== In the People's Republic of China, images of Pooh were censored in mid-2017 from social media websites, when internet memes comparing
Chinese president Xi Jinping to Pooh became popular. The 2018 film
Christopher Robin was also denied a Chinese release.


NLTK regex parser's output has changed. Unable to parse phrases like verb followed by a noun

I have written a piece of code to parse the action items from a troubleshooting doc.
I want to extract phrases that start with a verb and end with a noun.
It was working as expected a month ago, but when I run it now against the same input, it's missing some action items that it was catching previously.
I haven't changed the code. Has something changed on the nltk or punkt side that may be affecting my results?
Please help me figure out what needs to be changed to make it run as before.
import re
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize

# One-time downloads
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')

custom_sent_tokenizer = PunktSentenceTokenizer()

def process_content(x):
    try:
        #sent_tag = []
        act_item = []
        for i in x:
            print('tokenized = ', i)
            words = nltk.word_tokenize(i)
            print(words)
            tagged = nltk.pos_tag(words)
            print('tagged = ', tagged)
            #sent_tag.append(tagged)
            #print('sent= ', sent_tag)

            # chunking
            chunkGram = r"""ActionItems: {<VB.>+<JJ.|CD|VB.|,|CC|NN.|IN|DT>*<NN|NN.>+}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'ActionItems'):
                print('Filtered chunks= ', subtree)
                ActionItems = ' '.join([w for w, t in subtree.leaves()])
                act_item.append(ActionItems)
            chunked.draw()
        return act_item
    except Exception as e:
        #print(str(e))
        return str(e)

res = 'replaced rev 6 aeb with a rev 7 aeb. configured new board and regained activity. tuned, flooded and calibrated camera. scanned fi rst patient with no issues. made new backups. replaced aeb board and completed setup. however, det 2 st ill not showing any counts. performed all necessary tests and the y passed . worked with tech support to try and resolve the issue. we decided to order another board due to lower rev received. camera is st ill down.'
tokenized = custom_sent_tokenizer.tokenize(res)
tag = process_content(tokenized)
With the input shared in the code, the following action items were previously being parsed:
['replaced rev 6 aeb', 'configured new board', 'regained activity', 'tuned , flooded and calibrated camera', 'scanned fi rst patient', 'made new backups', 'replaced aeb board', 'completed setup', 'det 2 st ill', 'showing any counts', 'performed all necessary tests and the y', 'worked with tech support']
But now, only these are coming up:
['regained activity', 'tuned , flooded and calibrated camera', 'completed setup', 'det 2 st ill', 'showing any counts']
I finally resolved this by replacing JJ. with JJ|JJR|JJS
So my chunk grammar is now defined as:
chunkGram = r"""ActionItems: {<VB.>+<JJ|JJR|JJS|CD|NN.|CC|IN|VB.|,|DT>*<NN|NN.>+}"""
I don't understand this change in behavior.
The dot (.) seemed like a really good way of covering all the variants of a POS tag.
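A hedged guess at what changed: in a RegexpParser tag pattern, the dot is an ordinary one-character regex wildcard, so <JJ.> matches JJR or JJS but never a bare JJ (and likewise <VB.> never matches a bare VB). If a newer averaged_perceptron_tagger model now assigns the plain tag where it previously assigned a variant, chunks silently stop matching the old grammar. A minimal sketch to compare the two grammars from this post on the same tagging (the example sentence is taken from the input above):

import nltk  # requires the punkt and averaged_perceptron_tagger resources

tagged = nltk.pos_tag(nltk.word_tokenize('replaced rev 6 aeb with a rev 7 aeb'))
print(tagged)

# "." matches exactly one character, so <JJ.> covers JJR/JJS but not JJ itself.
old_grammar = r"""ActionItems: {<VB.>+<JJ.|CD|VB.|,|CC|NN.|IN|DT>*<NN|NN.>+}"""
new_grammar = r"""ActionItems: {<VB.>+<JJ|JJR|JJS|CD|NN.|CC|IN|VB.|,|DT>*<NN|NN.>+}"""

print(nltk.RegexpParser(old_grammar).parse(tagged))
print(nltk.RegexpParser(new_grammar).parse(tagged))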

Why is my regex displaying in alphabetical order?

For my assignment, I am trying to scrape information off the following website: https://www.blueroomcinebar.com/movies/now-showing/.
My code needs to find movie names, times and posters. Both the movie times and the posters appear in the lists I have created in the order they occur in the HTML; however, the names seem to come out in alphabetical order.
We are not allowed to use BeautifulSoup.
This is my current code for scraping movies:
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen
movies_name = []
movies_times = []
movies_image = []
movies_list = []
movies_page = urlopen("https://www.blueroomcinebar.com/movies/now-showing/").read().decode('utf-8')
#Add movies to Movies at Blue Room Screen
find_movie_names = findall(r'<h1>(.*?)</h1>', movies_page)
find_movie_times = findall(r'<p>([0-9]{1,2}:[0-9]{2} AM|PM)</p>', movies_page)
find_movie_image = findall(r'<div class="poster" style="background-image: url\((.*?)\)">', movies_page)
print(find_movie_names)
#Add movies to arrays
for movie in find_movie_names:
    movies_name.append(movie)
for movie in find_movie_times:
    movies_times.append(movie)
for movie in find_movie_image:
    movies_image.append(movie)

print(movies_name)
print(movies_image)

for movie in range(len(movies_name)):
    movies_list.append("{};{};{}".format(movies_name[movie], movies_times[movie], movies_image[movie - 1]))
Currently, the names are in the list in the order of
['Aladdin', 'Avengers: Endgame', 'Chandigarh Amritsar Chandigarh', 'John Wick - Parabellum', 'Long Shot', 'Pokemon Detective Pikachu', 'Poms', 'The Hustle', 'Top End Wedding']
They should be in the order:
['Avengers: Endgame', 'Long Shot', 'Pokemon Detective Pikachu', 'The Hustle', 'John Wick - Parabellum', 'Aladdin', 'Chandigarh Amritsar Chandigarh']
N.B.
There may be a movie that comes up a second time with the prefix OCAP. I'm not 100% sure why it has that, but it seems to be some kind of special screening that rotates through different movies each day.
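For what it's worth, re.findall and re.finditer both return matches in the order they occur in the string, so the regex itself cannot reorder anything. A quick way to check where each title comes from is to print the character offset of every <h1> match; if the page contains a second, alphabetically sorted block of titles (for example a hidden list rendered by the template), the offsets will show it. A rough sketch, reusing the same URL and pattern as above:

from re import finditer
from urllib.request import urlopen

movies_page = urlopen("https://www.blueroomcinebar.com/movies/now-showing/").read().decode('utf-8')

# finditer (like findall) yields matches in the order they appear in the
# downloaded HTML, so each offset shows where that <h1> really sits.
for m in finditer(r'<h1>(.*?)</h1>', movies_page):
    print(m.start(), m.group(1))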

Spacy to Conll format without using Spacy's sentence splitter

This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1, # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_, # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_ # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?
A Doc in spaCy is iterable, and the documentation states that it iterates over Token objects:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Doc for each of your split sentences, and then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # use doc[0].i here (there is no sent in this function); it is 0 for a standalone Doc
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1, # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_, # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_ # Relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
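For instance, assuming a pipeline loaded as in the question and a list of sentences produced by your own splitter (the list below is only placeholder data), the driver loop could look like this:

import spacy

nlp = spacy.load('en')  # same model shorthand as in the question

# Placeholder for whatever your own sentence splitter returns.
my_sentences = ["Bob bought the pizza to Alice.", "She thanked him."]

for sentence in my_sentences:
    printConll(sentence)  # the function defined above
    print()               # CoNLL separates sentences with a blank line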
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as in the post) is to add the flexibility to supply one's own sentence boundary detection rules. Sentence boundary detection is solved in conjunction with dependency parsing in spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. However, there is a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect way), but since such a component exists, you can create a custom one using your own rules, and it will affect sentence boundaries for sure.
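As a rough sketch of that idea (assuming a spaCy 2.x pipeline, where nlp.add_pipe accepts a plain function, and using en_core_web_sm as a stand-in model name), a custom boundary component registered before the parser could look like this:

import spacy

nlp = spacy.load('en_core_web_sm')  # stand-in model name

def custom_boundaries(doc):
    # Example rule: start a new sentence after every semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# Registered before the parser, so the parser respects these boundaries
# instead of predicting its own.
nlp.add_pipe(custom_boundaries, before='parser')

doc = nlp(u'Bob bought the pizza; Alice ate it')
for sent in doc.sents:
    print(sent.text)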

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation on my data, and I have generated the unigrams and their respective probabilities (they are normalized, as the total probability over the data sums to 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The total probabilities (second column) sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused as to how to rectify this. The sample code I have is from the nltk documentation, and I don't know what to do now. Please help me with what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
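For a test set W = w_1 w_2 ... w_N, the unigram perplexity is
PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = (1 / (P(w_1) * P(w_2) * ... * P(w_N)))^(1/N),
that is, the N-th root of the inverse probability of the test set, which is exactly what the code below computes.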
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk

# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is more desirable than one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the resulting perplexity is also smaller.
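To see where these numbers come from: testset1 is a single in-vocabulary token, so N = 1 and the perplexity is simply 1/P("Monty") under the model above, roughly 49.09. testset2 consists of three out-of-vocabulary tokens, each of which falls back to the defaultdict's 0.01, so the perplexity is (1/0.01^3)^(1/3) = 1/0.01 = 100; the 99.99999999999997 in the output is just floating-point rounding.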
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see it was already answered ...

Missing Tweets from Twitter API (using Tweepy)?

I have been collecting tweets for the past week related to "lung cancer". Yesterday, I figured I needed to start collecting more fields, so I added some fields and started re-collecting the same period of tweets related to "lung cancer" from last week. The problem is that the first time, I collected ~2000 tweets related to lung cancer from 18 Sept 2014, but last night it only gave ~300 tweets; when I looked at the timestamps of this new set, it only covers roughly 23:29 to 23:59 on 18 Sept 2014. A large chunk of data is obviously missing. I don't think it's something with my code (below); I have tested various ways, including deleting most of the fields to be collected, and the time range of the data is still cut off prematurely.
Is this a known issue with the Twitter API (when collecting the last 7 days' data)? If so, it would be pretty bad for anyone trying to do serious research. Or is it something in my code that caused this (note: it runs perfectly fine for other previous/subsequent dates)?
import tweepy
import time
import csv

ckey = ""
csecret = ""
atoken = ""
asecret = ""

OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
              'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)

# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search

# Common parameters: Changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'

############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'

# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before next request (because twitter request wait time is 15min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
                           since=startSince, until=endUntil).items(999999999): # changeable here
    try:
        '''
        print "Name:", tweet.author.name.encode('utf8')
        print "Screen-name:", tweet.author.screen_name.encode('utf8')
        print "Tweet created:", tweet.created_at'''
        placeHolder = []
        placeHolder.append(tweet.author.name.encode('utf8'))
        placeHolder.append(tweet.author.screen_name.encode('utf8'))
        placeHolder.append(tweet.created_at)
        prefix = 'TweetData_lungCancer'
        wholeFileName = prefix + suffix
        with open(wholeFileName, "ab") as f: # changeable here
            writeFile = csv.writer(f)
            writeFile.writerow(placeHolder)
        counter2 += 1
        if counter2 == 4000:
            time.sleep(60*20) # wait for 20 min every time 4,000 tweets are extracted
            counter2 = 0
        continue
    except tweepy.TweepError:
        time.sleep(60*20)
        continue
    except IOError:
        time.sleep(60*2.5)
        continue
    except StopIteration:
        break
Update:
I have since tried running the same Python scripts on a different computer (which is faster and more powerful than my home laptop). The latter produced the expected number of tweets. I don't know why this happens, as my home laptop works fine for many other programs, but I think we can rest the case and rule out potential issues with the scripts or the Twitter API.
If you want to collect more data, I would highly recommend the streaming API that Tweepy has to offer. It has a much higher rate limit; in fact, I was able to collect 500,000 tweets in just one day.
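As a rough sketch of that approach (assuming a Tweepy 3.x-era install where tweepy.StreamListener and tweepy.Stream exist, and reusing the credential placeholders from the question; the listener class name and tracked terms are just examples):

import tweepy

ckey = ""      # same credential placeholders as in the question
csecret = ""
atoken = ""
asecret = ""

class LungCancerListener(tweepy.StreamListener):
    def on_status(self, status):
        # Called for every live tweet that matches the tracked terms.
        print(status.created_at, status.text.encode('utf8'))

    def on_error(self, status_code):
        # 420 means the stream is being rate limited; returning False disconnects.
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

stream = tweepy.Stream(auth, LungCancerListener())
stream.filter(track=['lung cancer', 'lungcancer', 'lung tumor'])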
Also, your rate-limit handling is not very robust: you don't know for sure that Twitter will let you fetch 4000 tweets per window. From experience, I found that the more often you hit the rate limit, the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit. Alternatively, you can check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also, your end date is incorrect: until is exclusive, so the end date should be '2014-09-21', one day later than the last day you want to cover (or one day after today's date if you want everything up to now).