Semantic Intelligence with Python - python-2.7

I am trying to classify words into a score. The scoring for now is very simple: I just want to classify words as -1, 0, or 1 and sum the scores at the end. The classification is based on the emotional connotation of the word, so positive words like "great, awesome, excellent" would receive a score of +1, negative words like "bad, ill, not" would receive a score of -1, and neutral words would receive 0. For example, text = "I feel bad" would be pushed through a table, DB, or library in which words were pre-classified, and would be summed as "I (0) + feel (0) + bad (-1) = -1".
As an example, I have gone ahead and stripped a website of its HTML using the BeautifulSoup and urllib libraries (code below):
import urllib
from bs4 import BeautifulSoup

url = "http://www.greenovergrey.com/living-walls/what-are-living-walls.php"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
Output:
What are Living Walls? Definition of Green Wall and Vertical Garden
GREEN OVER GREY
Overview
/
What are living walls
/
Our green wall system vs. modular boxes
What are living walls
Living walls or green walls are self sufficient vertical gardens that are attached to the exterior or interior of a building. They differ from green façades (e.g. ivy walls) in that the plants root in a structural support which is fastened to the wall itself. The plants receive water and nutrients from within the vertical support instead of from the ground.
The Green over Grey™ living wall system is different than others on the market today. It closely mimics nature and allows plants to grow to their full potential, without limitations. It is also by far the lightest.
Diversity is the key and by utilizing hundreds of different types of plants we create striking patterns and unique designs. We achieve this by utilizing the multitude of colours, textures and sizes that nature provides. Our system accommodates flowering perennials, beautiful foliage plants, ground covers and even allows for bushes, shrubs, and small trees!
Living walls are also referred to as green walls, vertical gardens or in French, mur végétal. The French botanist and artist Patrick Blanc was a pioneer by creating
the first vertical garden over 30 years ago.
Our system
consists of a frame, waterproof panels, an automatic irrigation system, special materials, lights when needed and of course plants. The frame is built in front of a pre existing wall and attached at various points; there is no damage done to the building. Waterproof panels are mounted to the frame; these are rigid and provide structural support. There is a layer of air between the building and the panels which enables the building to breath. This adds beneficial insulating properties and acts like rain-screening to protect the building envelop.
Our green walls are low maintenance thanks to an automatic irrigation system
My question is: what would be the best way to run this string through a table or library of pre-classified words, and does anyone know of any existing libraries of words pre-classified by emotion? How can I create a small table or DB to test with really quickly?
Thank you all in advance,
Rusty

If you need such a table, you can find a list of lexicons here: http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/
You could load that list into a dictionary and perform the algorithm you describe. However, if you are looking for quick results, I recommend the textblob library. It's very easy to use, has a lot of features, and is a very nice place to start for a project like this.
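A minimal sketch of the dictionary approach, assuming a hypothetical lexicon file lexicon.txt with one tab-separated "word score" pair per line (adapt the parsing to whichever lexicon you actually download):
# Load the pre-classified words into a dict of word -> score.
scores = {}
with open('lexicon.txt') as f:  # hypothetical file name and format
    for line in f:
        word, score = line.strip().split('\t')
        scores[word] = int(score)

def score_text(text):
    # Words missing from the lexicon default to 0 (neutral).
    return sum(scores.get(word, 0) for word in text.lower().split())

print score_text("I feel bad")  # -1, assuming 'bad' is listed as -1

# With textblob instead, you get a float polarity in [-1.0, 1.0]:
from textblob import TextBlob
print TextBlob("I feel bad").sentiment.polarity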

I don't know how to mark this question as a duplicate, but a quick Google search turned this up.
The first answer looks promising. I went to the link and it just requires some information to access the file. I assume it would be in a format that is straightforward to parse.

Related

Osmnx: Removing sidewalk from one side of the street

I am trying to plot a simplified map for pedestrians on my university campus using the OSMnx library with Python 2.7.
So far, I have this image of the plot, and as you can see, it is plotting sidewalks on both sides of the street. I was planning on removing the sidewalks from one side.
However, I'm not sure what logic to approach this with.
So far, I have created a custom filter to plot only footways
import osmnx as ox

# top, bottom, right, left are the campus bounding-box coordinates
custom_walk = ('["area"!~"yes"]["highway"="footway"]["foot"!~"no"]["service"!~"private"]{}').format(ox.settings.default_access)
G = ox.graph_from_bbox(top, bottom, right, left, custom_filter=custom_walk)
G_projected = ox.project_graph(G)  # needed by plot_graph below
ox.plot_graph(G_projected, save=True, filename="maps", show=False, node_size=5, node_color='#FFFFFF', node_edgecolor='#FFFFFF', edge_color='#cccccc', bgcolor="#000000", node_zorder=3, dpi=300, edge_linewidth=5, use_geom=True)
ox.simplify.clean_intersections(G, tolerance=100)  # returns intersection centroids; does not modify G
What I am trying to understand is: does OSMnx have relations for footways that indicate their position relative to the nearest street (e.g., whether they are on the east or the north side), so that I can keep a standard for which sidewalks are visible? Or is there simpler logic for this?
Thanks!
The answer is no, OSMnx doesn't know where the sidewalk is in relation to the nearest street. One option might be to just identify the sidewalk edges you don't want, make a list of their OSM IDs, then remove them from the graph.
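A minimal sketch of that removal step, continuing from the graph G built in the question (the OSM IDs below are placeholders for the ones you identify):
unwanted_osmids = {123456789, 987654321}  # placeholder OSM way IDs

# An edge's 'osmid' attribute can be a single ID or a list of IDs.
def matches(osmid):
    if isinstance(osmid, list):
        return any(o in unwanted_osmids for o in osmid)
    return osmid in unwanted_osmids

edges_to_remove = [(u, v, k) for u, v, k, data in G.edges(keys=True, data=True)
                   if matches(data.get('osmid'))]
G.remove_edges_from(edges_to_remove)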

Gensim: Word2Vec Recommender accuracy Improvement

I am trying to implement something similar to https://arxiv.org/pdf/1603.04259.pdf using the awesome gensim library; however, I am having trouble improving the quality of results when I compare to Collaborative Filtering.
I have two models, one built on Apache Spark and the other using gensim Word2Vec on the GroupLens 20 million ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com
and I am running the gensim model locally. However, when I compare the results I see superior results with the CF model 9 out of 10 times (as in the example below, where the results are more similar to the searched movie, with an affinity towards Marvel movies).
e.g.: if I search for the movie Thor I get the results below
Gensim
Captain America: The First Avenger (2011)
X-Men: First Class (2011)
Rise of the Planet of the Apes (2011)
Iron Man 2 (2010)
X-Men Origins: Wolverine (2009)
Green Lantern (2011)
Super 8 (2011)
Tron: Legacy (2010)
Transformers: Dark of the Moon (2011)
CF
Captain America: The First Avenger
Iron Man 2
Thor: The Dark World
Iron Man
The Avengers
X-Men: First Class
Iron Man 3
Star Trek
Captain America: The Winter Soldier
Below is my model configuration; so far I have tried playing with the window, min_count and size parameters, but without much improvement.
import gensim

word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)
word2vec_model.build_vocab(movie_list)  # vocabulary must be built before train() when no corpus is passed to the constructor
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
Any help in this regard is appreciated.
You don't mention which Collaborative Filtering algorithm you're trying, but maybe it's just better than Word2Vec for this purpose. (Word2Vec is not doing awfully; why do you expect it to be better?)
Alternate meta-parameters might do better.
For example, window is the max distance between tokens that might affect each other, but the effective window used for each target token during training is randomly chosen from 1 to window, as a way to give nearby tokens more weight. Thus when some training texts are much larger than the window (as in your example row), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant (such as Doc2Vec in pure PV-DBOW dm=0 mode, with every token used as a doc-tag).
Depending on how much data you have, your size might be too large or too small. A different min_count, negative count, greater 'iter'/'epochs', or sample level might work much better. (And perhaps even the things you've already tinkered with would only help after other changes are in place.)
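For instance, a minimal sketch of that PV-DBOW variant, assuming movie_list is the list of per-user movie sequences from the question (the parameter values are illustrative, not tuned):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Every movie ID doubles as a doc-tag, so co-occurrence within a user's
# list matters but ordering does not.
docs = [TaggedDocument(words=movies, tags=movies) for movies in movie_list]
model = Doc2Vec(docs, dm=0, size=100, min_count=50, seed=1, iter=10)

# Query similar movies via the doc-tag vectors.
print model.docvecs.most_similar('Thor (2011)')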

word2vec guessing word embeddings

Can word2vec be used for guessing words from just their context?
Having trained the model on a large dataset (e.g., Google News), how can I use word2vec to predict a missing word given only its context? E.g., for the input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri.", the output should be Kasparov, or maybe Carlsen.
I've seen only the similarity APIs, but I can't work out how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
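For example, a rough sketch on a toy corpus (the sentences are placeholders; predict_output_word() only works for models trained with negative sampling, which is the gensim default):
from gensim.models import Word2Vec

sentences = [
    ['kasparov', 'dominated', 'chess', 'for', 'years'],
    ['carlsen', 'dominated', 'chess', 'for', 'years'],
    # ... many more training sentences ...
]
model = Word2Vec(sentences, size=50, min_count=1, seed=1)

# Ranked (word, probability) guesses given the context words.
print model.predict_output_word(['dominated', 'chess', 'for', 'years'], topn=5)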
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories well enough to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will all be nearer to each other than to other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping with such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision", but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Speech Recognition for small vocabulary (about 20 words)

I am currently working on a project for my university. The task is to write a speech recognition system that is going to run on a phone in the background, waiting for a few commands (like "call 0 123 ...").
It's a 2-month project, so it does not have to be very accurate. The amount of acceptable noise can be small, and words will be separated by moments of silence.
I am currently at the point of loading a sample word encoded in RAW 16-bit PCM format, splitting it into chunks (about 50 per second), and running an FFT on each chunk in order to get the frequency spectrum.
Things to solve are:
1) going through the longer recording and splitting it into words,
2) finding the best match for the word.
1) I was thinking about just checking chunk after chunk, and if I encounter a few chunks that have higher amplitudes at human voice frequencies, assume that a word has started. In any case, I am looking for resources that may help with this.
2) This one seems a little bit tougher. Is it necessary to use HMMs for a system like this, or are there simpler methods given that the vocabulary is so small (20 words)?
Edit:
The point of the project is writing the system on my own so I cannot use ready libraries like Sphinx or HTK.
Regards,
Karol
If anybody has the same question in the future, look for two main keywords:
MFCC - Mel-frequency cepstral coefficients, to calculate a series of coefficients for each word template
DTW - dynamic time warping, to match a captured word against the templates
A good enough description of DTW can be found on Wikipedia.
This approach was good enough to achieve around 80% accuracy on a 20-word dictionary and to give a good demo during the class.
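A minimal sketch of the DTW matching step, assuming each word has already been converted into a sequence of MFCC frame vectors stored as NumPy arrays (the templates dictionary and captured_mfcc are placeholders):
import numpy as np

def dtw_distance(a, b):
    # a and b are arrays of shape [n_frames, n_coefficients]; returns
    # the cumulative cost of the best frame alignment between them.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],     # insertion
                                 cost[i, j - 1],     # deletion
                                 cost[i - 1, j - 1]) # match
    return cost[n, m]

# templates maps each of the ~20 vocabulary words to its MFCC sequence;
# the captured word is classified as the template with the smallest distance.
best_word = min(templates, key=lambda w: dtw_distance(captured_mfcc, templates[w]))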
To recognize commands on a phone you can use PocketSphinx. A tutorial covering speech recognition applications on Android is available on the CMUSphinx website.

How to find the most frequent words before and after a given word in a given text in Python?

I have a big text and I am trying to get the most frequent word occurrences before and after a given word in this text.
For example:
I want to know the most frequent word occurring after "lake". Ideally I would get something like: (word 1, # occurrences), (word 2, # occurrences), ...
The same for the words which come before...
I tried the NLTK bigram finder, but it seems it only finds the most common n-grams... Is it possible to somehow fix one of the words and find the most frequent n-grams based on the fixed word?
Thanks for any help!!
Are you looking for something like this?
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()
from nltk import bigrams
bgs = bigrams(text)
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)
from collections import Counter
c = Counter(map(lambda item: item[1], lake_bgs))
print c.most_common()
Which outputs:
[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]
Note that you might want to use itertools' ifilter, imap, etc. if you have a very long text.
Edit: Here is the code for before and after 'lake'.
from nltk import trigrams
tgs = trigrams(text)
lake_tgs = filter(lambda item: item[1] == 'lake', tgs)
from collections import Counter
before_lake = map(lambda item: item[0], lake_tgs)
after_lake = map(lambda item: item[2], lake_tgs)
c = Counter(before_lake + after_lake)
print c.most_common()
Note that this can be done using bigrams as well :)
Just to add to @Ohad's answer, here's an n-gram implementation in NLTK with some scalability.
#-*- coding: utf8 -*-
import string
from nltk import ngrams
from itertools import chain
from collections import Counter
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
"""
def ngrammer(txt, n):
    # Removes punctuation and numbers.
    sentences = "".join([i for i in txt if i not in string.punctuation and not i.isdigit()]).split('\n')
    return list(chain(*[ngrams(i.split(), n) for i in sentences]))

def before_after(ngs, word):
    # Keep only the n-grams whose second token is the focus word, then
    # collect the token just before (index 0) and just after (index 2) it.
    word_grams = filter(lambda item: item[1] == word, ngs)
    before = map(lambda item: item[0], word_grams)
    after = map(lambda item: item[2], word_grams)
    return before, after
bgs = ngrammer(text,2) # bigrams
tgs = ngrammer(text,3) # trigrams
xgs = ngrammer(text,10) # 10grams
focus = 'lake'
bf, af = before_after(xgs, focus)
c = Counter(bf+af)
# Most common word before and after 'lake' from the 10grams.
print c.most_common()[0]