I use Google Cloud Speech Transcription as follows:
from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]
operation = video_client.annotate_video(gs_video_path, features=features)
result = operation.result(timeout=3600)
I then present each transcript and store it as a Django object backed by PostgreSQL, as follows:
transcriptions = result.annotation_results[0].speech_transcriptions

for transcription in transcriptions:
    best_alternative = transcription.alternatives[0]
    confidence = best_alternative.confidence
    transcript = best_alternative.transcript
    if SpeechTranscript.objects.filter(text=transcript).count() == 0:
        SpeechTranscript.objects.create(text=transcript,
                                        confidence=confidence)
        print(f"Adding -> {confidence:4.10%} | {transcript.strip()}")
For instance, the following is the text I receive from a sample video:
94.9425220490% | I refuse to where is it short sleeve dress shirt. I'm just not going there the president of the United States is a visit to Walter Reed hospital in mid-july format was the combination of weeks of cajoling by trump staff and allies to get the presents for both public health and political perspective wearing a mask to protect against the spread of covid-19 reported in advance of that watery trip and I quote one presidential aide to the president to set an example for a supporters by wearing a mask and the visit.
94.3865835667% | Mask wearing is because well science our best way to slow the spread of the coronavirus. Yes trump or Matthew or 3 but if you know what he said while doing sell it still anybody's guess about what can you really think about NASCAR here is what probably have a mass give you probably have a hospital especially and that particular setting were you talking to a lot of people I think it's but I do believe it. Have a a time and a place very special line trump saying I've never been against masks but I do believe they have a time and a place that isn't exactly a ringing endorsement for mask wearing.
94.8513686657% | Republican skip this isn't it up to four men over the perfumer's that wine about time and place should be a blinking red warning light for people who think debate over whether last for you for next coronavirus. They are is finally behind us time in a place lined everything you need to know about weird Trump is like headed next time he'll get watery because it was a hospital and will continue to express not so scepticism to wear masks in public house new CDC guidelines recommending that mask to be worn inside and one social this thing is it possible outside he sent this?
92.9862976074% | He wearing a face mask as agreed presidents prime minister's dictators Kings Queens and somehow. I don't see it for myself literally main door he responded this way back backstage, but they said you didn't need it trump went to Michigan to this later and he appeared in which personality approaching Mark former vice president Joe Biden
94.6677267551% | In his microwave fighting for wearing a mask and he walked onto the stage where it is massive mask there's nobody understands and there's any takes it off you like to have it hanging off you. I think it makes them feel good frankly if you want to know the truth who's got the largest basket together. Seen it because trump thinks that maths make him and people generally I guess what a week or something is resistant wearing one in public from 1 today which has had a correlation between the erosion of the public's confidence and trump have the corner coronavirus and his number is SE6 a second term in the 67.
94.9921131134% | The coronavirus pandemic in the heels of national and swings they both lots of them that show trump slipping further and further behind former vice president Joe Biden when it comes to General Election good policy would seem to make for good politics at all virtually every infectious disease expert believes that wearing masks in public is our best to contain the spread of coronavirus until a vaccine would do well to listen to buy on this one a mare is the point we make episode every Tuesday and Thursday make sure to check them all out.
What is the expected size of a transcript generated within the speech transcription results? What determines the size of each transcript? What are the maximum and minimum character lengths? How should I size my SQL table column in order to be prepared for the expected transcript length?
As I mentioned in the comments, the Video Intelligence transcripts are split into chunks covering roughly 50-60 seconds of the video each.
I have created a Public Issue Tracker case (link) so the product team can clarify this information in the documentation. Although I do not have an ETA for this request, I encourage you to follow the case's thread.
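Since neither a maximum nor a minimum transcript length is documented, one way to avoid guessing a column size on the Django/PostgreSQL side is to use a TextField, which maps to PostgreSQL's text type and imposes no practical length limit. A minimal sketch of such a model, reusing the field names from the snippet above:

from django.db import models

class SpeechTranscript(models.Model):
    # TextField maps to PostgreSQL `text`, so no maximum length has to be chosen up front.
    text = models.TextField(unique=True)  # unique=True is optional; it turns the duplicate check into a DB constraint
    confidence = models.FloatField()

With unique=True in place, the manual filter(...).count() == 0 check could also be replaced by get_or_create().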
I am trying to implement something similar to https://arxiv.org/pdf/1603.04259.pdf using the excellent gensim library; however, I am having trouble improving the quality of the results compared to Collaborative Filtering.
I have two models, one built on Apache Spark and the other using gensim Word2Vec, both on the GroupLens 20 million ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com
and I am running the gensim model locally. However, when I compare the results, the CF model is superior 9 out of 10 times (as in the example below, where its results are more similar to the searched movie, with a clear affinity towards Marvel movies).
For example, if I search for the movie Thor I get the results below:
Gensim
Captain America: The First Avenger (2011)
X-Men: First Class (2011)
Rise of the Planet of the Apes (2011)
Iron Man 2 (2010)
X-Men Origins: Wolverine (2009)
Green Lantern (2011)
Super 8 (2011)
Tron: Legacy (2010)
Transformers: Dark of the Moon (2011)
CF
Captain America: The First Avenger
Iron Man 2
Thor: The Dark World
Iron Man
The Avengers
X-Men: First Class
Iron Man 3
Star Trek
Captain America: The Winter Soldier
Below is my model configuration; so far I have tried playing with the window, min_count and size parameters, but without much improvement.
word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)
word2vec_model.build_vocab(movie_list)  # the vocabulary must be built before training
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
Any help in this regard is appreciated.
You don't mention which Collaborative Filtering algorithm you're trying, but maybe it's simply better than Word2Vec for this purpose. (Word2Vec is not doing awfully; why do you expect it to be better?)
Alternate meta-parameters might do better.
For example, window is the maximum distance between tokens that might affect each other, but the effective window used for each target token during training is randomly chosen from 1 to window, as a way to give nearby tokens more weight. Thus when some training texts are much larger than the window (as in your example rows), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant (such as Doc2Vec in pure PV-DBOW mode, dm=0, with every token used as a doc-tag).
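As a rough sketch of that last order-free idea (pure PV-DBOW with every token also used as a doc-tag), something along these lines might be worth trying. It assumes movie_list is the same iterable of per-user movie-ID lists as in your code, uses the older-style gensim parameter names from your snippet (newer releases rename size/iter to vector_size/epochs), and the lookup key is a made-up item ID:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each user's list of movie IDs becomes one document, and every movie ID also
# acts as a doc-tag, so the tag vectors learn order-free co-occurrence.
docs = [TaggedDocument(words=movies, tags=movies) for movies in movie_list]
d2v_model = Doc2Vec(docs, dm=0, size=100, min_count=50, iter=20, seed=1)
print(d2v_model.docvecs.most_similar('Thor (2011)'))  # hypothetical item key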
Depending on how much data you have, your size might be too large or too small. A different min_count, negative count, a greater iter/epochs, or a different sample level might work much better. (And perhaps even the things you've already tinkered with would only help after other changes are in place.)
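For instance, an untuned, illustrative configuration along those lines (the values are guesses to experiment with, not recommendations, and the parameter names follow the older gensim API from the question):

word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,        # try smaller or larger depending on how much data you have
    min_count=50,    # experiment with the frequency cutoff
    window=1000000,  # effectively "whole user history", if item order barely matters
    negative=10,     # more negative samples per positive example
    sample=1e-4,     # more aggressive downsampling of very frequent movies
    iter=20)         # more training passes
word2vec_model.build_vocab(movie_list)
word2vec_model.train(movie_list,
                     total_examples=word2vec_model.corpus_count,
                     epochs=word2vec_model.iter)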
I am trying to classify words into a score. The scoring for now is very simple: I just want to classify words as -1, 0, or 1 and sum the scores at the end. The classification is based on the emotional connotation of the word, so positive words like "great", "awesome", "excellent" would receive a score of +1, negative words like "bad", "ill", "not" would receive a score of -1, and neutral words would receive 0. For example, text = "I feel bad" would be pushed through a table, DB, or library in which words are pre-classified and would be summed as I(0) + feel(0) + bad(-1) = -1.
As an example, I have gone ahead and stripped a website of its HTML using the BeautifulSoup and urllib libraries (code below):
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.greenovergrey.com/living-walls/what-are-living-walls.php"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Output:
What are Living Walls? Definition of Green Wall and Vertical Garden
GREEN OVER CREY
Overview
/
What are living walls
/
Our green wall system vs. modular boxes
What are living walls
L iving walls or green walls are self sufficient vertical gardens that are attached to the exterior or interior of a building. They differ from green façades (e.g. ivy walls) in that the plants root in a structural support which is fastened to the wall itself. The plants receive water and nutrients from within the vertical support instead of from the ground.
The Green over Grey™ living wall system is different than others on the market today. It closely mimics nature and allows plants to grow to their full potential, without limitations. It is also by far the lightest.
Diversity is the key and by utilizing hundreds of different types of plants we create striking patterns and unique designs. We achieve this by utilizing the multitude of colours, textures and sizes that nature provides. Our system accommodates flowering perennials, beautiful foliage plants, ground covers and even allows for bushes, shrubs, and small trees!
Living walls are also referred to as green walls, vertical gardens or in French, mur végétal. The French botanist and artist Patrick Blanc was a pioneer by creating
the first vertical garden over 30 years ago.
Our system
consists of a frame, waterproof panels, an automatic irrigation system, special materials, lights when needed and of course plants. The frame is built in front of a pre existing wall and attached at various points; there is no damage done to the building. Waterproof panels are mounted to the frame; these are rigid and provide structural support. There is a layer of air between the building and the panels which enables the building to breath. This adds beneficial insulating properties and acts like rain-screening to protect the building envelop.
Our green walls are low maintenance thanks to an automatic irrigation system
My question is: what would be the best way to run this string through a table or library of pre-classified words? Does anyone know of any existing libraries of pre-classified words based on emotion? And how can I create a small table or DB to test with quickly?
Thank you all in advance,
Rusty
If you are looking for such a table, you can find a list of lexicons here: http://mpqa.cs.pitt.edu/lexicons/effect_lexicon/
You could load that list into a dictionary and apply the algorithm you describe. However, if you are looking for quick results, I recommend the TextBlob library. It is very easy to use, has a lot of features, and is a nice place to start for a project like the one you might be starting.
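For example, here is a minimal sketch of the dictionary-sum approach from the question; the tiny scores dict is a hand-made stand-in for a real lexicon loaded from the link above, and the TextBlob call shows the out-of-the-box alternative:

from textblob import TextBlob

# Stand-in lexicon; in practice, load one of the MPQA lexicons into this dict.
scores = {"great": 1, "awesome": 1, "excellent": 1,
          "bad": -1, "ill": -1, "not": -1}

def score_text(text):
    # Unknown words default to 0 (neutral).
    return sum(scores.get(word.lower().strip('.,!?'), 0) for word in text.split())

print(score_text("I feel bad"))                   # -> -1
print(TextBlob("I feel bad").sentiment.polarity)  # a float in [-1.0, 1.0]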
I don't know how to mark this question as a duplicate, but a quick Google search turned this up.
The first answer looks promising. I went to the link, and it just requires some information to access the file. I assume it would be in a format that is straightforward to parse.
I have a big text, and I am trying to get the most frequent word occurrences before and after a given word in this text.
For example:
I want to know the most frequent words occurring after "lake". Ideally I would get something like: (word 1, # occurrences), (word 2, # occurrences), ...
The same for the words that come before it.
I tried NLTK bigrams, but it seems it only finds the most common n-grams. Is it possible to somehow fix one of the words and find the most frequent n-grams based on that fixed word?
Thanks for any help!!
Are you looking for something like this?
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()
from nltk import bigrams
bgs = bigrams(text)
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)
from collections import Counter
c = Counter(map(lambda item: item[1], lake_bgs))
print(c.most_common())
Which outputs:
[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]
Note that on Python 3, filter and map are already lazy; on Python 2 you might want to use itertools.ifilter, imap, etc. if you have a very long text.
Edit: Here is the code for the words before and after 'lake'.
from nltk import trigrams
from collections import Counter

tgs = trigrams(text)
lake_tgs = list(filter(lambda item: item[1] == 'lake', tgs))
before_lake = [item[0] for item in lake_tgs]
after_lake = [item[2] for item in lake_tgs]
c = Counter(before_lake + after_lake)
print(c.most_common())
Note that this can be done using bigrams as well :)
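For instance, a minimal bigram-only sketch of the same before/after counting (it reuses text from the snippet above):

from nltk import bigrams
from collections import Counter

bgs = list(bigrams(text))
after_lake = Counter(b for a, b in bgs if a == 'lake')
before_lake = Counter(a for a, b in bgs if b == 'lake')
print((before_lake + after_lake).most_common())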
Just to add to @Ohad's answer, here is an n-gram implementation in NLTK with a bit more scalability.
#-*- coding: utf8 -*-
import string
from nltk import ngrams
from itertools import chain
from collections import Counter
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
"""
def ngrammer(txt, n):
    # Remove punctuation and digits, then build n-grams per line.
    sentences = "".join([i for i in txt if i not in string.punctuation and not i.isdigit()]).split('\n')
    return list(chain(*[ngrams(i.split(), n) for i in sentences]))

def before_after(ngs, word):
    # Keep only the n-grams whose second token is the focus word,
    # then collect the tokens immediately before and after it.
    word_grams = [g for g in ngs if g[1] == word]
    before = [g[0] for g in word_grams]
    after = [g[2] for g in word_grams]
    return before, after
bgs = ngrammer(text, 2)   # bigrams
tgs = ngrammer(text, 3)   # trigrams
xgs = ngrammer(text, 10)  # 10-grams

focus = 'lake'
bf, af = before_after(xgs, focus)
c = Counter(bf + af)

# Most common word before and after 'lake' from the 10-grams.
print(c.most_common()[0])
I am looking for a program that can make burndown charts which does not assume that, just because a day passes by, all work time for that day has automatically turned into progress for the current sprint. I am thus not particularly interested in finishing a sprint by some specific date; however, I am interested in keeping track of whether my estimates are accurate.
I only intend to use this for private programming (and non-programming) projects, so it does not have to be a full-fledged Scrum team solution (although I assume such a solution would work too).
To better explain what I am looking for, let's imagine I have a project "Paint my house" with a single sprint consisting of nine tasks:
Buy paint, brushes and cleaning liquid.
Wash the North wall.
Wash the West wall.
Wash the South wall.
Wash the East wall.
Paint the North wall.
Paint the West wall.
Paint the South wall.
Paint the East wall.
Since this will be done in my spare time, on any given day I might down-prioritize this and do other things, and the painting is highly dependent on the weather as well. Therefore, a calendar day passing does in absolutely no way imply that the project will make progress for that day.
Every single application that I have found that can make burndown charts fails utterly to fit this scenario. They all assume "calendar time passing equals progress". I want to supply the expected progress manually.
Any suggestions for a tool that is able to handle a project in this way?
(Related questions which do not provide an answer to my question:
https://stackoverflow.com/questions/829497/agile-methods-specifically-taylored-to-working-solo,
How have you implemented SCRUM for working alone?,
Using Scrum on a "Personal Time" Project)
Every single application that I have found that can make burndown charts fails utterly to fit this scenario.
That's because the whole point of a burndown chart is to predict when the sprint will finish and to know whether you're on schedule or not. If you cannot paint because it rains, then you cannot make progress and you fall behind schedule, as the burndown chart will show. But if you make time a variable, then you have no schedule; progress becomes independent of time and the trendline is completely unpredictable. So there's no point in having a burndown chart if the progression of time is unknown.
I think you are looking for something like Kanban instead of Scrum.
Here's an example of a chart in Kanban:
http://www.targetprocess.com/blog/2010/02/cumulative-flow-chart-in-kanban-real-usage-example.html
HTH (6 months later)