Web service or mechanism to detect a Person, Place, or Object - web-services

Is there a web service or a tool to detect whether a given text is the name of a person, a place, or an object (device)?
eg:
Input: Bill Clinton Output: Person
Input: Blackberry Output: Device
Input: New york Output: Place
Accuracy can be low. I have looked at OpenCyc but I couldn't get it to work. Is there a way I can use Wikipedia for this?
For a start, just separating a person from a thing would be great.

I think Wikipedia would be a very good source. Given the input, you could try to find an entry in Wikipedia and scrape the resulting page (if it exists).
Persons and places should have fairly distinct sets of data in the article - birth dates, locations, etc. - that you could use to tell them apart, and anything else is an object.
It's worth a shot anyway.
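A minimal sketch of that scrape-and-check idea, assuming you already have the article's plain text (the cue words below are illustrative guesses, not a definitive list):

```python
import re

# Person-like cues (birth dates, spouses) vs. place-like cues (coordinates,
# population); anything matching neither falls back to "Object".
# These cue lists are illustrative, not exhaustive.
PERSON_CUES = re.compile(r"\b(born|birth date|died|spouse)\b", re.IGNORECASE)
PLACE_CUES = re.compile(r"\b(coordinates|population|located in|capital)\b", re.IGNORECASE)

def classify_article(article_text):
    """Classify a Wikipedia article's plain text as Person, Place, or Object."""
    if PERSON_CUES.search(article_text):
        return "Person"
    if PLACE_CUES.search(article_text):
        return "Place"
    return "Object"

print(classify_article("William Jefferson Clinton (born August 19, 1946) is ..."))  # Person
print(classify_article("New York is a state ... Population: 19,000,000"))           # Place
print(classify_article("The BlackBerry is a line of smartphones ..."))              # Object
```

Checking person cues before place cues matters: a biography will often mention locations, but a place article rarely has a birth date.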

Looking at the output of Wolfram Alpha, it seems that you can possibly identify a person by searching Bill Clinton Birthday or just Bill Clinton, or you can identify a location by searching New York GPS coordinates or just New York, for even better results. Blackberry seems like a tough word for Alpha, because it keeps wanting to interpret it as a fruit. You might have luck searching Froogle to identify a device.
It seems like WA will give you a fairly decent accuracy, at least if you're using famous people/places.
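If you go the Wolfram|Alpha route programmatically, the probe queries above can be built against its v2 query API; a sketch (YOUR_APP_ID is a placeholder for a real, free API key):

```python
from urllib.parse import urlencode

# Build a Wolfram|Alpha v2 query URL for a probe query. A hit on a
# "birthday" probe hints at a person; "GPS coordinates" hints at a place.
def wolfram_query_url(query, app_id="YOUR_APP_ID"):
    base = "http://api.wolframalpha.com/v2/query"
    return base + "?" + urlencode({"appid": app_id, "input": query})

print(wolfram_query_url("Bill Clinton birthday"))
print(wolfram_query_url("New York GPS coordinates"))
```

You would then fetch each URL and check whether the response contains a successful pod for that probe.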

How about using a search engine? Google would be good, and I think Yahoo! has tools for building your own search.
I googled:
Results 1 - 10 of about 27,100,000 for "bill clinton" person
Results 1 - 10 of about 6,050,000 for "bill clinton" place
Results 1 - 10 of about 601,000 for "bill clinton" device
He's a person!
Results 1 - 10 of about 391,000,000 for "new york" place.
Results 1 - 10 of about 280,000,000 for "new york" person.
Results 1 - 10 of about 84,100,000 for "new york" device.
It's a place!
Results 1 - 10 of about 11,000,000 for "blackberry" person
Results 1 - 10 of about 36,600,000 for "blackberry" place
Results 1 - 10 of about 28,000,000 for "blackberry" device
Unfortunately, blackberry is a place as well. :-/
Note that only in the case of 'blackberry' did "device" even get close. Maybe you need to weight the page hit values. What is your application? Do you have any idea which "devices" you'd have to classify? What is the possible range of inputs?
Maybe you want to combine the results you get from different sources.
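A sketch of that hit-count idea with the suggested weighting applied; the counts are the ones quoted above, and the weights are illustrative guesses (not tuned values) meant to compensate for "place" being a very common word and "device" a rare one:

```python
# Pick the category whose (weighted) search-engine hit count is highest.
# Fetching the counts from a search engine is not shown here.
WEIGHTS = {"person": 1.0, "place": 0.8, "device": 2.0}  # illustrative guesses

def classify_by_hits(hit_counts, weights=WEIGHTS):
    return max(hit_counts, key=lambda cat: hit_counts[cat] * weights.get(cat, 1.0))

# Counts quoted in the answer above:
print(classify_by_hits({"person": 27_100_000, "place": 6_050_000, "device": 601_000}))     # person (bill clinton)
print(classify_by_hits({"person": 11_000_000, "place": 36_600_000, "device": 28_000_000}))  # device (blackberry)
```

With these particular weights, the ambiguous "blackberry" counts tip over to "device", which is the kind of effect weighting is meant to achieve.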

I think the basic task you're trying to accomplish is more formally known as named entity recognition. This task is nontrivial, and by only inputting the name stripped of any context, you're making it even harder.
For example, we'd like to think examples such as "Bill Clinton" and "New York" are obviously unambiguous, but looking at their disambiguation pages on Wikipedia shows that there are several potential entities each may refer to. "New York" is a state, a city, and a movie title. "Bill Clinton" is a bit less ambiguous if you're only looking at Wikipedia, but I'm sure you'll find dozens of Bill Clintons in any phonebook. It might also be the name of someone's sailboat or pet dog. What if someone inputs "Washington"? That could be a U.S. president, a state, a district, a city, a lake, a street, an island, a movie, one of several U.S. Navy ships, or a bridge, among other things. Determining which is the "correct" usage you'd want the web service to return could become very complicated.
As much as Cyc knows, I think you'll find it's still not as comprehensive as Wikipedia. However, the main downside to Wikipedia is that it's essentially unstructured. Personally, I find Cyc's API so convoluted and poorly documented that parsing Wikipedia's natural language almost seems easier.
If I had to implement such a web service from scratch, I'd start by downloading a snapshot of Wikipedia, and then write a parser that would read through all the articles and generate a named-entity index based on article titles. You could manually classify a few dozen examples as person/place/object, and train a classifier (Bayesian, MaxEnt, SVM) to automatically classify other examples based on the word frequencies of their articles.
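A toy version of that last step, with a hand-rolled bag-of-words Naive Bayes standing in for a real library and made-up word lists standing in for scraped article text:

```python
import math
from collections import Counter

# Train: count word frequencies per label from a few hand-labelled "articles".
def train(labelled_docs):
    counts = {}  # label -> Counter of word frequencies
    for label, text in labelled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

# Classify: pick the label maximizing the Laplace-smoothed log-likelihood.
def classify(counts, text, alpha=1.0):
    vocab = set(w for c in counts.values() for w in c)
    def log_prob(label):
        c, total = counts[label], sum(counts[label].values())
        return sum(math.log((c[w] + alpha) / (total + alpha * len(vocab)))
                   for w in text.lower().split())
    return max(counts, key=log_prob)

# Made-up training snippets standing in for real scraped articles:
model = train([
    ("person", "born august president elected spouse died office"),
    ("place",  "city population located coordinates capital state"),
    ("object", "device manufactured released product company sold"),
])
print(classify(model, "smartphone device released by the company"))  # object
```

With real Wikipedia text and a few dozen labelled articles per class, the same scheme (or a proper library implementation) should separate persons, places, and objects reasonably well.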

Related

Google Cloud VideoIntelligence Speech Transcription - Transcription Size

I use Google Cloud Speech Transcription as follows:
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]
operation = video_client.annotate_video(gs_video_path, features=features)
result = operation.result(timeout=3600)
I then present the transcripts and store them as Django objects in PostgreSQL as follows:
transcriptions = result.annotation_results[0].speech_transcriptions
for transcription in transcriptions:
    best_alternative = transcription.alternatives[0]
    confidence = best_alternative.confidence
    transcript = best_alternative.transcript
    if SpeechTranscript.objects.filter(text=transcript).count() == 0:
        SpeechTranscript.objects.create(text=transcript,
                                        confidence=confidence)
        print(f"Adding -> {confidence:4.10%} | {transcript.strip()}")
For instance, the following is the text that I receive from a sample video:
94.9425220490% | I refuse to where is it short sleeve dress shirt. I'm just not going there the president of the United States is a visit to Walter Reed hospital in mid-july format was the combination of weeks of cajoling by trump staff and allies to get the presents for both public health and political perspective wearing a mask to protect against the spread of covid-19 reported in advance of that watery trip and I quote one presidential aide to the president to set an example for a supporters by wearing a mask and the visit.
94.3865835667% | Mask wearing is because well science our best way to slow the spread of the coronavirus. Yes trump or Matthew or 3 but if you know what he said while doing sell it still anybody's guess about what can you really think about NASCAR here is what probably have a mass give you probably have a hospital especially and that particular setting were you talking to a lot of people I think it's but I do believe it. Have a a time and a place very special line trump saying I've never been against masks but I do believe they have a time and a place that isn't exactly a ringing endorsement for mask wearing.
94.8513686657% | Republican skip this isn't it up to four men over the perfumer's that wine about time and place should be a blinking red warning light for people who think debate over whether last for you for next coronavirus. They are is finally behind us time in a place lined everything you need to know about weird Trump is like headed next time he'll get watery because it was a hospital and will continue to express not so scepticism to wear masks in public house new CDC guidelines recommending that mask to be worn inside and one social this thing is it possible outside he sent this?
92.9862976074% | He wearing a face mask as agreed presidents prime minister's dictators Kings Queens and somehow. I don't see it for myself literally main door he responded this way back backstage, but they said you didn't need it trump went to Michigan to this later and he appeared in which personality approaching Mark former vice president Joe Biden
94.6677267551% | In his microwave fighting for wearing a mask and he walked onto the stage where it is massive mask there's nobody understands and there's any takes it off you like to have it hanging off you. I think it makes them feel good frankly if you want to know the truth who's got the largest basket together. Seen it because trump thinks that maths make him and people generally I guess what a week or something is resistant wearing one in public from 1 today which has had a correlation between the erosion of the public's confidence and trump have the corner coronavirus and his number is SE6 a second term in the 67.
94.9921131134% | The coronavirus pandemic in the heels of national and swings they both lots of them that show trump slipping further and further behind former vice president Joe Biden when it comes to General Election good policy would seem to make for good politics at all virtually every infectious disease expert believes that wearing masks in public is our best to contain the spread of coronavirus until a vaccine would do well to listen to buy on this one a mare is the point we make episode every Tuesday and Thursday make sure to check them all out.
What is the expected size of a transcript generated within the speech transcription results? What determines the size of each transcript? What are the maximum and minimum character lengths? How should I size my SQL table column in order to be prepared for the expected transcript size?
As I mentioned in the comments, the Video Intelligence transcripts are split into chunks covering roughly 50-60 seconds of the video.
I have created a Public Issue Tracker case, link, so the product team can clarify this information within the documentation. Although I do not have an ETA for this request, I encourage you to follow the case's thread.
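In the meantime, since the maximum transcript length isn't documented, one pragmatic option is to measure the lengths you actually observe and store the text in an unbounded column (Django's TextField maps to PostgreSQL's TEXT type, which has no practical length limit) rather than guessing a VARCHAR size. A small sketch:

```python
# Compute min/max/average character lengths over observed transcripts,
# to sanity-check storage assumptions before choosing a column type.
def length_stats(transcripts):
    lengths = [len(t) for t in transcripts]
    return {"min": min(lengths), "max": max(lengths),
            "avg": sum(lengths) / len(lengths)}

# Stand-in samples; in practice, pass the transcripts you collected above.
samples = ["I refuse to wear a short sleeve dress shirt.",
           "Mask wearing is, well, science."]
print(length_stats(samples))
```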

Gensim: Word2Vec Recommender accuracy Improvement

I am trying to implement something similar to https://arxiv.org/pdf/1603.04259.pdf using the excellent gensim library, but I am having trouble improving the quality of the results compared to Collaborative Filtering.
I have two models: one built on Apache Spark and the other using gensim Word2Vec on the GroupLens 20 million ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com
and I am running the gensim model locally. However, when I compare the results, the CF model is superior 9 times out of 10 (as in the example below, where its results are more similar to the searched movie, with an affinity towards Marvel movies).
E.g., if I search for the movie Thor, I get the results below:
Gensim
Captain America: The First Avenger (2011)
X-Men: First Class (2011)
Rise of the Planet of the Apes (2011)
Iron Man 2 (2010)
X-Men Origins: Wolverine (2009)
Green Lantern (2011)
Super 8 (2011)
Tron: Legacy (2010)
Transformers: Dark of the Moon (2011)
CF
Captain America: The First Avenger
Iron Man 2
Thor: The Dark World
Iron Man
The Avengers
X-Men: First Class
Iron Man 3
Star Trek
Captain America: The Winter Soldier
Below is my model configuration. So far I have tried playing with the window, min_count and size parameters, but without much improvement.
word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
Any help in this regard is appreciated.
You don't mention which Collaborative Filtering algorithm you're trying, but maybe it's just better than Word2Vec for this purpose. (Word2Vec is not doing awfully; why do you expect it to be better?)
Alternate meta-parameters might do better.
For example, window is the maximum distance between tokens that might affect each other, but the effective window used when training each target token is randomly chosen from 1 to window, as a way to give nearby tokens more weight. Thus when some training-texts are much longer than the window (as in your example row), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant (such as Doc2Vec in pure PV-DBOW dm=0 mode, with every token used as a doc-tag).
Depending on how much data you have, your size might be too large or too small. A different min_count, negative count, greater 'iter'/'epochs', or sample level might work much better. (And perhaps even things you've already tinkered with would only help after other changes are in place.)
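To make the window point concrete, here is a sketch of the inclusion probability implied by that sampling scheme (as I understand gensim's implementation): a context token at distance d from the target is used with probability (window - d + 1) / window:

```python
# Probability that a context token at the given distance falls inside the
# randomly chosen effective window (uniform over 1..window).
def inclusion_probability(distance, window):
    if distance > window:
        return 0.0
    return (window - distance + 1) / window

# With window=30 (as in the question's configuration), distant tokens in a
# long watch-history "sentence" are trained on only a fraction of the time.
for d in (1, 10, 20, 30):
    print(f"distance {d:2d}: used {inclusion_probability(d, 30):.0%} of the time")
```

So with a watch history much longer than 30 items, co-watched movies at opposite ends of the list contribute nothing at all, which is one plausible reason the CF model wins.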

Re-Training spaCy's NER v1.8.2 - Training Volume and Mix of Entity Types

I'm in the process of (re-)training spaCy's Named Entity Recognizer and have a couple of doubts that I hope a more experienced researcher/practitioner can help me figure out:
If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
If I introduce a new label, is it best if the numbers of entities for each label are roughly the same (balanced) during training?
Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set, e.g.: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?
can I use the same text for various labels? e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?
on a similar note, let's assume I want spaCy to also recognize a second COMMODITY; could I then just use the same sentence and label a different region, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(69,80, 'COMMODITY')], )? Is that how it's supposed to be done?
what ratio between new and other (old) labels is considered reasonable?
Thanks
PS: I'm working with Python 2.7 on Ubuntu 16.04 using spaCy 1.8.2.
For a full answer by Matthew Honnibal, check out issue 1054 on spaCy's GitHub page. Below are the most important points as they relate to my questions:
Question (Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
Answer(A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have, and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc, and see how that affects your accuracy.
Q 2: If I introduce a new label, is it best if the number of the entities of that label are roughly the same (balanced) during training?
A: There's trade-off between making the gradients too sparse, and making the learning problem too unrepresentative of what the actual examples will look like.
Q 3: Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set:
A: No, one should annotate all the entities in that text, so the example above: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ) should be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], )
can I use the same text for various labels?:
A: Not in the way the examples were given. See previous answer.
what ratio between new and other (old) labels is considered reasonable?:
A: See answer Q 2.
PS: Quotations are taken directly from the GitHub issue answer.
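As a sanity check on the character offsets in the fully annotated example above, a small snippet that prints each labelled span:

```python
# Verify that each (start, end, label) annotation lines up with the actual
# characters of the training text (spaCy offsets are end-exclusive).
text = "The Business Standard published in its recent issue on crude oil and natural gas ..."
annotations = [(4, 21, "ORG"), (55, 64, "COMMODITY"), (69, 80, "COMMODITY")]

for start, end, label in annotations:
    print(f"{label:9s} -> {text[start:end]!r}")
```

Running a check like this over your whole training set before training catches off-by-one offset errors early.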

House number parameter in cloudmade doesn't work

I have a problem with the house parameter of the CloudMade geocoding API.
For example, I am looking for house number 10 in the German city of Gießen:
http://geocoding.cloudmade.com/8ee2a50541944fb9bcedded5165f09d9/geocoding/v2/find.html?query=zipcode:35390;city:giessen;house:10;street:bahnhofstrasse;country:Germany&return_geometry=true&results=1
CloudMade finds the correct street, but the result for house number 10 is just the middle of the street. I have the same problem with all streets.
The API documentation says I am using it correctly: http://developers.cloudmade.com/projects/show/geocoding-http-api#Structured-search
They seem to ignore the house key and just return the same result as if it hadn't been provided. But searching on http://maps.cloudmade.com/ returns the same vague result. While the data you are searching for is in OSM's database, it might not be in CloudMade's database for whatever reason. Therefore I guess this is a problem with CloudMade's geocoding database and not a problem with their API.
Sometimes there is no house-number data in our data set, so we return only the street. Also, in a couple of months we'll launch a new geocoding service with more detailed data.

Getting stocks by industry via Yahoo Finance

I want to list all available industries (like http://biz.yahoo.com/p/) and show all corresponding stocks.
Until now I'm using YAHOO.Finance.SymbolSuggest.ssCallback for the symbol suggestion and http://finance.yahoo.com/d/quotes.csv?s=... for getting the stock's data.
Does anyone have any idea how to get all industries and corresponding stocks?
Is there another hidden Yahoo API?
The lists of all available industries are called GICS Sectors by Standard and Poor's (the S&P 500 uses them) and ICB by Dow Jones and FTSE; the latter is hence used by NASDAQ, NYSE, and other markets.
It seems like Yahoo uses a third industry classification by Morningstar, but since I'm not quite sure, I will give both ways of retrieving the data.
Morningstar
I don't know if Yahoo really sticks to this classification, but some names were really close, so let's look at it:
You need to go to their Index Data and in each sector, click on it and then at the bottom View complete index holdings.
It's not as precise as Yahoo's industry list, but it's all you can do with Morningstar. Not very convincing, I know...
GICS Sectors
GICS Sectors are now a trademark of Standard and Poor's, so the data has to be sought on S&P's website.
Short answer: take a look at this page; you will need to be registered (it's free and easy) and you can download spreadsheets (xls) with stocks and their corresponding sectors. Nevertheless, things aren't always easy, and you will have to do a bit of searching to retrieve all stocks with their corresponding industries. For example, the file INDICATED_RATE_CHANGE.xls will give you some companies and their sectors for each month of 2012. Using that and SP500_DividendAristocrats_2012.xls, you should be able to retrieve at least a large part of the S&P 500 companies.
ICB
ICB is used by NYSE, NASDAQ, etc., so it's a lot simpler than S&P and Morningstar. Here is your answer. BOOM! Direct link!
Link is dead :(
Finally
I strongly advise you to use the simplest and most-used industry classification index: the ICB. It will always be available and publicly displayed, since millions of investors rely on it every day, without your having to use S&P financial services or Morningstar brokerage services...
EDIT
You can look at nasdaq.com to retrieve all companies and their corresponding sectors: here for NASDAQ and here for NYSE.
Get all industry-IDs from here:
http://biz.yahoo.com/ic/ind_index.html
(look at the links)
Then use YQL ( https://developer.yahoo.com/yql/console/ )
with a query like this:
select * from yahoo.finance.industry where id=912
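A sketch of issuing that query over HTTP rather than through the console (the public YQL endpoint and the community-datatables env shown here may change or be retired):

```python
from urllib.parse import urlencode

# Build a YQL REST URL for the query above. yahoo.finance.industry is a
# community table, so the datatables env parameter is included; format=json
# requests JSON instead of the default XML.
def yql_url(query):
    base = "https://query.yahooapis.com/v1/public/yql"
    return base + "?" + urlencode({"q": query, "format": "json",
                                   "env": "store://datatables.org/alltableswithkeys"})

print(yql_url("select * from yahoo.finance.industry where id=912"))
```

Fetching that URL should return the industry's companies as JSON, which you can then pair with the industry IDs scraped from http://biz.yahoo.com/ic/ind_index.html.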