I am completely new to the field of data mining and the WEKA tool (I just installed it today).
I need to do topic identification based on short text sentences.
Let's say I have several categories:
- politics
- sports
- other
I am thinking of doing the following:
Have a list of terms that I compare the text to:
Sports:
NFL
NBA
Touch down
etc
Politics:
election
president
Obama
etc
Also, I would like to add more categories.
Then I would apply some algorithm such as SVM or Naive Bayes with the help of WEKA.
Any idea on how to start doing this with WEKA?
I have searched for some WEKA tutorials, but I can't seem to find any examples similar to what I am trying to do.
Any help to start me up will be appreciated.
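For illustration, here is a minimal sketch of the pipeline described above (bag-of-words features plus Naive Bayes). It uses scikit-learn rather than WEKA so the example stays in Python, and the training sentences and labels are made up; in WEKA the equivalent steps would be the StringToWordVector filter followed by the NaiveBayes or SMO (SVM) classifier.

    # Minimal sketch (scikit-learn, not WEKA): bag-of-words features + Naive Bayes.
    # The training sentences and labels below are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "NFL touchdown in the fourth quarter",   # sports
        "NBA finals game tonight",               # sports
        "Obama wins the presidential election",  # politics
        "Senate debates the new bill",           # politics
    ]
    train_labels = ["sports", "sports", "politics", "politics"]

    vectorizer = CountVectorizer(lowercase=True)   # turn sentences into term-count vectors
    X_train = vectorizer.fit_transform(train_texts)

    clf = MultinomialNB()                          # Naive Bayes over the term counts
    clf.fit(X_train, train_labels)

    X_new = vectorizer.transform(["The president gave a speech on the election"])
    print(clf.predict(X_new))                      # expected: ['politics']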
I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and contain a lot of words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part-of-speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
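As a rough sketch of that idf-based ranking (my own illustration using scikit-learn's TfidfVectorizer; the description strings are invented), you could do something like:

    # Sketch of the idf-based keyword ranking described above.
    from sklearn.feature_extraction.text import TfidfVectorizer

    descriptions = [
        "500 units shoe green sport tennis import oversea plastic",
        "shoe leather brown import 200 units",
        "tennis racket sport carbon oversea",
    ]

    # Unigrams and bigrams; smooth_idf applies the smoothed idf variant.
    vec = TfidfVectorizer(ngram_range=(1, 2), smooth_idf=True)
    vec.fit(descriptions)

    # Sort terms by idf: low idf = appears in many descriptions, high idf = rare.
    terms_by_idf = sorted(zip(vec.get_feature_names_out(), vec.idf_), key=lambda t: t[1])
    for term, idf in terms_by_idf:
        print(round(idf, 2), term)

Inspecting either end of this list is where the handpicking/erasing of terms would happen.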
I'm in the process of (re-) training spaCy's Named Entity Recognizer and have a couple of doubts that I hope a more experienced researcher/practitioner can help me figure out:
If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
If I introduce a new label, is it best if the number of entities of that label is roughly the same (balanced) during training?
Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set, e.g.: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?
can I use the same text for various labels? e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?
on a similar note, let's assume I want spaCy to also recognize a second COMMODITY; could I then just use the same sentence and label a different region, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(69,80, 'COMMODITY')], )? Is that how it's supposed to be done?
what ratio between new and other (old) labels is considered reasonable?
Thanks
PS I'm working with Python 2.7 on Ubuntu 16.04 using spaCy 1.8.2
For a full answer by Matthew Honnibal, check out issue 1054 on spaCy's GitHub page. Below are the most important points as they relate to my questions:
Question (Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
Answer (A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc., and see how that affects your accuracy.
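A sketch of that check might look like the following; train_fn and eval_fn are placeholders for whatever spaCy training and evaluation code you already have, not spaCy API calls:

    import random

    def learning_curve(train_data, dev_data, train_fn, eval_fn, sizes=(250, 500, 750)):
        # Train on growing subsets and report accuracy to see the examples/accuracy curve.
        random.shuffle(train_data)
        for n in sizes:
            model = train_fn(train_data[:n])    # placeholder: your NER training routine
            score = eval_fn(model, dev_data)    # placeholder: e.g. entity F-score on held-out data
            print(n, "examples ->", score)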
Q 2: If I introduce a new label, is it best if the number of entities of that label is roughly the same (balanced) during training?
A: There's a trade-off between making the gradients too sparse and making the learning problem too unrepresentative of what the actual examples will look like.
Q 3: Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set:
A: No, one should annotate all the entities in that text, so the example above: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ) should be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], )
can I use the same text for various labels?:
A: Not in the way the examples were given. See previous answer.
what ratio between new and other (old) labels is considered reasonable?:
A: See answer Q 2.
PS: Passages in double quotes are direct quotes from the GitHub issue answer.
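Putting the answer to Q 3 into code, a fully annotated training example, kept in the same format the question uses, would look roughly like this (the character offsets are the ones given in the question):

    # One training example with every entity in the sentence annotated,
    # as recommended in the answer to Q 3.
    TRAIN_DATA = [
        (
            "The Business Standard published in its recent issue on crude oil and natural gas ...",
            [(4, 21, "ORG"), (55, 64, "COMMODITY"), (69, 80, "COMMODITY")],
        ),
    ]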
I have unstructured Twitter data which was retrieved by Apache Flume and stored in HDFS. Now I want to convert this unstructured data into a structured form using MapReduce.
Tasks I want to do using MapReduce:
1. Convert the unstructured data into a structured form.
2. Keep only the text part that contains the tweet.
3. Identify the tweets for a particular topic and group them according to their sub-topics.
e.g. I have tweets about Samsung handsets, so I want to group them by handset, like Samsung Note 4, Samsung Galaxy, etc.
It is my college project, and my guide suggested that I use the k-means algorithm. I have searched a lot on k-means but failed to understand how to identify the centroids for this; basically, I don't understand how to apply k-means to this situation in MapReduce.
Please guide me if I am doing something wrong, as I am new to this concept.
K-means is a clustering algorithm: it clusters or groups similar data and computes a common centroid for each cluster. You can create time series for the questions you have mentioned above and group the tweets according to their topic.
A k-means implementation in MapReduce:
https://github.com/himank/K-Means
For using k-means on Twitter datasets, you can check the following links:
https://github.com/JulianHill/R-Tutorials/blob/master/r_twitter_cluster.r
http://www.r-bloggers.com/cluster-your-twitter-data-with-r-and-k-means/
http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html
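As a single-machine illustration of the clustering step (not the MapReduce version linked above), representing each tweet as a TF-IDF vector and letting k-means compute the centroids could look like this in Python; the sample tweets are made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    tweets = [
        "Loving the camera on the Samsung Note 4",
        "Samsung Note 4 battery lasts all day",
        "Samsung Galaxy screen looks amazing",
        "Got a new Samsung Galaxy today",
    ]

    # Each tweet becomes a TF-IDF vector; k-means then finds k centroids in that space.
    X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    for label, tweet in zip(km.labels_, tweets):
        print(label, tweet)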
I want to list all available industries (like http://biz.yahoo.com/p/ ) and show all corresponding stocks.
Until now I'm using YAHOO.Finance.SymbolSuggest.ssCallback for the symbol suggestion and http://finance.yahoo.com/d/quotes.csv?s=... for getting the stock's data.
Does anyone have any idea how to get all industries and corresponding stocks?
Is there another hidden Yahoo API?
Lists of all available industries are called GICS Sectors for Standard and Poor's (the S&P 500 uses that) and ICB for Dow Jones and FTSE; ICB is therefore also used by Nasdaq, NYSE and other markets.
It seems like Yahoo uses a third industry classification by Morning Star, but since I'm not quite sure I will give both ways of retrieving data.
Morning Star
I don't know if Yahoo really sticks to this classification, but some names were really close so let's see it:
You need to go to their Index Data page, click on each sector, and then at the bottom click View complete index holdings.
It's not as precise as the Yahoo industry list, but it's all you can do with Morning Star. Not very convincing, I know...
GICS Sectors
GICS Sectors are now a trademark of Standard and Poor's, so the data has to be sought on S&P's website.
Short answer: take a look at this page, you will need to be registered (it's free and easy) and you can download spreadsheets (xls) with stocks and corresponding sectors. Nevertheless, things aren't always easy, and you will have to do a bit of a search to retrieve all stocks with their corresponding industries. For example, the file INDICATED_RATE_CHANGE.xls will give you some companies and their sectors in each month of 2012. Using that and SP500_DividendAristocrats_2012.xls you should be able to retrieve at least a large part of S&P 500 companies.
ICB
ICB is used by NYSE, NASDAQ, etc., so it's a lot simpler than S&P and MorningStar. Here is your answer. BOOM! Direct link!
Link is dead :(
Finally
I strongly advise you to use the simpler and most-used industry classification index: the ICB. It will always be available and publicly displayed, since millions of investors rely on it every day, without you having to use S&P financial services or MorningStar brokerage services...
EDIT
You can look at nasdaq.com to retrieve all companies and their corresponding sectors: here for Nasdaq and here for NYSE.
Get all industry-IDs from here:
http://biz.yahoo.com/ic/ind_index.html
(look at the links)
Then use YQL ( https://developer.yahoo.com/yql/console/ )
with a query like this:
select * from yahoo.finance.industry where id=912
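For example, that query could be sent over HTTP from Python roughly as follows; this assumes the public YQL endpoint and the community datatables env parameter that the finance tables required, both of which may no longer be available:

    import requests

    params = {
        "q": "select * from yahoo.finance.industry where id=912",
        "format": "json",
        # Community tables such as yahoo.finance.industry needed this env parameter.
        "env": "store://datatables.org/alltableswithkeys",
    }
    resp = requests.get("https://query.yahooapis.com/v1/public/yql", params=params)
    print(resp.json()["query"]["results"])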
How can I implement a Yelp-like search?
There are two types of searches on Yelp.
Simple search using the zip code, city and state in the U.S.
I'm using PostgreSQL and wonder if there is a good dataset I can use that has city, state and zip code. I was hoping to find a good geo shape file and use GeoDjango, where I could just write something like, say, Store.objects.filter(coordinates__in=cityNameORZipCode).
There seem to be some zip code databases that I can use, but I really don't know where I can find a good city and state dataset. The last option is to create my own city name and state tables and link them to Stores, but I'm not sure if this is a smart thing to do... hm.
Yelp has a map search.
If you zoom in or out on the Google map, it searches local businesses according to the map area you are viewing. I think this is amazing. How can I do this?
It's looking dark right now. Please shed some light.
You're asking a very broad and unanswerable question, but a good place to start for data in the U.S. is at the Census Bureau. For example:
State and State Equivalent Areas
County and County Equivalent Areas
The full list:
http://www.census.gov/geo/www/cob/bdy_files.html
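Once boundary files like the Census ones above are loaded into PostGIS, the kind of GeoDjango query sketched in the question could look roughly like this; the model and field names are assumptions for illustration, and the map-area search simply filters stores inside the viewport rectangle the client sends:

    # Assumed GeoDjango model and a viewport query; names are illustrative only.
    from django.contrib.gis.db import models
    from django.contrib.gis.geos import Polygon

    class Store(models.Model):
        name = models.CharField(max_length=200)
        location = models.PointField()  # store coordinates (longitude/latitude)

    def stores_in_viewport(west, south, east, north):
        # Build a rectangle from the map bounds (e.g. Google Maps getBounds())
        # and return the stores whose point falls inside it.
        bbox = Polygon.from_bbox((west, south, east, north))
        return Store.objects.filter(location__within=bbox)

The city/zip search works the same way: load the state, county or place polygons from the shapefiles above and filter stores with location__within=that_polygon.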