What is the best data mining method for vehicle search? - data-mining
I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay Motors, and Craigslist. I also have a large database of standard vehicle names and specifications about them. What I would like to do is, for each record that I find through a classifieds site, determine exactly which vehicle model and style it is (from my database). For example, a standard name for a Ford truck in my db is:
2003 Ford F150.
However, on classifieds sites people might refer to it as "2003 Ford F 150", "2003 Ford f-150", or "03 Ford truck 150". Is there an effective data mining/text classification algorithm that can normalize these texts to the standard name above?
You could use the Levenshtein distance to match the found string against your database records.
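For instance, here is a minimal sketch of Levenshtein-based matching; the dynamic-programming implementation and the best_match helper are illustrative, not from any particular library:

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def best_match(listing, standard_names):
        # return the database name with the smallest edit distance
        return min(standard_names, key=lambda s: levenshtein(listing.lower(), s.lower()))

    print(best_match("03 Ford f-150", ["2003 Ford F150", "2003 Ford Focus"]))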
Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.
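A sketch of the term-vector/cosine approach, assuming scikit-learn is available; using character n-grams as the terms is a variant that tolerates the "F150"/"F 150"/"f-150" spelling differences, and the names here are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    standard_names = ["2003 Ford F150", "2003 Ford Focus", "2004 Toyota Tacoma"]

    # character n-grams within word boundaries are forgiving of spelling variants
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    name_vectors = vectorizer.fit_transform(standard_names)

    query = vectorizer.transform(["03 Ford truck 150"])
    scores = cosine_similarity(query, name_vectors)[0]
    print(standard_names[scores.argmax()])  # best cosine match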
If you're going to develop a whole search engine intended to scale in both usage and size, you will need something robust to support your queries.
If you're going to use edit distance, Bed-trees provide a good alternative for your index structure. Another good approach, depending on the size of your dataset, is a Levenshtein automaton. Levenshtein automata are also great at providing auto-complete functionality, which you may need since you're developing a search engine.
An alternative to edit distance is to use n-grams combined with the Jaccard index. For this approach you can use MinHash + LSH. Also, the Jaccard distance (1 - Jaccard index) respects the triangle inequality and can therefore be used in a metric tree such as a VP-tree.
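A minimal sketch of the n-gram/Jaccard idea in pure Python (the function names are illustrative):

    def ngrams(s, n=3):
        # character trigrams of a normalized string
        s = s.lower()
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def jaccard(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    q = ngrams("03 Ford f-150")
    print(jaccard(q, ngrams("2003 Ford F150")))   # relatively high
    print(jaccard(q, ngrams("2003 Ford Focus")))  # lower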
One of these approaches will certainly help you.
Related
calculate nearest document using fasttext or word2vec
I have a small system of about 1000 documents. For each document I would like to show links to the X "most similar" documents. However, the documents are not labeled in any way, so this would need some kind of unsupervised method. It feels like fastText would be a good candidate, but I can't wrap my head around how to do it when the data isn't labeled. I can calculate the word vectors, although what I really need is a vector for the whole document.
The Paragraph Vector algorithm, known as Doc2Vec in libraries like Python gensim, can train a model that gives a single vector for a run of text, and so could be useful for your need. Note, though, that typical published work uses tens of thousands to millions of documents; just 1,000 would be a very small training set.

You can also simply average all the word vectors of a text together (perhaps in some weighted fashion) to get a simple, crude vector for the full text that will often somewhat work for this purpose. (You could use word vectors from classic word2vec or FastText here.)

Similarly, if you have word vectors but not full doc vectors, there's a technique called "Word Mover's Distance" that calculates a word-vector-adjusted "distance" between two texts. It often does well at highlighting near-paraphrases, though it's somewhat expensive to calculate (especially for longer texts).

In some cases, just converting all docs to their "bag of words" representation – a giant vector containing counts of words used – and then ranking docs by how many words they share is a good-enough similarity measure.

Also, full-text index/search frameworks like Solr or Elasticsearch can sometimes take full documents as queries, giving nicely ranked results. (This often works by picking the example document's most significant words and using those words as fuzzy full-text queries against the full document set.)
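A minimal Doc2Vec sketch with gensim, assuming docs is your list of pre-tokenized documents; the variable names and hyperparameters are illustrative, and a corpus this small may need many epochs:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [["first", "document", "tokens"],
            ["second", "document", "tokens"]]  # replace with your ~1000 tokenized docs

    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(docs)]
    model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)

    # the X most similar documents to document 0, by cosine similarity
    # (model.dv in gensim 4.x; model.docvecs in older versions)
    print(model.dv.most_similar(0, topn=5))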
Clear approach for assigning semantic tags to each sentence (or short documents) in python
I am looking for a good approach using Python libraries to tackle the following problem: I have a dataset with a column that holds product descriptions. The values in this column can be very messy and contain a lot of words that are not related to the product. I want to know which rows are about the same product, so I need to tag each description sentence with its main topics. For example, given "500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like "shoe" and "sport". So I am looking to build an approach for semantic tagging of sentences, not part-of-speech tagging. Assume I don't have labeled (tagged) data for training. Any help would be appreciated.
The lack of labeled data means you cannot apply any supervised semantic classification method using word vectors, which would otherwise be the optimal solution to your problem. An alternative, however, is to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and hand-pick (or erase) words that you deem important (or unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
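A minimal sketch of that document-frequency pass in pure Python; the smoothing formula and variable names are illustrative:

    import math
    from collections import Counter

    descriptions = [["500", "units", "shoe", "green", "sport", "tennis"],
                    ["shoe", "red", "import", "oversea"]]  # your tokenized rows

    n = len(descriptions)
    df = Counter(tok for doc in descriptions for tok in set(doc))

    # smoothed idf; low values = frequent words, worth inspecting by hand
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

    for tok, weight in sorted(idf.items(), key=lambda kv: kv[1]):
        print(tok, round(weight, 3))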
word2vec guessing word embeddings
Can word2vec be used for guessing words from just context? Having trained the model with a large data set, e.g. Google News, how can I use word2vec to predict a similar word with only context, e.g. with input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen. I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words. But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly varying quality.

Most word2vec libraries don't offer a direct interface for showing ranked predictions given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context words, for some training modes. See: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word

However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):

_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri

A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories, so it can't determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will all be nearer each other than other words.

There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping with such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision", but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.
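A minimal sketch of the predict_output_word() call mentioned above; the model path is hypothetical, and the method only applies to some training modes, as the answer notes:

    from gensim.models import Word2Vec

    model = Word2Vec.load("news_word2vec.model")  # hypothetical pretrained model

    context = ["dominated", "chess", "years", "compete",
               "top", "players", "st", "louis", "missouri"]
    # rough ranked guesses for the blank, given the context words
    for word, prob in model.predict_output_word(context, topn=10):
        print(word, prob)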
Custom word weights for sentences when calling h2o transform and word2vec, instead of straight AVERAGE of words
I am using the H2O machine learning package to do natural language predictions, including the functions h2o.word2vec and h2o.transform. I need sentence-level aggregation, which is provided by the AVERAGE parameter value: h2o.transform(word2vec, words, aggregate_method = c("NONE", "AVERAGE")) However, in my case I strongly wish to avoid equal weighting of "the" and "platypus", for example. Here's a scheme I concocted to achieve custom word weightings: if H2O's word2vec "AVERAGE" option uses all the words, including any duplicates that appear, then I could effect a custom word weighting when calling h2o.transform by adding extra duplicates of certain words to my sentences when I want to weight them more heavily than other words. Can any H2O experts confirm that the word2vec AVERAGE parameter uses all the words, rather than just the unique words, when computing the AVERAGE of the words in a sentence? Alternatively, is there a better way? I tried, but I find myself unable to imagine any correct math to multiply the sentence average by some factor after it was already computed.
Yes, h2o.transform will consider each occurrence of a word for the averaging, not just the unique words. Your trick will therefore work. There is currently no direct way to provide user-defined weights. You could probably do an ugly hack and directly weight the word embeddings, but that wouldn't be a straightforward solution I could recommend. We can add this feature to H2O. I would love to hear what API would work for you (i.e. how you would like to provide the weights).
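If you do want to weight the embeddings yourself, one hedged sketch is to pull the per-word vectors out of H2O (e.g. via the aggregate_method = "NONE" option mentioned in the question) and compute the weighted average with numpy outside H2O. Everything below – the function and variable names and the weights dict – is illustrative, not an H2O API:

    import numpy as np

    def weighted_sentence_vector(tokens, word_vectors, weights, default_weight=1.0):
        # word_vectors: dict mapping word -> numpy vector (exported per-word embeddings)
        # weights:      dict mapping word -> importance (e.g. idf values)
        vecs, ws = [], []
        for tok in tokens:
            if tok in word_vectors:
                vecs.append(word_vectors[tok])
                ws.append(weights.get(tok, default_weight))
        if not vecs:
            return None
        return np.average(np.vstack(vecs), axis=0, weights=ws)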
Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers
I have been using Weka's J48 and Naive Bayes Multinomial (NBM) classifiers on frequencies of keywords in RSS feeds to classify the feeds into target categories. For example, one of my .arff files contains the following data extracts:

    @attribute Keyword_1_nasa_Frequency numeric
    @attribute Keyword_2_fish_Frequency numeric
    @attribute Keyword_3_kill_Frequency numeric
    @attribute Keyword_4_show_Frequency numeric
    ...
    @attribute RSSFeedCategoryDescription {BFE,FCL,F,M,NCA,SNT,S}

    @data
    0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
    0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
    0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
    ...
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
    ...
    20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,221,72,0,72,0,SNT
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
    0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S

And so on: there is a total of 570 rows, where each one contains the frequencies of the keywords in a feed for a day. In this case, there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and postfixed with 'Frequency'.

I am using 10-fold cross-validation for both the J48 and NBM classifiers on a 'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence and a minimum of 2 objects for J48.

So far, my classification rates for instances of varying numbers of days, date ranges and actual keyword frequencies, with both J48 and NBM, have been consistently in the 50-60% range. But I would like to improve this if possible. I have reduced the decision tree confidence level, sometimes as low as 0.1, but the improvements are very marginal.

Can anyone suggest any other way of improving my results?

To give more information, the basic process here involves a diverse collection of RSS feeds, where each one belongs to a single category. For a given date range, e.g. 01-10 Sep 2011, the text of each feed's item elements is combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the 64 most popular words. Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene.
So a popular keyword from the title such as 'declines' is stemmed to 'declin'; then, if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements. The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish' and 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times. Each line represents occurrences in the description items of a feed for a day, for all 64 keywords.

So I think that the low frequencies are not due to stemming. Rather, I see it as the inevitable result of some keywords being popular in feeds of one category but not appearing in other feeds at all. Hence the sparseness shown in the results. Generic keywords may also be pertinent here. The other possibilities are differences in the numbers of feeds per category, where more feeds are in categories like NCA than S, or the keyword selection process itself being at fault.
You don't mention anything about stemming. In my opinion you could have better results if you performed word stemming and the WEKA evaluation was based on the keyword stems. For example, suppose that your WEKA model is built given a keyword 'surfing' and a new RSS feed contains the word 'surf'; there should be a match between these two words.

There are many freely available stemmers for several languages. For the English language, some available options for stemming are:

- the Porter stemmer
- stemming based on the WordNet dictionary

In case you would like to perform stemming using the WordNet dictionary, there are libraries and frameworks that integrate with WordNet. Below you can find some of them:

- MIT Java WordNet Interface (JWI)
- RiTa
- Java WordNet Library (JWNL)

EDITED after more information was provided

I believe that the key point in the specified case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases, so the challenge here is keyword/keyphrase extraction.

There are several books, papers and algorithms written about keyword/keyphrase extraction. The University of Waikato has implemented in Java a famous algorithm called the Keyword Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be used either for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License.

Another issue that should be taken into consideration is part-of-speech (POS) tagging. Nouns carry more information than the other POS tags, so you may get better results if you check the POS tag and the selected 64 words are mostly nouns.

In addition, according to Anette Hulth's published paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that keywords/keyphrases mostly have, or are contained in, one of the following five patterns:

- ADJECTIVE NOUN (singular or mass)
- NOUN NOUN (both sing. or mass)
- ADJECTIVE NOUN (plural)
- NOUN (sing. or mass) NOUN (pl.)
- NOUN (sing. or mass)

In conclusion, a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns when evaluating the new RSS feeds; a sketch of this follows below. You can use WordNet to find the POS tag for each word, and as mentioned above there are many libraries on the web that integrate with the WordNet dictionary. Of course stemming is also essential for the classification process and has to be maintained. I hope this helps.
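For instance, a minimal sketch of the noun-filtering step using NLTK; NLTK is my stand-in for illustration (the answer above suggests WordNet-based Java libraries such as JWI or JWNL), and the sample text is made up:

    import nltk

    # one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    text = "NASA shows new images of fish kills along the coast"
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)

    # keep only nouns (NN, NNS, NNP, NNPS) as keyword candidates
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    print(nouns)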
Try turning off stemming altogether. The authors of the Stanford Intro to IR provide a rough justification of why stemming hurts, or at the very least does not help, in text classification contexts.

I have tested stemming myself on a custom multinomial Naive Bayes text classification tool (I get accuracies of 85%). I tried the three Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I ran the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.

Some more things to try:

- Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
- Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words. (See the sketch after this list.)
- Increase the size of the training corpus.
- Try shingling to bi-grams, tri-grams, etc., and combinations of different n-grams (you're now using just unigrams).

There are a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%; 80% to 90% or better is common.
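A minimal sketch of the tf-idf and n-gram suggestions using scikit-learn; scikit-learn is my stand-in here (the original setup uses Weka/Lucene), and feed_texts is an illustrative variable:

    from sklearn.feature_extraction.text import TfidfVectorizer

    feed_texts = ["nasa shows new mission images",
                  "fish kill reported along the coast"]  # one string per feed/day

    # tf-idf weights over unigrams through trigrams, instead of raw tf on 64 words
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    X = vectorizer.fit_transform(feed_texts)
    print(X.shape)  # (n_records, n_features) - no fixed 64-word cap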