I'm building a ruby recipe management application, and as part of it, I want to be able to parse ingredient quantities into a form I can compare and scale. I'm wondering what the best tools are for doing this.
I originally planned on a complex regex, then some other code that converts human-readable numbers like two or five into integers, and finally code that converts, say, 1 cup and 3 teaspoons into some base measurement. I control the input, so I kept the actual ingredient separate. However, I noticed users inputting abstract measurements like to taste and 1 package. At least with the abstract measurements, I think I could just ignore them when scaling and simply scrape any number preceding them.
Here are some more examples:
1 tall can
1/4 cup
2 Leaves
1 packet
To Taste
One
Two slices
3-4 fillets
Half-bunch
2 to 3 pinches (optional)
Are there any tricks to this? I have noticed users seem somewhat confused about what constitutes a quantity. I could try to enforce stricter rules and push things like tall can and leaves into the ingredient part. However, in order to enforce that, I need to be able to convey what's invalid.
I'm also not sure what "base" measurement I should convert quantities into.
These are my goals:
To be able to scale recipes. Arbitrary units of measurement like packages don't have to be scaled, but precise ones like cups or ounces need to be.
Figure out the "main" ingredients. In the context of this question, this will be done largely by figuring out which ingredient is used in the largest quantity in the recipe. In production, there will have to be some sort of modifier based on the type of ingredient because, obviously, flour is almost never considered the "main" ingredient. However, chocolate can be used sparingly and the result can still be called a chocolate cake.
Normalize input. To keep some consistency on the site, I want to keep consistent abbreviations. For example, instead of pounds, it should be lbs.
You pose two problems: recognizing/extracting the quantity expressions (syntax), and figuring out what amounts they mean (semantics).
Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:
<amount> <unit> [of <ingredient>]
where <amount> can take many forms:
whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)
The amount can also be expressed as a range of two simple <amount>s:
two to three
2 to 3
2-3
five to 10
Then you have the units themselves:
general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")
Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:
a little
to taste
I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need to. It shouldn't be too hard to write regexps for all of those, if that's your tool of choice, but as you see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit "pieces" or whatever; for "a little" and the like, I'd treat them as special units).
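To make the shape of that concrete, here is a minimal sketch of such a parser (written in Python purely for illustration, even though the question is about Ruby; the number words and unit list are placeholder vocabularies you would expand, and ranges like "2-3" are left out):

import re
from fractions import Fraction

# Placeholder vocabularies; extend these as you see new inputs.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "half": Fraction(1, 2)}
UNITS = {"cup", "cups", "tsp", "tbsp", "oz", "lb", "lbs", "pinch", "pinches",
         "slice", "slices", "leaf", "leaves", "packet", "can", "bunch"}

def parse_amount(token):
    """Turn '1/4', '2', '0.75' or 'two' into a number, or None if it is not one."""
    token = token.lower()
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    try:
        return Fraction(token)       # handles "2", "3/4" and "0.75"
    except (ValueError, ZeroDivisionError):
        return None

def parse_quantity(text):
    """Return a rough (amount, unit, ingredient) triple."""
    words = text.strip().split()
    if not words:
        return (None, None, "")
    amount = parse_amount(words[0])
    rest = words[1:] if amount is not None else words
    if rest and rest[0].lower().rstrip(".,") in UNITS:
        unit, ingredient = rest[0].lower().rstrip(".,"), " ".join(rest[1:])
    else:
        unit, ingredient = "pieces", " ".join(rest)   # countables, "to taste", etc.
    return (amount, unit, ingredient)

print(parse_quantity("1/4 cup flour"))    # (Fraction(1, 4), 'cup', 'flour')
print(parse_quantity("Two slices bread")) # (2, 'slices', 'bread')
print(parse_quantity("To Taste"))         # (None, 'pieces', 'To Taste')

Ranges ("2 to 3", "3-4") and the special expressions would each get their own branch in a fuller version; the point is only that each line ends up as a triple you can scale and compare.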
That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country.
You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.
On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
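A small lookup table gets you most of the conversion and scaling behaviour described above. This is only a sketch: the gram/millilitre figures are approximate, and the informal units are the rough estimates mentioned earlier:

# Approximate grams or millilitres per unit; informal units are rough guesses.
GRAMS_PER_UNIT = {"lb": 453.6, "lbs": 453.6, "oz": 28.35, "kg": 1000.0, "g": 1.0,
                  "pinch": 0.2, "pinches": 0.2}
ML_PER_UNIT = {"cup": 236.6, "cups": 236.6, "tbsp": 14.8, "tsp": 4.9}

def to_base(amount, unit):
    """Convert to grams or millilitres where a conversion is known."""
    if unit in GRAMS_PER_UNIT:
        return (amount * GRAMS_PER_UNIT[unit], "g")
    if unit in ML_PER_UNIT:
        return (amount * ML_PER_UNIT[unit], "ml")
    return (amount, unit)              # packages, cans, etc. stay as they are

def scale(amount, unit, factor):
    """Scale a parsed quantity; subjective amounts pass through unchanged."""
    if amount is None:                 # "to taste", "a little"
        return (amount, unit)
    return (amount * factor, unit)

print(to_base(0.25, "cup"))          # (59.15, 'ml')
print(scale(None, "to taste", 10))   # (None, 'to taste') - still "to taste"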
One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.
Regular expressions are difficult to get right for natural language parsing. NLTK, like you mentioned, would probably be a good option to look into; otherwise you'll find yourself going around in circles trying to get the expressions right.
If you want something of the Ruby variety instead of NLTK, take a look at Treat:
https://github.com/louismullie/treat
Also, the Linguistics framework might be a good option as well:
http://deveiate.org/projects/Linguistics
EDIT:
I figured there had to already be a Ruby recipe parser out there; here's another option you might want to look into:
https://github.com/iancanderson/ingreedy
There is a lot of free training data available out there if you know how to write a good web scraper and parsing tool.
http://allrecipes.com/Recipe/Darias-Slow-Cooker-Beef-Stroganoff - This site seems to let you convert recipe quantities based on metric/imperial system and number of diners.
http://www.epicurious.com/tools/conversions/common - This site seems to have lots of conversion constants.
Some systematic scraping of existing recipe sites which present ingredients and procedures in some structured format (which you can discover by reading the underlying HTML) will help you build up a really large training data set, which will make taking on such a problem much, much easier.
When you have tons of data, even simple learning techniques can be pretty useful, and you can use standard NLP tricks (n-grams, tf-idf, naive Bayes, etc.) to quickly do awesome things.
For example:
Main Ingredient-ness
Ingredients in a dish with a higher idf (inverse document frequency) are more likely to be main ingredients. Every dish mentions salt, so it should have very low idf. A lot fewer dishes mention oil, so it should have a higher idf. Most dishes probably have only one main protein, so phrases like 'chicken', 'tofu', etc should be rarer and much more likely to be main ingredients than salt, onions, oil, etc. Of course there may be items like 'cilantro' which might be rarer than 'chicken', but if you had scraped out some relevant metadata along with every dish, you will have signals that will help you fix this issue as well. Most chefs might not be using cilantro in their recipes, but the ones that do probably use it quite a lot. So for any ingredient name, you can figure out the name's idf by first considering only the authors that have mentioned the ingredient at least once, and then seeing the ingredient's idf on this subset of recipes.
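As a minimal sketch of the idf idea, assuming you have already reduced each scraped recipe to a set of ingredient names (the recipes below are just placeholders):

import math
from collections import Counter

def ingredient_idf(recipes):
    """recipes: list of sets of ingredient names. Returns name -> idf."""
    doc_freq = Counter()
    for ingredients in recipes:
        doc_freq.update(set(ingredients))
    n = len(recipes)
    return {name: math.log(n / df) for name, df in doc_freq.items()}

recipes = [{"salt", "chicken", "oil"}, {"salt", "tofu"}, {"salt", "oil", "cilantro"}]
idf = ingredient_idf(recipes)
print(sorted(idf.items(), key=lambda kv: -kv[1]))
# salt is in every recipe, so its idf is 0; chicken/tofu/cilantro score highest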
Scaling recipes
Most recipe sites mention how many people a particular dish serves, and have a separate ingredients list with appropriate quantities for that number of people.
For any particular ingredient, you can collect all the recipes that mention it and see what quantity of the ingredient was prescribed for what number of people. This should tell you what phrases are used to describe quantities for that ingredient, and how the numbers scale. Also you can now collect all the ingredients whose quantities have been described using a particular phrase (e.g. 'slices' -> (bread, cheese, tofu,...), 'cup' -> (rice, flour, nuts, ...)) and look at the most common of these phrases and manually write down how they would scale.
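Sketched with made-up scraped rows of (ingredient, unit phrase, amount, servings), that aggregation might look something like this:

from collections import defaultdict

# Hypothetical scraped rows: (ingredient, unit phrase, amount, servings).
rows = [("rice", "cup", 2, 4), ("rice", "cup", 3, 6),
        ("bread", "slices", 4, 2), ("cheese", "slices", 2, 2)]

ingredients_per_unit = defaultdict(set)   # e.g. 'slices' -> {bread, cheese}
amount_per_person = defaultdict(list)     # how the quantity scales with diners
for ingredient, unit, amount, servings in rows:
    ingredients_per_unit[unit].add(ingredient)
    amount_per_person[(ingredient, unit)].append(amount / servings)

print(dict(ingredients_per_unit))
print({k: sum(v) / len(v) for k, v in amount_per_person.items()})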
Normalize Input
This does not seem like a hard problem at all. Manually curating a list of common abbreviations and their full forms (e.g 'lbs' -> 'pounds', 'kgs' -> 'kilograms', 'oz' -> 'ounces', etc) should solve 90% of the problem. Adding new contractions to this list whenever you see them should make this list pretty comprehensive after a while.
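A hedged sketch of that curated list (the direction of the mapping is a free choice; the question preferred abbreviations like lbs, this answer spells them out):

# Manually curated canonical forms; add new spellings as you meet them.
CANONICAL = {"lbs": "pounds", "lb": "pounds", "pound": "pounds",
             "kgs": "kilograms", "kg": "kilograms", "kilogram": "kilograms",
             "oz": "ounces", "ounce": "ounces",
             "tsp": "teaspoons", "teaspoon": "teaspoons"}

def normalize_unit(unit):
    u = unit.lower().rstrip(".")
    return CANONICAL.get(u, u)

print(normalize_unit("Lbs"))  # pounds
print(normalize_unit("oz."))  # ounces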
In summary, I am suggesting that you greatly increase the size of your data set, collect lots of relevant metadata along with each recipe you scrape (author info, food genre, etc.), and use all this structured data along with simple NLP/ML tricks to solve most of the problems you will face while trying to build an intelligent recipe site.
As far as these go:
I'd hard-code these so that if you get more than so many oz you go to cups, if you get more than so many cups you go to pints, liters, gallons, etc. I don't know how you can avoid this unless someone has already written code to handle it; a rough sketch follows after these points.
If an ingredient is in the title, it's probably the main ingredient. You'll run into issues with "Oatmeal Raisin Cookies", though. As you've stated, flour, milk, etc. aren't the main ingredient. You'll also possibly need to map bacon, pork chop, and pork roast all to pork, and steak, hamburger, etc. to beef.
Again, this is just a lookup table; you know people are going to write lbs, oz, etc., so try to preempt them and write this as best you can. You might miss some, but as your site grows you'll be able to introduce new filters.
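For the first point, here is a rough sketch of that hard-coded promotion of volumes to larger units (approximate US measures, and only the volume dimension shown; the unit names and thresholds are my own choices):

# Promote a volume to the largest unit that keeps the number at or above 1.
# Thresholds are in teaspoons; approximate US measures.
VOLUME_IN_TSP = [("gallon", 768), ("quart", 192), ("pint", 96),
                 ("cup", 48), ("tbsp", 3), ("tsp", 1)]

def humanize_volume(tsp):
    for unit, size in VOLUME_IN_TSP:
        if tsp >= size:
            return (round(tsp / size, 2), unit)
    return (tsp, "tsp")

print(humanize_volume(144))  # (1.5, 'pint')
print(humanize_volume(6))    # (2.0, 'tbsp')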
If you go through all this work, consider releasing it so others don't have to :)
Related
I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction from my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets that are available for this are not very large. Should I just use a much larger set of pretrained word vectors?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its stock/financial sense rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
Because this iterative, comparative, experimental pattern repeats endlessly as your projects & knowledge grow – it's what the experts do! – it's also important to learn and practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors – a somewhat more advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If you are new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell whether they're helping or not.
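For instance, a simple non-word2vec baseline along those lines could be bag-of-words features plus a linear classifier. The scikit-learn names below are real, but the posts and labels are placeholders; in practice you'd use your labeled dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder labeled data; in practice you need far more posts.
posts = ["stock is going to the moon", "terrible earnings, selling everything",
         "great quarter, very bullish", "bearish on this one"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(posts, labels)
print(model.predict(["looking very bullish today"]))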
Say I have two sentences, which are similar except there is only one different word with opposite meaning. e.g. "I like her" vs. "I hate her".
word2vec is used in my classification project. As far as I know, word2vec seems unable to figure out differences between antonyms. Is there any way to solve this?
Unfortunately, what we consider 'antonyms' are usually quite similar in word2vec coordinate spaces. That's because such words are quite similar in almost all respects – except for the one contrast they emphasize.
And further, to the extent those contrasts may be captured by the word2vec orientations, they will be in many varied directions. The 'hot'-vs-'cold' contrast will be different from the 'light'-vs-'dark' and the 'small'-vs-'big'.
There might be some analytic technique on sets of word-vectors that helps discover antonymic directions/pairs, but I haven't noticed one discussed, especially not anything that's simple/intuitive or applicable to general word-vector sets. (Once you do know words are opposites, as when consulting prior labeled lexicons or analogy questions, then the directions-between-their-word-vectors can be useful in other analysis, like discovering other words that contrast-in-the-same-way, as when solving analogy problems.)
Can you be more specific about your ultimate goal, with more example of the kinds of input you'll have and what specific results you want software to report?
The one example you give, "I like her" vs "I hate her", could be more generally seen as a sentiment classification, and word2vec-powered classifiers can do OK (though far from perfect) on such challenges. That is, with enough labeled training data, a classifier with a lot of examples of "positive" and "negative" texts will tend to learn that 'like' (and similar words) are positive and 'hate' (and similar) are negative, and do OK on other variants of positive/negative statements (excepting more complex constructions, like negations, subtle qualifications, understatement, irony, etc.)
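One common way such word2vec-powered classifiers get per-text features is by simply averaging the vectors of the words in each text; a rough sketch, assuming an already-trained gensim Word2Vec model named model:

import numpy as np

def text_vector(model, text):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    words = [w for w in text.lower().split() if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

# features = [text_vector(model, t) for t in texts]  # then feed any classifier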
So more info on what exactly you hope to detect/report, and what you've tried and found insufficient, might generate more ideas on how to achieve it.
Hello, I am fairly new to word2vec. I wrote a small program to teach myself:
import gensim
from gensim.models import Word2Vec
sentence = [['Yellow', 'Banana'], ['Red', 'Apple'], ['Green', 'Tea']]
model = gensim.models.Word2Vec(sentence, min_count=1, size=300, workers=4)
print(model.similarity('Yellow', 'Banana'))
The similarity came out to be:
-0.048776340629810115
My question is: why isn't the similarity between banana and yellow closer to 1, like 0.70 or something? What am I missing? Kindly guide me.
Word2Vec doesn't work well on toy-sized examples – it's the subtle push-and-pull of many varied examples of the same words that moves word-vectors to useful relative positions.
But also, especially, in your tiny tiny example, you've given the model 300-dimensional vectors to work with, and only a 6-word vocabulary. With so many parameters, and so little to learn, it can essentially 'memorize' the training task, quickly becoming nearly-perfect in its internal prediction goal – and further, it can do that in many, many alternate ways, that may not involve much change from the word-vectors random initialization. So it is never forced to move the vectors to a useful position that provides generalized info about the words.
You can sometimes get somewhat meaningful results from small datasets by shrinking the vectors, and thus the model's free parameters, and giving the model more training iterations. So you could try size=2, iter=20. But you'd still want more examples than just a few, and more than a single occurrence of each word. (Even in larger datasets, the vectors for words with just a small number of examples tend to be poor - hence the default min_count=5, which should be increased even higher in larger datasets.)
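Applied to the snippet in the question, that adjustment would look roughly like this (parameter names as in the gensim version the question used; current gensim releases spell them vector_size and epochs, and expose similarity via model.wv.similarity):

import gensim

sentences = [['Yellow', 'Banana'], ['Red', 'Apple'], ['Green', 'Tea']]

# Tiny vectors and more passes give the toy corpus at least some chance;
# meaningful results still need far more, and repeated, examples of each word.
model = gensim.models.Word2Vec(sentences, min_count=1, size=2, iter=20, workers=4)
print(model.similarity('Yellow', 'Banana'))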
To really see word2vec in action, aim for a training corpus of millions of words.
I want to create my own simple recommendation system for books. But there is a problem - it's impossible (or at least very hard) for one person to put together a training data set for the algorithms.
So, are there any free data sets or surveys with information about people's votes - which books they like and how much?
My second question is about a book's parameters. For some item-based predictions it is really a must to use a book's attributes (e.g. language, average word length, average number of words per paragraph - I have counted about 30 parameters like those) and their weights (for example, a book's language weighted at 1 point, and average word length at 0.314). So, is there any prepared information about that?
In fact, if I got an answer to the first question, I could find a solution to the second, but I am sure the needed information already exists.
Also, I am reading the Recommender Systems Handbook; it gives full information (with references), but it is hard to read. Can you recommend some additional books on this topic?
Could you check Books.txt.gz at:
https://snap.stanford.edu/data/web-Amazon.html
which consists of book ratings from Amazon. It also has product title, price, review summary, etc.
Also, the Book-Crossing dataset might be useful:
http://grouplens.org/datasets/book-crossing/
I guess your second question is a feature selection problem, and the weights will be different for each dataset.
This course at Coursera gives brief information about recommendation systems, and it also has a reading list. Unfortunately the quizzes are no longer available.
Course: https://www.coursera.org/course/recsys
Readings: http://recsys.cs.umn.edu/readings.html
Edit:
Yet another dataset for books.
Goodbooks:
http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/
This dataset is about movies rather than books, but you might find the Netflix Prize dataset useful as a way of testing recommendation algorithms. The underlying issues are the same with both datasets: needing out-of-band features, having to combine features with different weights, etc.
As for extra books to read, I recommend "Programming Collective Intelligence." I found it to be clearly written and very helpful. It also includes code for all of the example algorithms.
This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well - quick and reasonably accurate.
The method I used is the following: I have a "learning" Perl script that identifies the most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the C++ program, which checks the given text, accumulates a score for each language based on the words found, and then simply checks which language accumulated the highest score.
Now I would like to make it even better and work a bit on the quality of identification. The problem is that I often get "unknown" as the result (many languages accumulate a small score, but none bigger than my threshold). After some debugging and research I found out that this is probably due to the fact that all words are considered equal. This means that seeing a "#include", for example, has the same effect as seeing a "while" - both of which indicate that it might be C/C++ (I'm ignoring for now the fact that "while" is used in many other languages) - but of course in larger .cpp files there might be a ton of "while" and usually only a few "#include".
So the fact that "#include" is more important is ignored, because I could not come up with a good way to identify whether one word is more important than another. Bear in mind that the script which creates the data is fairly stupid: it's only a word histogram, and for every chosen word it assigns a score of 1. It does not even look at the words themselves (so if "#&|?/" appears in a file very often, it might get chosen as a good word).
I would also like to have the data-creation part fully automated, so nobody should have to look at the data and alter it, change scores, change words, etc. All the "brainz" should be in the script and the C++ program.
Does somebody have a suggestion for how to identify keywords or, more generally, important words? Some things that might help: I have the number of occurrences of each word and the total number of words (so a ratio may be calculated). I have also thought about stripping characters like ";", since the histogram script often puts, for example, "continue;" in the result when the important word is "continue". Last note: all equality checks are exact matches - no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, I don't know)...
NOTE: Thanks to all who bothered to answer; you helped me a lot.
My work on this is almost finished, so I will describe what I did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources, to avoid coding-style bias.
2) Write a Perl script that does a word histogram. Implement a blacklist and a whitelist (more on those below).
3) Add bogus words to the blacklist, like "license", "the", etc. These are often found at the start of a file in license information.
4) Add the roughly five most important words per language to the whitelist. These are words that are found in most source code of a given language but are not frequent enough to get into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines.
6) When doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be OK for most languages, I think (it gives me the best results). For each language, keep about the 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist.
8) Write a program which processes a text file in the same way as the script - tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points; those which belong to multiple languages should add fewer.
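For illustration only, here is a condensed version of steps 6-8 in a single Python sketch (the real implementation was the Perl script plus the C++ program described above; keyword_table is a hypothetical structure mapping each learned word to the languages it came from):

import re
from collections import Counter

TOKEN_RE = re.compile(r"[\s(){}\[\];.,=]+")   # same split rule as in step 6

def histogram(text, top_n=20):
    """Step 6: the 10-20 most frequent words of one training file."""
    words = [w for w in TOKEN_RE.split(text) if w]
    return [w for w, _ in Counter(words).most_common(top_n)]

def score(text, keyword_table):
    """Step 8: keyword_table maps word -> set of languages it was learned from."""
    totals = Counter()
    for word in TOKEN_RE.split(text):
        langs = keyword_table.get(word)
        if not langs:
            continue
        points = 2 if len(langs) == 1 else 1   # exclusive words count more
        for lang in langs:
            totals[lang] += points
    return totals.most_common(1)[0] if totals else ("unknown", 0)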
Comments are welcome. Currently, on about 1000 text files I get 80 unknowns (mostly on extremely short files - mainly JavaScript with just one or two lines). About 20 files are recognized wrongly. The files average about 11 kB in size, ranging from 100 bytes to 100 kB (almost 11 MB total). It takes one second to process them all, which is good enough for me.
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672,000 hits, in Python only ~5,000.
You can normalize the results by looking at the number of results for the language in total:
C++ gives about 770,000 files whereas Python returns 120,000.
Thus "#include" is extremely rare in Python files, but exists in almost every C++ file. (Now you still have to learn to distinguish C++ and C of course.) All that is left is to do the correct reasoning about probabilities.
You need to get some exclusiveness into your lookup data.
When teaching it the programming languages you expect, you should search for words typical of one or a few languages. If a word appears in several code files of the same language but in few or none of the files of other languages, it's a strong suggestion of that language.
So the score of a word can be calculated at lookup time by selecting the words that are exclusive to a language or a group of languages. Find several of these words, take the intersection of the languages they suggest by adding up their scores, and you will have found your language.
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.
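A hedged sketch of that combination (a naive Bayes classifier over whitespace-separated tokens), using scikit-learn rather than the Perl/C++ setup from the question; the training snippets are placeholders standing in for your 30-50 files per language:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training snippets; use real files per language in practice.
samples = ["#include <stdio.h>\nint main() { return 0; }",
           "def main():\n    print('hi')",
           "public class Main { public static void main(String[] a) {} }"]
languages = ["c", "python", "java"]

# token_pattern=r"\S+" keeps tokens like '#include' and '{' as features.
clf = make_pipeline(CountVectorizer(token_pattern=r"\S+", lowercase=False),
                    MultinomialNB())
clf.fit(samples, languages)
print(clf.predict(["#include <vector>\nint x;"]))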