Data set for recommendation system - data-mining

I want to create my own simple recommendation system for books. But there is a problem: it's impossible (or at least very hard) for one person to put together a training data set for the algorithms.
So, are there any free data sets or surveys with information about people's votes, i.e. which books they read and how much they liked them?
My second question is about book parameters. For some item-based predictions it is really necessary to use the books' features (e.g. language, average word length, average number of words per paragraph; I have counted about 30 parameters like these) and their weights (for example, the book's language weighted at 1 point and the average word length at 0.314). So, is there any prepared information about that?
In fact, if I get an answer to the first question, I could work out a solution to the second one, but I am sure the needed information already exists.
Also, I am reading the Recommender Systems Handbook; it gives thorough information (with references), but it is hard to read. Can you recommend some additional books on this topic?

Could you check Books.txt.gz at
https://snap.stanford.edu/data/web-Amazon.html
which contains book ratings from Amazon? It also has the product title, price, review summary, etc.
The Book-Crossing dataset might also be useful:
http://grouplens.org/datasets/book-crossing/
I guess your second question is a feature selection problem, and the weights will be different for each dataset.
This course at Coursera gives a brief introduction to recommendation systems, and it also has a reading list. Unfortunately the quizzes are no longer available.
Course: https://www.coursera.org/course/recsys
Readings: http://recsys.cs.umn.edu/readings.html
Edit:
Yet another dataset for books.
Goodbooks:
http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/
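Once you have one of these ratings dumps, even a tiny item-based similarity computation gets you started. A minimal sketch in Python (the file name, separator, and column names are assumptions; adjust them to whichever dataset you download):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical file and columns -- adapt to the dataset you actually use
# (e.g. the Book-Crossing dump uses its own separator and header names).
ratings = pd.read_csv("ratings.csv", names=["user_id", "book_id", "rating"])

# Build a user x book matrix; missing ratings become 0 in this rough sketch.
matrix = ratings.pivot_table(index="user_id", columns="book_id",
                             values="rating", fill_value=0)

# Item-based similarity: compare books by the users who rated them.
sim = cosine_similarity(matrix.T)
sim_df = pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)

def similar_books(book_id, n=5):
    # Return the n books most similar to book_id, excluding itself.
    return sim_df[book_id].drop(book_id).nlargest(n)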

This dataset is about movies rather than books, but you might find the Netflix Prize dataset useful as a way of testing recommendation algorithms. The underlying issues are the same with both datasets: needing out-of-band features, having to combine features with different weights, etc.
As for extra books to read, I recommend "Programming Collective Intelligence." I found it to be clearly written and very helpful. It also includes code for all of the example algorithms.


Sentiment analysis feature extraction

I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction on my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets available for this are not very large. Should I just use a much larger pretrained set of word vectors instead?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches and compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain of interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its stock/financial sense rather than the more general sense of the word.
But the quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important for evaluating new posts may be missing or of very poor quality. In some cases, taking a pretrained set of vectors, with its larger vocabulary and sharper (but slightly domain-mismatched) word senses, may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
This iterative, comparative, experimental pattern repeats endlessly as your projects and knowledge grow (it's what the experts do!), so it's also important to learn and practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors, which are a somewhat more advanced technique and add complexity. (In particular, word vectors only give you features for individual words, not for texts of many words, unless you add some other choices/steps.) If you are new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell whether they're helping or not.
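For instance, a baseline along these lines (scikit-learn; load_labelled_posts is a hypothetical loader for your own labelled data) gives you something concrete to beat before adding word-vector features:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical loader: returns a list of post texts and parallel 0/1 sentiment labels.
posts, labels = load_labelled_posts()

# Plain word/bigram TF-IDF features plus a linear classifier as the baseline.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# A repeatable quantitative evaluation: cross-validated accuracy.
scores = cross_val_score(model, posts, labels, cv=5)
print("baseline accuracy: %.3f" % scores.mean())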

Best way to feature select using PCA (discussion)

Terminology:
Component: PC (principal component)
loading_score[i,j]: the loading of feature j in PC[i]
Question:
I know that questions about feature selection have been asked several times here on StackOverflow (SO) and on other tech pages, and that they have drawn different answers and discussions. That is why I want to open a discussion about the different solutions rather than post it as a general question, since that has already been done.
Different methods are proposed for feature selection using PCA: for instance, using the dot product between the original features and the components (here) to get their correlation; a discussion at SO here suggests that you can only talk about important features as loading scores within a component (and not use that importance in the input space); and another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be sum_i abs(loading_score[i,j]), i.e. the sum of the absolute values of loading_score[i,j] over all components i.
I personally would think that a way to get the importance of a feature would be an absolute sum where each loading_score[i,j] is weighted by the explained variance of component i, i.e.
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
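For concreteness, a minimal sketch of how I would compute this with scikit-learn (X is assumed to be the data matrix; components_[i, j] plays the role of loading_score[i, j], and I use the explained variance ratio as the weight):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is assumed to be your (n_samples, n_features) data matrix.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Weight each absolute loading by its component's explained variance ratio
# and sum over components: importance[j] = sum_i |components_[i, j]| * evr[i].
importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_

ranking = np.argsort(importance)[::-1]  # feature indices, highest score first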
Well, there is no universal way to select features; it totally depends on the dataset and some insights available about the dataset. I will provide some examples which might be helpful.
Since you asked about PCA: it decomposes the dataset into orthogonal components ordered by how much of the variance they explain. ICA (Independent Component Analysis), on the other hand, extracts statistically independent components. Look at this example:
In this example, we mix three independent signals and try to separate them using ICA and PCA. In this case, ICA does a better job than PCA. In general, if you search for Blind Source Separation (BSS) you may find more information about this. Note that in this example we know the independent components, so separation is easy; in general we do not know the number of components, so you may have to guess based on some prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
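A rough sketch of that signal-mixing comparison with scikit-learn (the specific signals and mixing matrix are only illustrative):

import numpy as np
from sklearn.decomposition import FastICA, PCA

# Three independent source signals (sine, square-ish, noise), then mix them.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), rng.uniform(size=t.size)]

mixing = np.array([[1.0, 1.0, 1.0],
                   [0.5, 2.0, 1.0],
                   [1.5, 1.0, 2.0]])
mixed = sources @ mixing.T  # observed mixtures, shape (n_samples, 3)

# ICA tries to recover the independent sources; PCA only finds orthogonal
# directions of maximal variance, so it generally cannot un-mix them.
ica_recovered = FastICA(n_components=3, random_state=0).fit_transform(mixed)
pca_projected = PCA(n_components=3).fit_transform(mixed)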
Once you extract the components using any of these techniques, you can visualize them by treating the extracted components as random variables, i.e., x, y, z.
For more information, you may refer to the original source from which the two figures were taken.
Coming back to your proposition,
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
I would not recommend this way due to the following reasons:
Taking abs(loading_score[i,j]) means you may lose the positive or negative correlations of the considered features. explained_variance[i] may be used to find the correlation between features, but multiplying by it does not make much sense.
Edit:
In PCA, each component has its explained variance. The explained variance is the ratio between an individual component's variance and the total variance (the sum of all individual components' variances). Feature significance can be measured by the magnitude of explained variance.
All in all, what I want to say is that feature selection totally depends on the dataset and the significance of the features. PCA is just one technique. First understand the properties of the features and the dataset, then try to extract features. Hope this helps. If you can provide us with a concrete example, we may be able to offer more insight.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I have read about some existing compression models that are widely used, and I found out that they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is a rather simple set of rules to be followed while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge of the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent that I feel ashamed of sharing it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
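To see the point from answer 1 empirically, here is a rough sketch comparing a zeroth-order word-rank encoding against simply running a general-purpose compressor over the raw text (the corpus path is a placeholder; use any large plain-text file):

import collections
import zlib

text = open("corpus.txt", encoding="utf-8").read()  # placeholder path
words = text.split()

# Zeroth-order "word rank" scheme: replace each word by its frequency rank.
ranks = {w: r for r, (w, _) in enumerate(collections.Counter(words).most_common())}
rank_encoded = " ".join(str(ranks[w]) for w in words).encode("utf-8")

# A standard general-purpose compressor on the raw bytes, for comparison.
deflated = zlib.compress(text.encode("utf-8"), 9)

print("original bytes:    ", len(text.encode("utf-8")))
print("rank-encoded bytes:", len(rank_encoded))  # and you would still need to ship the dictionary!
print("zlib bytes:        ", len(deflated))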
Do you mean having a ranking table of words sorted by frequency and assigning smaller "symbols" to the words that are repeated the most, thereby reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you try to compress follows a particular pattern/distribution, then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses letter frequencies. You can do the same with words, or with letter frequencies in more dimensions, i.e. combinations of letters and their frequencies.
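A rough sketch of word-level Huffman coding, using only the Python standard library (heapq builds the code tree from word frequencies):

import collections
import heapq

def huffman_codes(words):
    # Build a word -> bitstring code table from word frequencies.
    freqs = collections.Counter(words)
    # Heap entries: (frequency, tie-breaker, {word: code-so-far}).
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

words = "the cat sat on the mat the cat".split()
codes = huffman_codes(words)
encoded = "".join(codes[w] for w in words)  # frequent words get shorter codes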

Parsing natural language ingredient quantities for recipes [closed]

I'm building a Ruby recipe management application, and as part of it, I want to be able to parse ingredient quantities into a form I can compare and scale. I'm wondering what the best tools are for doing this.
I originally planned on a complex regex, then some other code that converts human-readable numbers like "two" or "five" into integers, and finally code that converts, say, 1 cup and 3 teaspoons into some base measurement. I control the input, so I kept the actual ingredient separate. However, I noticed users inputting abstract measurements like "to taste" and "1 package". At least with the abstract measurements, I think I could just leave the unit alone and scale any number preceding it.
Here are some more examples:
1 tall can
1/4 cup
2 Leaves
1 packet
To Taste
One
Two slices
3-4 fillets
Half-bunch
2 to 3 pinches (optional)
Are there any tricks to this? I have noticed users seem somewhat confused about what constitutes a quantity. I could try to enforce stricter rules and push things like "tall can" and "leaves" into the ingredient part. However, in order to enforce that, I need to be able to convey what's invalid.
I'm also not sure what "base" measurement I should convert quantities into.
These are my goals:
To be able to scale recipes. Arbitrary units of measurement like packages don't have to be scaled, but precise ones like cups or ounces need to be.
Figure out the "main" ingredients. In the context of this question, this will be done largely by figuring out what the largest ingredient is in the recipe. In production, there will have to be some sort of modifier based on the type of ingredient because, obviously, flour is almost never considered the "main" ingredient. However, chocolate can be used sparingly, and it can still be said a chocolate cake.
Normalize input. To keep some consistency on the site, I want to keep consistent abbreviations. For example, instead of pounds, it should be lbs.
You pose two problems: recognizing/extracting the quantity expressions (syntax) and figuring out what amount they mean (semantics).
Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:
<amount> <unit> [of <ingredient>]
where <amount> can take many forms:
whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)
The amount can also be expressed as a range of two simple <amount>s:
two to three
2 to 3
2-3
five to 10
Then you have the units themselves:
general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")
Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:
a little
to taste
I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need to. It shouldn't be too hard to write regexps for all of those, if that's your tool of choice, but as you see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit "pieces" or whatever; for "a little" and the like, I'd treat them as special units).
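To make that concrete, here is a very rough sketch of such a parser in Python (the grammar itself is language-agnostic, and the word lists and unit table are deliberately tiny placeholders):

import re
from fractions import Fraction

# Tiny illustrative tables -- extend these for real use.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "half": Fraction(1, 2)}
UNITS = {"cup", "cups", "tsp", "tbsp", "oz", "lb", "lbs", "slices", "leaves",
         "packet", "package", "pinch", "pinches", "can", "cans"}
SUBJECTIVE = {"to taste", "a little", "some"}

def parse_quantity(text):
    # Return a rough (amount, unit) pair, or ("subjective", phrase).
    t = text.strip().lower()
    if t in SUBJECTIVE:
        return ("subjective", t)
    # Ranges like "2-3 fillets" or "2 to 3 pinches": keep both endpoints.
    m = re.match(r"(\d+)\s*(?:-|to)\s*(\d+)\s*(\w+)?", t)
    if m:
        lo, hi, unit = m.groups()
        return ((int(lo), int(hi)), unit if unit in UNITS else "pieces")
    # Simple "amount unit" with digits, fractions, or number words.
    m = re.match(r"([\d./]+|\w+)\s*(\w+)?", t)
    if m:
        raw, unit = m.groups()
        if "/" in raw:
            amount = Fraction(raw)
        elif raw.isdigit():
            amount = int(raw)
        else:
            amount = NUMBER_WORDS.get(raw)
        return (amount, unit if unit in UNITS else "pieces")
    return (None, None)

# e.g. parse_quantity("1/4 cup") -> (Fraction(1, 4), "cup")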
That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country.
You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.
On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
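A hedged sketch of what the conversion/scaling step could look like, with a tiny hand-made table (the base-unit equivalents are approximate and only illustrative):

# Approximate base-unit equivalents -- illustrative values only.
TO_GRAMS = {"oz": 28.35, "lb": 453.6, "lbs": 453.6, "kg": 1000.0, "g": 1.0}
TO_ML = {"cup": 240.0, "tbsp": 15.0, "tsp": 5.0}  # informal units need their own lookups

def scale(amount, unit, factor):
    # Scale a quantity; subjective amounts pass through unchanged.
    if unit in ("to taste", "subjective", "a little"):
        return (amount, unit)                       # scaling never changes these
    if unit in TO_GRAMS:
        return (amount * TO_GRAMS[unit] * factor, "g")
    if unit in TO_ML:
        return (amount * TO_ML[unit] * factor, "ml")
    return (amount * factor, unit)                  # countable/unknown units: just multiply

# e.g. scale(4, "oz", 2) -> (226.8, "g"); scaling "to taste" by 10 leaves it unchanged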
One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.
Regular expressions are difficult to get right for natural language parsing. NLTK, as you mentioned, would probably be a good option to look into; otherwise you'll find yourself going around in circles trying to get the expressions right.
If you want something of the Ruby variety instead of NLTK, take a look at Treat:
https://github.com/louismullie/treat
Also, the Linguistics framework might be a good option as well:
http://deveiate.org/projects/Linguistics
EDIT:
I figured there had to already be a Ruby recipe parser out there, here's another option you might want to look into:
https://github.com/iancanderson/ingreedy
There is a lot of free training data available out there if you know how to write a good web scraper and parsing tool.
http://allrecipes.com/Recipe/Darias-Slow-Cooker-Beef-Stroganoff - This site seems to let you convert recipe quantities based on metric/imperial system and number of diners.
http://www.epicurious.com/tools/conversions/common - This site seems to have lots of conversion constants.
Some systematic scraping of existing recipe sites that present ingredients and procedures in a structured format (which you can discover by reading the underlying HTML) will help you build up a really large training data set, which will make taking on such a problem much, much easier.
When you have tons of data, even simple learning techniques can be pretty useful. Once you have a lot of data, you can use standard NLP tricks (n-grams, tf-idf, naive Bayes, etc.) to quickly do awesome things.
For example:
Main Ingredient-ness
Ingredients in a dish with a higher idf (inverse document frequency) are more likely to be main ingredients. Every dish mentions salt, so it should have very low idf. A lot fewer dishes mention oil, so it should have a higher idf. Most dishes probably have only one main protein, so phrases like 'chicken', 'tofu', etc should be rarer and much more likely to be main ingredients than salt, onions, oil, etc. Of course there may be items like 'cilantro' which might be rarer than 'chicken', but if you had scraped out some relevant metadata along with every dish, you will have signals that will help you fix this issue as well. Most chefs might not be using cilantro in their recipes, but the ones that do probably use it quite a lot. So for any ingredient name, you can figure out the name's idf by first considering only the authors that have mentioned the ingredient at least once, and then seeing the ingredient's idf on this subset of recipes.
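A rough sketch of that idf computation over scraped recipes (recipes is assumed to be a list of sets of ingredient names):

import math
from collections import Counter

def ingredient_idf(recipes):
    # recipes: list of sets of ingredient names, e.g. [{"salt", "chicken", "onion"}, ...]
    # Higher idf = rarer across recipes = more likely to be a "main" ingredient.
    n = len(recipes)
    doc_freq = Counter(ing for recipe in recipes for ing in recipe)
    return {ing: math.log(n / df) for ing, df in doc_freq.items()}

def likely_main(recipe, idf):
    # Crude main-ingredient guess: the ingredient with the highest idf.
    return max(recipe, key=lambda ing: idf.get(ing, 0.0))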
Scaling recipes
Most recipe sites mention how many people a particular dish serves, and have a separate ingredients list with appropriate quantities for that number of people.
For any particular ingredient, you can collect all the recipes that mention it and see what quantity of the ingredient was prescribed for what number of people. This should tell you what phrases are used to describe quantities for that ingredient, and how the numbers scale. Also you can now collect all the ingredients whose quantities have been described using a particular phrase (e.g. 'slices' -> (bread, cheese, tofu,...), 'cup' -> (rice, flour, nuts, ...)) and look at the most common of these phrases and manually write down how they would scale.
Normalize Input
This does not seem like a hard problem at all. Manually curating a list of common abbreviations and their full forms (e.g. 'lbs' -> 'pounds', 'kgs' -> 'kilograms', 'oz' -> 'ounces', etc.) should solve 90% of the problem. Adding new contractions to this list whenever you see them should make it pretty comprehensive after a while.
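For example, a tiny normalization pass (the mapping is only a starting point, and you can flip its direction if you prefer abbreviations as the canonical form):

import re

# Starter mapping -- extend it whenever a new contraction shows up.
ABBREVIATIONS = {"lbs": "pounds", "lb": "pounds", "kgs": "kilograms",
                 "kg": "kilograms", "oz": "ounces", "tsp": "teaspoons"}

def normalize(text):
    # Replace known unit abbreviations with their full forms.
    def swap(match):
        return ABBREVIATIONS.get(match.group(0).lower(), match.group(0))
    return re.sub(r"[A-Za-z]+", swap, text)

# normalize("2 lbs flour") -> "2 pounds flour"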
In summary, I am suggesting that you greatly increase the size of your data, collect lots of relevant metadata along with each recipe you scrape (author info, food genre, etc.), and use all this structured data along with simple NLP/ML tricks to solve most of the problems you will face while trying to build an intelligent recipe site.
As far as these go:
I'd hard-code these so that if you get more than so many ounces you go to cups, if you get more than so many cups you go to pints, liters, gallons, etc. I don't know how you can avoid this unless someone has already written code to handle it.
If an ingredient is in the title, it's probably the main ingredient. You'll run into issues with "Oatmeal Raisin Cookies", though. As you've stated, flour, milk, etc. aren't the main ingredient. You'll also need to map bacon, pork chop, and pork roast all to pork, and possibly steak, hamburger, etc. to beef.
Again, this is just a lookup on the amount of something. You know people are going to use lbs, oz, etc., so try to preempt them and write this as best you can. You might miss some, but as your site grows you'll be able to introduce new filters.
If you go through all this work, consider releasing it so others don't have to :)

algorithm for data quality in a data warehouse

I'm looking for a good algorithm / method to check the data quality in a data warehouse.
Therefore I want some algorithm that "knows" the possible structure of the values, checks whether each value conforms to this structure, and then decides whether it is correct or not.
I thought about defining a regexp and then checking whether each value fits it.
Is this a good way? Are there some good alternatives? (Any research papers?)
I have seen some authors suggest adding a special dimension, called a data quality dimension, to describe each fact-table record further.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
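A rough sketch of how such a flag could be assigned when loading fact records (the pattern and bounds are placeholders for whatever rules fit your data):

import re

# Placeholder rules -- replace with the structure you actually expect.
POSTCODE_PATTERN = re.compile(r"^\d{5}$")
AMOUNT_BOUNDS = (0.0, 100000.0)

def quality_flag(record):
    # Return a data quality dimension value for one fact-table record.
    if not POSTCODE_PATTERN.match(str(record.get("postcode", ""))):
        return "Out-of-bounds value"
    lo, hi = AMOUNT_BOUNDS
    if not (lo <= record.get("amount", 0.0) <= hi):
        return "Unlikely value"
    return "Normal value"

# e.g. quality_flag({"postcode": "1234", "amount": 50.0}) -> "Out-of-bounds value"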
I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.
You need a tool that not only checks strict rules like constraints, but also gives you a profile of your data and makes it easy for you to explore and identify inconsistencies on your own. Try for example the "Pattern finder", which will tell you the patterns of your string values; something that will often reveal the outliers and erroneous values. You can also use the tool to actually cleanse the data, by transforming values, extracting information from them, or enriching them using third-party services. Good luck improving your data quality!
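If you want to prototype the idea behind a pattern finder before adopting a tool, a toy imitation (only a rough sketch, not DataCleaner's actual implementation) is a few lines:

from collections import Counter

def value_pattern(value):
    # Map every digit to '9' and every letter to 'A', keeping other characters.
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile(values):
    # Count how often each pattern occurs; rare patterns are likely outliers.
    return Counter(value_pattern(v) for v in values)

# profile(["555-1234", "555-9876", "55x-1234"]) ->
#   Counter({"999-9999": 2, "99A-9999": 1})  -- the rare pattern flags the odd value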