When is an event too rare for predictive modelling to be worthwhile? - data-mining

Background
I built a complaints management system for my company. It works fine. I'm interested in using the data it contains to do predictive modelling on complaints. We have ~40,000 customers of whom ~400 have complained.
Problem
I want to use our complaints data to model the probability that any given customer will complain. My concern is that a model giving each customer a probability of 0.000 of complaining would already be 99% accurate and thus hard to improve upon. Is it even possible to build a useful predictive model of the kind I describe, given such a rare event and so little data?

That is why there are alternative measures besides plain accuracy.
Here, recall is probably what you are interested in. And in order to balance precision and recall, the F1 score is a popular mixture that takes both into account.
But in general, avoid trying to break everything down into a single number.
It's a one-dimensional result, and too much of a simplification. In practice, you will want to study the errors in detail, to keep systematic errors from creeping in.
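A minimal pure-Python sketch of why accuracy misleads here. The 1% positive rate mirrors the ~400-in-40,000 complaint rate from the question; the second "model" and its error counts are invented for illustration:

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1,000 customers, 10 of whom complained (the last ten).
y_true = [0] * 990 + [1] * 10

# "Nobody ever complains": 99% accurate, yet finds no complainer.
y_all_zero = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, y_all_zero)) / len(y_true)
print(accuracy)                                  # 0.99
print(precision_recall_f1(y_true, y_all_zero))   # (0.0, 0.0, 0.0)

# A hypothetical model that flags 7 of the 10 complainers at the cost of
# 20 false alarms: less accurate (0.977), but actually useful.
y_model = [0] * 970 + [1] * 20 + [1] * 7 + [0] * 3
print(precision_recall_f1(y_true, y_model))      # recall 0.7, precision ~0.26
```

The second model loses on accuracy but catches 70% of complainers, which is the kind of trade-off a single accuracy number hides.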

Related

Sentiment analysis feature extraction

I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction on my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets available for this are not very large. Should I just use a much larger pretrained word vector set?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its stock/financial sense rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very poor quality. In some cases, taking a pretrained set of vectors, with its larger vocabulary & sharper (but slightly domain-mismatched) word-senses, may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
As this iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow (it's what the experts do!), it's also important to learn and practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors, which are a somewhat more advanced technique that adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.
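For concreteness, here is one of those extra "choices/steps" that word-vector features require: collapsing per-word vectors into a single text vector, commonly by averaging (a lossy choice). The tiny 3-d embeddings below are invented for illustration; real word2vec vectors have hundreds of dimensions:

```python
import numpy as np

# Hypothetical toy embedding table (made-up values).
word_vectors = {
    "stock":   np.array([0.9, 0.1, 0.0]),
    "crashed": np.array([-0.8, 0.2, 0.1]),
    "soared":  np.array([0.7, -0.3, 0.2]),
}

def text_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros_like(next(iter(vectors.values())))
    return np.mean(known, axis=0)

features = text_vector("the stock soared".split(), word_vectors)
print(features)   # mean of "stock" and "soared"; "the" is out of vocabulary
```

Note how out-of-vocabulary words are silently dropped; that is exactly the kind of hidden decision that makes word-vector pipelines harder to debug than simple bag-of-words baselines.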

What is considered a good accuracy for trained Word2Vec on an analogy test?

After training Word2Vec, how high should the accuracy be during testing on analogies? What level of accuracy should be expected if it is trained well?
The analogy test is just an interesting automated way to evaluate models, or compare algorithms.
It might not be the best indicator of how well word-vectors will work for your own project-specific goals. (That is, a model which does better on word-analogies might be worse for whatever other info-retrieval, or classification, or other goal you're really pursuing.) So if at all possible, create an automated evaluation that's tuned to your own needs.
Note that the absolute analogy scores can also be quite sensitive to how you trim the vocabulary before training, or how you treat analogy-questions with out-of-vocabulary words, or whether you trim results at the end to just higher-frequency words. Certain choices for each of these may boost the supposed "correctness" of the simple analogy questions, but not improve the overall model for more realistic applications.
So there's no absolute accuracy rate on these simplistic questions that should be the target. Only relative rates are somewhat indicative - helping to show when more data, or tweaked training parameters, seem to improve the vectors. But even vectors with small apparent accuracies on generic analogies might be useful elsewhere.
All that said, you can review a demo notebook like the gensim "Comparison of FastText and Word2Vec" to see what sorts of accuracies on the Google word2vec.c `questions-words.txt` analogy set (40-60%) are achieved under some simple defaults and relatively small training sets (100MB-1GB).
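For intuition, an analogy test answers "a is to b as c is to ?" by vector arithmetic and a nearest-neighbour lookup. A toy sketch with invented 2-d vectors arranged so the classic example works out (real evaluations use cosine similarity over full vector sets like `questions-words.txt`; Euclidean distance is used here for brevity):

```python
import numpy as np

# Invented 2-d vectors, not real embeddings.
vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

def analogy(a, b, c, vecs):
    """Answer 'a : b :: c : ?' as the nearest neighbour of b - a + c."""
    target = vecs[b] - vecs[a] + vecs[c]
    # Standard scoring excludes the three question words themselves.
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("man", "woman", "king", vecs))   # queen
```

An accuracy score is then just the fraction of such questions where the top candidate matches the expected word, which is why vocabulary-trimming choices can move the number so much.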

About optimized math functions, ranges and intervals

I'm trying to wrap my head around how people who code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is a high need for fast trigonometric functions in these fields. At times you can optimize a sin, a cos, or other functions by rewriting them in a different form that is valid only for a given interval; often this means that your approximation of f(x) covers just the first quadrant, i.e. 0 <= x <= pi/2.
Now the input to your f(x) still spans all 4 quadrants, but the real formula only covers 1/4 of that interval. The straightforward solution is to detect the quadrant by analyzing the input to see which range it belongs to, then adjust the result of the formula accordingly if the input comes from a quadrant other than the first.
This is good in theory, but it also presents a couple of really bad problems, especially considering that you are doing all this to steal a couple of cycles from your CPU (you also get a consistent implementation that is not platform dependent, unlike a hardcoded fsin in Intel x86 asm that only works on x86 and has a certain error range; all of this may differ on other platforms with other asm instructions), so you need to keep things working at a concurrent and high-performance level.
The reason I can't wrap my head around the "switch case" with quadrants solution is:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same function that actually computes f(x); the situation can probably be improved by implementing the formula for f(x) outside said function, but this will double the number of functions to maintain in any given math library
it increases the probability of more branching under concurrent execution
generally speaking it doesn't lead to better, clean, DRY code, and conditional statements are often a potential source of bugs; I don't really like switch-cases and similar constructs.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs (the quadrants) to the result via the actual implementation?
Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs (the quadrants) to the result via the actual implementation?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions, but I have a feeling you are looking for problems where none really exist. Do you truly need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code that is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or avoiding bugs out of your control. But the first version should be most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory but this also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One is simply that most trigonometric functions have identities, so you can easily use these to reduce the range of input parameters. This is important as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyway for most numerical techniques to work, or at least to work in a decent amount of time with sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs.
it just prevents possible optimizations: Chances are your C++ compiler these days is way better at optimizations than you are. Sometimes it isn't, but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need them. These days it is very non-intuitive to tell which code is faster just by looking at it (you can read all the questions on SO about performance issues and how crazy some of the root causes are).
namely memoization: I've never seen memoization in place for a double function. Just think how many doubles there are between 0 and 1. In reduced-accuracy situations you can take advantage of it, but this is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function in a way that actually means anything and doesn't lose accuracy or performance in the process.
increase probability of more branching with concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner, but I suppose it's entirely possible to get some performance benefits. But again, the compiler is generally better at optimizations than you, so let it do its job and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "dry code" is for that matter. Yes, sometimes you can get into trouble with too many or too complex if/switch blocks, but I can't see a simple check for 4 quadrants being a problem here...it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
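The range-reduction idea behind the quadrant check can be sketched as follows. This is Python for brevity, though the same logic ports directly to C or C++; `first_quadrant_cos` stands in for any approximation that is only valid on [0, pi/2]:

```python
import math

def cos_reduced(x, first_quadrant_cos=math.cos):
    """cos(x) for any real x, given an implementation valid only on [0, pi/2].

    Folds the argument using cos's evenness, its 2*pi periodicity, and the
    identities cos(2*pi - x) = cos(x) and cos(pi - x) = -cos(x).
    """
    x = math.fmod(abs(x), 2 * math.pi)           # cos is even and 2*pi-periodic
    if x > math.pi:
        x = 2 * math.pi - x                      # fold [pi, 2*pi] onto [0, pi]
    if x > math.pi / 2:
        return -first_quadrant_cos(math.pi - x)  # fold [pi/2, pi] onto [0, pi/2]
    return first_quadrant_cos(x)
```

Production libraries use more careful reduction (e.g. Payne-Hanek) because `fmod` against an inexact 2*pi loses precision for huge arguments, but the branchy structure above is the standard shape of the solution.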
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest trying to code a few of these functions in as many ways as you can find and doing some accuracy and benchmark tests. This will give you some practical knowledge and some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you practical experience with implementing these types of functions, and it will likely answer a lot of your questions in the process.

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, eliminating outliers with studentized residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; I have no idea how to do two-way clustering or eliminate outliers with studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC and tobit/logit/probit here). Eliminating outliers can easily be done on your own, but there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problem treating it as continuous. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables, see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular that it implicitly assumes the correlations are fully modeled by this two-way structure, and that observations for firms i and j and times t and s are definitely uncorrelated for i!=j and t!=s. Sometimes this is a strong assumption to make: New York and New Jersey may be correlated in 2010, but New York 2010 is assumed uncorrelated with New Jersey 2009.
I have no idea what you might mean by ordinal outliers. Somebody must have piled up a bunch of dissertation advice (or, worse, analysis requests) without really trying to make sense of every bit.

Parsing natural language ingredient quantities for recipes [closed]

I'm building a Ruby recipe management application, and as part of it, I want to be able to parse ingredient quantities into a form I can compare and scale. I'm wondering what the best tools are for doing this.
I originally planned on a complex regex, then some other code that converts human-readable numbers like two or five into integers, and finally code that converts, say, 1 cup and 3 teaspoons into some base measurement. I control the input, so I kept the actual ingredient separate. However, I noticed users inputting abstract measurements like to taste and 1 package. For the abstract measurements, I think I could just ignore the unit when scaling and scrape any number preceding it.
Here are some more examples
1 tall can
1/4 cup
2 Leaves
1 packet
To Taste
One
Two slices
3-4 fillets
Half-bunch
2 to 3 pinches (optional)
Are there any tricks to this? I have noticed users seem somewhat confused about what constitutes a quantity. I could try to enforce stricter rules and push things like tall can and leaves into the ingredient part. However, in order to enforce that, I need to be able to convey what's invalid.
I'm also not sure what the "base" measurement I should convert quantities into.
These are my goals.
To be able to scale recipes. Arbitrary units of measurement like packages don't have to be scaled, but precise ones like cups or ounces need to be.
Figure out the "main" ingredients. In the context of this question, this will be done largely by figuring out what the largest ingredient in the recipe is. In production, there will have to be some sort of modifier based on the type of ingredient because, obviously, flour is almost never considered the "main" ingredient. However, chocolate can be used sparingly, and it can still be called a chocolate cake.
Normalize input. To keep some consistency on the site, I want to keep consistent abbreviations. For example, instead of pounds, it should be lbs.
You pose two problems, recognizing/extracting the quantity expressions (syntax) and figuring out what amount they mean (semantics).
Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:
<amount> <unit> [of <ingredient>]
where <amount> can take many forms:
whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)
The amount can also be expressed as a range of two simple <amount>s:
two to three
2 to 3
2-3
five to 10
Then you have the units themselves:
general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")
Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:
a little
to taste
I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need to. It shouldn't be too hard to write regexps for all of those, if that's your tool of choice, but as you see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit "pieces" or whatever; for "a little" and the like, I'd treat them as special units).
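A rough first cut at the parser sketched above, covering a few of the `<amount>` forms (digits, common fractions, ranges, number words). The vocabulary lists are illustrative only; real input needs far more entries and rules:

```python
import re
from fractions import Fraction

# Illustrative vocab; a real parser would need far more entries.
NUMBER_WORDS = {"one": Fraction(1), "two": Fraction(2), "three": Fraction(3),
                "half": Fraction(1, 2)}
UNITS = {"cup", "cups", "tsp", "tbsp", "slices", "fillets", "pinches",
         "leaves", "packet", "can", "bunch"}

def parse_amount(token):
    """Turn one token into a Fraction, or None if it isn't a number."""
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    if re.fullmatch(r"\d+/\d+|\d+(\.\d+)?", token):
        return Fraction(token)
    return None

def parse_quantity(text):
    """Return an (amount, unit, rest) triple; ranges become their midpoint."""
    tokens = re.split(r"[\s-]+", text.lower().strip())
    amounts, i = [], 0
    while i < len(tokens) and (a := parse_amount(tokens[i])) is not None:
        amounts.append(a)
        i += 1
        if i < len(tokens) and tokens[i] == "to":   # "2 to 3 pinches"
            i += 1
    if not amounts:
        return (None, None, text)                   # e.g. "to taste"
    amount = sum(amounts) / len(amounts)            # midpoint of a range
    unit = tokens[i] if i < len(tokens) and tokens[i] in UNITS else None
    rest = " ".join(tokens[i + (1 if unit else 0):])
    return (amount, unit, rest)
```

Inputs like "1 tall can" come out as (1, None, "tall can"), which is exactly the kind of case where you'd decide whether "tall can" belongs in the unit grammar or the ingredient.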
That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country.
You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.
On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
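Scaling the triples then falls out naturally. A sketch, where the gram weights for informal units are rough guesses rather than canonical values:

```python
# Rough, non-canonical gram equivalents for informal units.
INFORMAL_GRAMS = {"pinch": 0.2, "dash": 0.6}
SUBJECTIVE = {"to taste", "a little"}

def scale(amount, unit, factor):
    """Scale a parsed quantity; subjective amounts pass through unchanged."""
    if amount is None or unit in SUBJECTIVE:
        return (amount, unit)                 # "to taste" times ten is "to taste"
    if unit in INFORMAL_GRAMS:                # normalize informal units to grams
        return (amount * INFORMAL_GRAMS[unit] * factor, "g")
    return (amount * factor, unit)
```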
One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.
Regular expressions are difficult to get right for natural language parsing. NLTK, like you mentioned, would probably be a good option to look into; otherwise you'll find yourself going around in circles trying to get the expressions right.
If you want something of the Ruby variety instead of NLTK, take a look at Treat:
https://github.com/louismullie/treat
Also, the Linguistics framework might be a good option as well:
http://deveiate.org/projects/Linguistics
EDIT:
I figured there had to already be a Ruby recipe parser out there, here's another option you might want to look into:
https://github.com/iancanderson/ingreedy
There is a lot of free training data available out there if you know how to write a good web scraper and parsing tool.
http://allrecipes.com/Recipe/Darias-Slow-Cooker-Beef-Stroganoff - This site seems to let you convert recipe quantities based on metric/imperial system and number of diners.
http://www.epicurious.com/tools/conversions/common - This site seems to have lots of conversion constants.
Some systematic scraping of existing recipe sites which present ingredients, procedures in some structured format (which you can discover by reading the underlying html) will help you build up a really large training data set which will make taking on such a problem much much easier.
When you have tons of data, even simple learning techniques can be pretty useful. Once you have a lot of data, you can use standard nlp tricks (ngrams, tf-idf, naive bayes, etc) to quickly do awesome things.
For example:
Main Ingredient-ness
Ingredients in a dish with a higher idf (inverse document frequency) are more likely to be main ingredients. Every dish mentions salt, so it should have very low idf. A lot fewer dishes mention oil, so it should have a higher idf. Most dishes probably have only one main protein, so phrases like 'chicken', 'tofu', etc should be rarer and much more likely to be main ingredients than salt, onions, oil, etc. Of course there may be items like 'cilantro' which might be rarer than 'chicken', but if you had scraped out some relevant metadata along with every dish, you will have signals that will help you fix this issue as well. Most chefs might not be using cilantro in their recipes, but the ones that do probably use it quite a lot. So for any ingredient name, you can figure out the name's idf by first considering only the authors that have mentioned the ingredient at least once, and then seeing the ingredient's idf on this subset of recipes.
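The idf heuristic can be sketched with a toy corpus (the recipes below are invented for illustration):

```python
import math

# Toy corpus: each recipe as a set of ingredient names.
recipes = [
    {"salt", "oil", "chicken", "onion"},
    {"salt", "oil", "tofu", "ginger"},
    {"salt", "flour", "butter", "chocolate"},
    {"salt", "oil", "chicken", "garlic"},
]

def idf(ingredient, recipes):
    """log(N / document frequency): rarer ingredients score higher."""
    df = sum(1 for r in recipes if ingredient in r)
    return math.log(len(recipes) / df)

print(idf("salt", recipes))     # 0.0: appears everywhere, never "main"
print(idf("chicken", recipes))  # higher
print(idf("tofu", recipes))     # highest: strongest main-ingredient candidate
```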
Scaling recipes
Most recipe sites mention how many people does a particular dish serve, and have a separate ingredients list with appropriate quantities for that number of people.
For any particular ingredient, you can collect all the recipes that mention it and see what quantity of the ingredient was prescribed for what number of people. This should tell you what phrases are used to describe quantities for that ingredient, and how the numbers scale. Also you can now collect all the ingredients whose quantities have been described using a particular phrase (e.g. 'slices' -> (bread, cheese, tofu,...), 'cup' -> (rice, flour, nuts, ...)) and look at the most common of these phrases and manually write down how they would scale.
Normalize Input
This does not seem like a hard problem at all. Manually curating a list of common abbreviations and their full forms (e.g. 'lbs' -> 'pounds', 'kgs' -> 'kilograms', 'oz' -> 'ounces', etc.) should solve 90% of the problem. Adding new contractions to this list whenever you see them should make it pretty comprehensive after a while.
In summary, I am asking you to majorly increase the size of your data and collect lots of relevant metadata along with each recipe you scrape (author info, food genre, etc), and use all this structured data along with simple NLP/ML tricks to solve most problems you will face while trying to build an intelligent recipe site.
As far as these go:
I'd hard-code these so that if you get more than so many oz, go to cups; if you get more than so many cups, go to pints, liters, gallons, etc. I don't know how you can avoid this unless someone has already written code to handle it.
If an ingredient is in the title, it's probably the main ingredient. You'll run into issues with "Oatmeal Raisin Cookies" though. As you've stated, flour, milk, etc. aren't the main ingredient. You'll also possibly need to map bacon, pork chop, and pork roast all to pork, and steak, hamburger, etc. to beef.
Again, this is just a lookup on the amount of something. You know people are going to use lbs, oz, etc., so try to preempt them and write this as best you can. You might miss some, but as your site grows you'll be able to introduce new filters.
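The hard-coded unit-escalation idea from the first point might look like this, using US customary volume sizes (a sketch, not a complete table):

```python
# Largest-first table of US volume units, in fluid ounces
# (1 cup = 8 fl oz, 1 pint = 16, 1 quart = 32, 1 gallon = 128).
UNITS_IN_FL_OZ = [("gallon", 128), ("quart", 32), ("pint", 16), ("cup", 8),
                  ("fl oz", 1)]

def humanize(fl_oz):
    """Express a fluid-ounce amount in the largest unit it fills at least once."""
    for name, size in UNITS_IN_FL_OZ:
        if fl_oz >= size:
            return (fl_oz / size, name)
    return (fl_oz, "fl oz")      # sub-ounce amounts stay in fl oz
```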
If you go through all this work, consider releasing it so others don't have to :)