How to best calculate correlations for observations that are not independent of each other (in Stata)?

I have the following case. My data set consists of 78 policy documents (= my observations). These were written by 50 different country governments (in the period between 2005 and 2020). While 27 countries have written only one policy document, 23 countries have written multiple policy documents. In the latter case, these same-country different-policy documents have usually been written years apart by different governments/administrations and different ministries. Nevertheless, I reckon there is probably a risk that the observations are not independent. My overarching question is, therefore: How would you calculate correlations in this case? More specifically:
Pearson correlation assumes independence of the observations and is thus not suitable here, correct? Or could one even credibly argue that the observations are independent after all, since they were usually published many years (and therefore governments) apart and by different ministries?
Would "within-participants correlation" (Bland & Altman 1995 a & b) or "repeated measures correlation" (= RMCORR in R and Stata) be more suitable? Or is something else more appropriate?
Furthermore: Would I otherwise have to take into account any time effects when running correlations in my setting?
Thank you very much for your advice!
Disclaimer: also posted at Statalist here.
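For reference, if repeated measures correlation does turn out to be the right tool, here is a minimal sketch of what it looks like in practice, using the Python pingouin implementation of rmcorr rather than the R or Stata packages mentioned above; the file and column names are placeholders for the document-level variables being correlated.
import pandas as pd
import pingouin as pg

# Placeholder data: one row per policy document, plus the country that wrote it.
df = pd.read_csv("policy_documents.csv")  # hypothetical file; columns assumed below

# Repeated measures correlation: the common within-country association between
# two document-level variables, controlling for between-country differences.
# Countries that appear only once contribute nothing to the within-country
# estimate, so effectively only the 23 multi-document countries drive it.
res = pg.rm_corr(data=df, x="var_x", y="var_y", subject="country")
print(res)  # r, degrees of freedom, p-value, 95% CI, power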

Related

When applying word2vec, should I standardize the cosine values within each year, when comparing them across years?

I'm a researcher, and I'm trying to apply NLP to understand temporal changes in the meaning of some words.
So far I have obtained trained embeddings (word2vec, skip-gram with negative sampling) for several years, trained with identical parameters.
For example, if I want to test the change in cosine similarity between word A and word B over 5 years, should I just compute the cosine values and plot them?
The reason I'm asking is that I found the overall cosine values (the mean over all possible pairs within a year) differ across the 5 years. For example: 1990: 0.21, 1991: 0.19, 1992: 0.31, 1993: 0.22, 1994: 0.31. Does that mean that in some years all words are more similar to each other than in other years?
Based on my limited understanding, the vectors act as (log-)odds in a logistic function, so they shouldn't be significantly affected by the size of the corpus? Is it necessary for me to standardize the cosine values (over all pairs within each year) so I can compare relative ranking changes across years, or should I just trust the raw cosine values and compare them directly?
In general you should not think of cosine-similarities as an absolute measure that'd be comparable between models. That is, you should not think of "0.7" cosine-similarity as anything like "70%" similar, and choose some arbitrary "70%" threshold to be used across models.
Instead, it's only a measure within a single model's induced space - with its effective 'scale' affected by all the parameters & the training data.
One small exercise that may help illustrate this: with the exact same data, train a 100d model, then a 200d model. Then look at some word pairs, or words alongside their nearest-neighbors ranked by cosine-similarity.
With enough training/data, generally the same highly-related words will be nearest-neighbors of each other. But the effective ranges of cosine-similarity values will be very different. If you chose a specific threshold in one model as meaning, "close enough to feed some other analysis", the same threshold would not be appropriate in the other. Every model is its own world, induced by the training data & parameters, as well as some sources of explicit or implicit randomness during training. (Several parts of the word2vec algorithm use random sampling, but also any efficient multi-threaded training will encounter arbitrary differences in training order via host OS thread-scheduling vagaries.)
If your parameters are identical, & the corpora very-alike in every measurable internal proportion, these effects might be minimized, but never eliminated.
For example, even if people's intended word meanings were perfectly identical, one year's training data might include more discussion of 'war' or 'politics' or some medical topic than another. In that case, the iterative, interleaved tug-of-war of training updates means words from the overrepresented domain have far more push-pull influence on the final word positions, essentially warping subregions of the final space toward finer distinctions in some places and coarser distinctions in the less-updated zones.
That is, you shouldn't expect any global-per-model scaling factor (as you've implied might apply) to correct for any model-to-model differences. The influences of different data & training runs are far more subtle, and might affect different 'neighborhoods' of words differently.
Instead, when comparing different models, more stable grounds for comparison are relative rankings or relative proportions of words with respect to their closeness to others. Did words move into, or out of, each other's top-N neighbors? Did A move closer to B than C did to D? And so on.
Even there, you might want to be careful about differences in the full vocabulary: if A & B were each other's closest neighbors in year 1, but 5 other words squeeze between them in year 2, did any word's meaning really change? Or might it simply be that those other words weren't even suitably represented in year 1 to receive any position, or previously had somewhat 'noisier' positions nearby? (As words get rarer, their positions from run to run will be more idiosyncratic, based on their few usage examples and the influence of those other sources of run-to-run 'noise'.)
Limiting all such analyses to very-well-represented words will minimize misinterpreting noise-in-the-models as something meaningful. Re-running models more than once, either with the same parameters or slightly different ones, or on slightly different training-data subsets, and seeing which comparisons hold up across such changes, may also help determine which observed changes are robust versus methodological artifacts such as run-to-run jitter or other sampling effects.
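A small sketch of the rank-based comparison described above, using gensim; the per-year corpora and the word pair 'war'/'conflict' are placeholders.
from gensim.models import Word2Vec

# sentences_1990 / sentences_1991 are placeholders: lists of tokenized sentences,
# one list per year. workers=1 plus a fixed seed reduces (but does not remove)
# run-to-run randomness.
m90 = Word2Vec(sentences_1990, vector_size=100, sg=1, negative=5, seed=1, workers=1)
m91 = Word2Vec(sentences_1991, vector_size=100, sg=1, negative=5, seed=1, workers=1)

def neighbor_rank(model, a, b, topn=50):
    # Rank of b among a's nearest neighbors, or None if b is outside the top-n.
    neighbors = [w for w, _ in model.wv.most_similar(a, topn=topn)]
    return neighbors.index(b) + 1 if b in neighbors else None

# Compare relative standing across models rather than raw cosine values.
print("1990 rank:", neighbor_rank(m90, "war", "conflict"),
      "cosine:", m90.wv.similarity("war", "conflict"))
print("1991 rank:", neighbor_rank(m91, "war", "conflict"),
      "cosine:", m91.wv.similarity("war", "conflict"))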
A few previous answers on similar questions about comparing word-vectors across different source corpora may have other useful ideas or caveats for you:
how calculate distance between 2 node2vec model
Word embeddings for the same word from two different texts
How to compare cosine similarities across three pretrained models?

What is a proper method for minimizing st deviation of dependent variable (e.g. clustering?)

I'm stuck on minimizing the standard deviation of a dependent variable, which is a time difference in days. The mean is OK, but the deviation is terrible. I tried clustering by the independent variables and noticed quite dissimilar clusters. Now I'm wondering:
1) How can I actually apply this knowledge from clustering to the dependent variable? It was not included in the initial clustering analysis, since I know it depends on the others.
2) Given that I know the time-difference variable is dependent, should I cluster it together with a variable holding the cluster number from my initial clustering analysis? Would that help?
3) Is there any other technique, apart from clustering, that can help me categorize observations into groups so that each group has a separate mean of the dependent variable with a low standard deviation?
Any help highly appreciated!
P.S. I was using Stata and SPSS, though I can also use SAS if you can share the code.
It sounds like you're going about this all wrong. Here are some relevant points to consider.
It's more important for the variance to be consistent across groups than it is to be low.
Clustering is (generally) going to organize individuals based on similar patterns of the clustering variables.
Fewer observations will generally not decrease the size of your standard deviation.
Any time you take continuous variables (either IVs or DVs) and convert them into categorical variables, you are removing variance from the equation and introducing more measurement error. Sometimes there are good reasons to do this; often there are not.
Analysis should be theory-driven whenever possible, as data-driven analysis (like what you're trying to accomplish here) is more likely to produce results that can't be reproduced or generalized to other data sets, samples, or populations.

SAS Enterprise Guide, different treatments for missing variables

We are using the ESS data set, but are unsure how to deal with missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables; hence, we have a data set containing a lot of missing values.
We do not want to use list-wise deletion. Instead, we would like to treat the different kinds of missingness differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to use pair-wise deletion for "not applicable", while we might want to use, e.g., the mean value for some other responses, depending on the question (under the assumption that the type of response provides information about whether the data are MCAR, MAR, or NMAR).
Our main questions are:
Currently, missing values are coded in different ways in the data set (99, 77, 999, 88, etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)
On a technical note, SAS offers special missing values: .a, .b, .c, etc. (not case sensitive).
Replace the numeric codes in SAS, e.g. 99 = .a, 77 = .b.
Decision trees, for example, will be able to handle these as separate values.
To keep the information from the missing observations in a regression model, you will have to make some kind of tradeoff (find the least harmful solution to your problem).
One classical solution is to create dummy variables and replace the missing values with the mean. Include both the dummies and the original variables in the model. Possible problems: the coefficients will be biased, multicollinearity, too many categories/variables.
Another approach would be to bin your variables into categories. Do it purely by value (e.g. deciles) and you may suffer information loss; do it by theory and you may suffer confirmation bias.
A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values, including the missings. Of course this will again lead to bias and loss of information, but it might at least be a good step toward identifying useful/useless missing values.
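Just to make the mechanics of the dummy-plus-mean-imputation option concrete, here is a rough pandas/statsmodels sketch (not SAS syntax; the toy data and column names are made up):
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data: income uses coded missings (99 = refusal, 77 = don't know).
df = pd.DataFrame({
    "wellbeing": [6.0, 7.0, 5.0, 8.0, 4.0, 7.0],
    "income":    [30.0, 99.0, 45.0, 77.0, 50.0, 99.0],
})

# Flag each type of coded missing before wiping out the codes,
# so the "reason for missingness" stays in the model.
df["income_refused"] = (df["income"] == 99.0).astype(int)
df["income_dontknow"] = (df["income"] == 77.0).astype(int)
df["income"] = df["income"].replace({99.0: np.nan, 77.0: np.nan})

# Mean-impute the now-genuine missings and keep the flags as regressors.
df["income"] = df["income"].fillna(df["income"].mean())

X = sm.add_constant(df[["income", "income_refused", "income_dontknow"]])
print(sm.OLS(df["wellbeing"], X).fit().summary())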

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, and eliminate outliers with studentized residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; I have no idea how to do two-way clustering or how to eliminate outliers using studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC and tobit/logit/probit here). Eliminating outliers can easily be done on your own; there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problem treating it as continuous. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables; see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular that it implicitly assumes the correlations are fully modeled by this two-way structure, so observations for firms i and j at times t and s are assumed to be uncorrelated whenever i != j and t != s. Sometimes this is a strong assumption to make: New York and New Jersey may be correlated in 2010, but New York in 2010 is taken to be uncorrelated with New Jersey in 2009.
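If you do go the "treat the 21-point rating as continuous" route, here is a rough sketch of the trim-by-studentized-residuals-then-refit-with-two-way-clustered-errors workflow in Python, using statsmodels and linearmodels; the variable names come from the ologit line above, and everything else (the file name, the year variable) is assumed.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from linearmodels.panel import PanelOLS

# Assumes a long panel with columns: firm, year, rating3, securitized, retained.
df = pd.read_csv("ratings_panel.csv")  # hypothetical file name

# 1) Plain OLS on the full sample to obtain externally studentized residuals.
X = sm.add_constant(df[["securitized", "retained"]])
ols = sm.OLS(df["rating3"], X).fit()
rstudent = pd.Series(OLSInfluence(ols).resid_studentized_external, index=df.index)

# 2) Keep only observations with |studentized residual| <= 2.5.
trimmed = df[rstudent.abs() <= 2.5].set_index(["firm", "year"])

# 3) Refit, treating the rating as continuous, with standard errors
#    clustered two ways (by firm and by year).
mod = PanelOLS.from_formula("rating3 ~ 1 + securitized + retained", data=trimmed)
res = mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
print(res.summary)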
I have no idea what you might mean by ordinal outliers. Somebody must have piled up a bunch of dissertation advice (or, worse, analysis requests) without really trying to make sense of every bit.

Parsing natural language ingredient quantities for recipes [closed]

I'm building a ruby recipe management application, and as part of it, I want to be able to parse ingredient quantities into a form I can compare and scale. I'm wondering what the best tools are for doing this.
I originally planned on a complex regex, then some other code that converts human-readable numbers like two or five into integers, and finally code that converts, say, 1 cup and 3 teaspoons into some base measurement. I control the input, so I kept the actual ingredient separate. However, I noticed users inputting abstract measurements like to taste and 1 package. At least for the abstract measurements, I think I could just ignore the unit when scaling and simply scale any number preceding it.
Here are some more examples
1 tall can
1/4 cup
2 Leaves
1 packet
To Taste
One
Two slices
3-4 fillets
Half-bunch
2 to 3 pinches (optional)
Are there any tricks to this? I have noticed users seem somewhat confused about what constitutes a quantity. I could try to enforce stricter rules and push things like tall can and leaves into the ingredient part. However, in order to enforce that, I need to be able to convey what's invalid.
I'm also not sure what the "base" measurement I should convert quantities into.
These are my goals.
To be able to scale recipes. Arbitrary units of measurement like packages don't have to be scaled, but precise ones like cups or ounces need to be.
Figure out the "main" ingredients. In the context of this question, this will be done largely by figuring out which ingredient is the largest in the recipe. In production, there will have to be some sort of modifier based on the type of ingredient because, obviously, flour is almost never considered the "main" ingredient. However, chocolate can be used sparingly and the result can still be called a chocolate cake.
Normalize input. To keep some consistency on the site, I want to keep consistent abbreviations. For example, instead of pounds, it should be lbs.
You pose two problems: recognizing/extracting the quantity expressions (syntax) and figuring out what amount they mean (semantics).
Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:
<amount> <unit> [of <ingredient>]
where <amount> can take many forms:
whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)
The amount can also be expressed as a range of two simple <amount>s:
two to three
2 to 3
2-3
five to 10
Then you have the units themselves:
general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")
Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:
a little
to taste
I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need to. It shouldn't be too hard to write regexps for all of those, if that's your tool of choice, but as you see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit "pieces" or whatever; for "a little" and the like, I'd treat them as special units).
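The question is about a Ruby app, but just to make the (amount, unit, note) triple idea concrete, here is a rough sketch of such a mini-parser in Python; the word lists and regex are illustrative, not exhaustive.
import re
from fractions import Fraction

# Illustrative word lists; a real parser would need far more entries.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "half": Fraction(1, 2)}
UNITS = {"cup", "cups", "tsp", "teaspoon", "teaspoons", "tbsp", "oz",
         "lb", "lbs", "leaves", "slices", "fillets", "packet", "can", "pinches"}
SUBJECTIVE = {"to taste", "a little"}

def parse_quantity(text):
    # Return (amount, unit, rest) for one quantity string; None where unknown.
    t = text.strip().lower()
    if t in SUBJECTIVE:                      # special case: neither amount nor unit
        return (None, None, t)

    # Amount: digits, fraction, decimal, range ("2-3", "2 to 3"), or a number word.
    m = re.match(r"(?P<a>\d+(?:/\d+|\.\d+)?)(?:\s*(?:-|to)\s*(?P<b>\d+))?\s*(?P<rest>.*)", t)
    if m:
        lo = Fraction(m.group("a"))
        amount = (lo + Fraction(m.group("b"))) / 2 if m.group("b") else lo
        rest = m.group("rest")
    else:
        word, _, rest = t.partition(" ")
        amount = NUMBER_WORDS.get(word)
        if amount is None:
            return (None, None, t)           # unrecognized; keep the raw text

    unit = next((w for w in rest.split() if w in UNITS), None)
    return (float(amount), unit, rest)

if __name__ == "__main__":
    for s in ["1/4 cup", "2 Leaves", "3-4 fillets", "To Taste", "Two slices"]:
        print(s, "->", parse_quantity(s))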
That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country.
You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.
On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
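And for the conversion/scaling step, a small table of factors into one base unit per dimension (millilitres for volume, grams for weight) covers the well-defined units; the figures for the informal units below are rough guesses, exactly the sort of approximation suggested above.
# Factors into base units: millilitres for volume, grams for weight.
# Values for informal units ("pinch", "dash") are rough guesses.
TO_BASE = {
    "tsp": ("ml", 4.93), "tbsp": ("ml", 14.79), "cup": ("ml", 236.6),
    "oz": ("g", 28.35), "lb": ("g", 453.6), "lbs": ("g", 453.6),
    "pinch": ("g", 0.3), "dash": ("ml", 0.6),
}

def to_base(amount, unit):
    # Convert (amount, unit) to base units; pass unknown units through unchanged.
    if unit in TO_BASE:
        base_unit, factor = TO_BASE[unit]
        return amount * factor, base_unit
    return amount, unit          # "package", "to taste", etc. stay as they are

def scale(amount, unit, factor):
    # Scale precise and countable quantities; leave subjective/unknown ones alone.
    if amount is not None and (unit in TO_BASE or unit in {"ml", "g", None}):
        return amount * factor, unit
    return amount, unit

print(to_base(0.25, "cup"))      # -> (59.15, 'ml')
print(scale(2, "package", 3))    # -> (2, 'package'): arbitrary units not scaled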
One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.
Regular expressions are difficult to get right for natural language parsing. NLTK, like you mentioned, would probably be a good option to look into otherwise you'll find yourself going around in circles trying to get the expressions right.
If you want something of the Ruby variety instead of NLTK, take a look at Treat:
https://github.com/louismullie/treat
Also, the Linguistics framework might be a good option as well:
http://deveiate.org/projects/Linguistics
EDIT:
I figured there had to already be a Ruby recipe parser out there, here's another option you might want to look into:
https://github.com/iancanderson/ingreedy
There is a lot of free training data available out there if you know how to write a good web scraper and parsing tool.
http://allrecipes.com/Recipe/Darias-Slow-Cooker-Beef-Stroganoff - This site seems to let you convert recipe quantities based on metric/imperial system and number of diners.
http://www.epicurious.com/tools/conversions/common - This site seems to have lots of conversion constants.
Some systematic scraping of existing recipe sites which present ingredients, procedures in some structured format (which you can discover by reading the underlying html) will help you build up a really large training data set which will make taking on such a problem much much easier.
When you have tons of data, even simple learning techniques can be pretty useful. Once you have a lot of data, you can use standard nlp tricks (ngrams, tf-idf, naive bayes, etc) to quickly do awesome things.
For example:
Main Ingredient-ness
Ingredients in a dish with a higher idf (inverse document frequency) are more likely to be main ingredients. Every dish mentions salt, so it should have a very low idf. A lot fewer dishes mention oil, so it should have a higher idf. Most dishes probably have only one main protein, so words like 'chicken', 'tofu', etc. should be rarer and much more likely to be main ingredients than salt, onions, oil, etc. Of course there may be items like 'cilantro' that are rarer than 'chicken', but if you scrape some relevant metadata along with every dish, you will have signals that help you fix this issue as well. Most chefs might not use cilantro in their recipes, but the ones who do probably use it quite a lot. So for any ingredient name, you can figure out its idf by first considering only the authors who have mentioned the ingredient at least once, and then computing the ingredient's idf on this subset of recipes.
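A toy sketch of that per-ingredient idf calculation (the recipes here are made up; with real scraped data you would build the sets from each recipe's ingredient list):
import math
from collections import Counter

# Hypothetical scraped data: each recipe reduced to its set of ingredient names.
recipes = [
    {"salt", "oil", "chicken", "onion"},
    {"salt", "oil", "tofu", "garlic"},
    {"salt", "flour", "chocolate", "butter"},
    {"salt", "oil", "chicken", "garlic"},
]

doc_freq = Counter(ing for r in recipes for ing in r)
n = len(recipes)
idf = {ing: math.log(n / df) for ing, df in doc_freq.items()}

# Higher idf = rarer across recipes = better candidate for "main ingredient".
for ing, score in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f"{ing:10s} idf={score:.2f}")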
Scaling recipes
Most recipe sites mention how many people a particular dish serves, and have a separate ingredients list with appropriate quantities for that number of people.
For any particular ingredient, you can collect all the recipes that mention it and see what quantity of the ingredient was prescribed for what number of people. This should tell you what phrases are used to describe quantities for that ingredient, and how the numbers scale. Also you can now collect all the ingredients whose quantities have been described using a particular phrase (e.g. 'slices' -> (bread, cheese, tofu,...), 'cup' -> (rice, flour, nuts, ...)) and look at the most common of these phrases and manually write down how they would scale.
Normalize Input
This does not seem like a hard problem at all. Manually curating a list of common abbreviations and their full forms (e.g. 'lbs' -> 'pounds', 'kgs' -> 'kilograms', 'oz' -> 'ounces', etc.) should solve 90% of the problem. Adding new abbreviations to this list whenever you see them should make it pretty comprehensive after a while.
In summary, I am asking you to majorly increase the size of your data and collect lots of relevant metadata along with each recipe you scrape (author info, food genre, etc), and use all this structured data along with simple NLP/ML tricks to solve most problems you will face while trying to build an intelligent recipe site.
As far as these go:
I'd hard-code these up so that if you get more than so many oz, go to cups; if you get more than so many cups, go to pints, liters, gallons, etc. I don't know how you can avoid this unless someone has already written the code to handle it.
If an ingredient is in the title, it's probably the main ingredient. You'll run into issues with "Oatmeal Raisin Cookies", though. As you've stated, flour, milk, etc. aren't the main ingredient. You'll also possibly need to map bacon, pork chop, and pork roast all to pork, and steak, hamburger, etc. to beef.
Again, this is just a lookup on the amount of something; you know people are going to use lbs, oz, etc., so try to preempt them and write this as best you can. You might miss some, but as your site grows you'll be able to introduce new filters.
If you go through all this work, consider releasing it so others don't have to :)