Goodness of fit for panel data (fixed effects and random effects) [migrated] - Stata

I am running some panel data models (fixed effects and random effects) to check the association between two variables. For that, I am checking different covariates to assess possible confounding factors. I am a bit lost about which indicators in my Stata output I should consider to assess goodness of fit and to choose which specification fits better.
I have read about considering the within R-squared, the F statistic, or the rho indicator, but I cannot reach a conclusion.


Best ways to evaluate randomness of a shuffled list [closed]

I'm trying to quantify the randomness of a shuffled list. Given a list of distinct integers, I want to shuffle it with different random generators or methods and evaluate the quality of the resulting shuffle.
For now, I have tried a kind of dice experiment. Given a list of input_size elements, I select one bucket in the list to be the "observed" one, and then I shuffle the initial list num_runs * input_size times (always starting from a fresh copy). I then look at the frequencies of the elements that fell into the observed bucket and report the result on a plot. You can see the results below for three different methods (line plots of the frequencies; I tried histograms but they looked bad).
[Figure: the dice experiment over three methods]
Reporting plots alone is not formal enough; I would like to report some numbers. What are the best ways to do this (or the ways used in academic publications)?
Thanks in advance.
Quantifying randomness isn't trivial.
You generate a huge amount of random numbers and then test whether they have the properties exhibited by true random numbers. There are tons of properties you can test for; e.g., each bit should occur with 50% probability.
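For example, a minimal sketch of that single-bit property test in Python (the function name and the rough |z| < 3 rule of thumb are my own illustrative choices, not part of any standard suite):

import math
import os

def monobit_z_score(data: bytes) -> float:
    """z-score of the one-bit count against the 50% expectation."""
    n_bits = len(data) * 8
    ones = sum(bin(b).count("1") for b in data)
    # Under the null hypothesis, ones ~ Binomial(n_bits, 0.5)
    return (ones - n_bits / 2) / math.sqrt(n_bits * 0.25)

z = monobit_z_score(os.urandom(1 << 20))  # ~1 MiB of OS randomness as a sanity check
print(f"z = {z:.2f}")                     # |z| far above ~3 would be suspicious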
There are randomness test suites that combine a bunch of these tests and try to find statistical flaws in the pseudo-random numbers.
PractRand is to my knowledge currently the most sophisticated randomness test suite.
I'd suggest you write a program that uses your method to repeatedly shuffle an array of, e.g., [0..255] and write the raw bytes to stdout, so the output is a pseudo-random bit stream. Then you can pipe that into PractRand, and it will quit once it finds statistical flaws: ./a.out | ./PractRand stdin.
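A minimal Python sketch of that pipeline (the script name is hypothetical, and random.shuffle stands in for whatever shuffling method you want to test):

# shuffle_stream.py -- endlessly shuffle [0..255] and emit the raw bytes,
# so the output can be piped into PractRand, e.g.:
#   python3 shuffle_stream.py | ./PractRand stdin
import random
import sys

values = list(range(256))
out = sys.stdout.buffer        # raw bytes, not text

while True:
    arr = values.copy()        # always start from a fresh copy
    random.shuffle(arr)        # swap in the shuffling method under test
    out.write(bytes(arr))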
TestU01's "Big Crush" is also a pretty good test suite, but it takes a very long time to run, and in my experience PractRand finds more statistical flaws.
I suggest not using the Diehard or the newer Dieharder test suites, because they aren't as powerful and produce false positives even when testing CSPRNGs or true random number generators.

When applying word2vec, should I standardize the cosine values within each year, when comparing them across years?

I'm a researcher, and I'm trying to apply NLP to understand temporal changes in the meaning of some words.
So far I have obtained trained embeddings (word2vec, skip-gram with negative sampling) for several years, using identical training parameters.
For example, if I want to test the change in cosine similarity between word A and word B over 5 years, should I just compute the cosine values and plot them?
The reason I'm asking is that I found the overall cosine values (the mean over all possible pairs within a year) differ across the 5 years. For example, 1990: 0.21, 1991: 0.19, 1992: 0.31, 1993: 0.22, 1994: 0.31. Does this mean that in some years all words are more similar to each other than in other years?
Based on my limited understanding, the vectors act like odds in logistic functions, so they shouldn't be significantly affected by the size of the corpus? Do I need to standardize the cosine values (of all pairs within each year) so I can compare the relative ranking change across years, or can I just trust the raw cosine values and compare them across years?
In general you should not think of cosine-similarities as an absolute measure that'd be comparable between models. That is, you should not think of "0.7" cosine-similarity as anything like "70%" similar, and choose some arbitrary "70%" threshold to be used across models.
Instead, it's only a measure within a single model's induced space - with its effective 'scale' affected by all the parameters & the training data.
One small exercise that may help illustrate this: with the exact same data, train a 100d model, then a 200d model. Then look at some word pairs, or words alongside their nearest-neighbors ranked by cosine-similarity.
With enough training and data, generally the same highly related words will be nearest neighbors of each other. But the effective ranges of cosine-similarity values will be very different. If you chose a specific threshold in one model as meaning "close enough to feed some other analysis", the same threshold would not be appropriate in the other. Every model is its own world, induced by the training data & parameters, as well as some sources of explicit or implicit randomness during training. (Several parts of the word2vec algorithm use random sampling, and any efficient multi-threaded training will also encounter arbitrary differences in training order via host-OS thread-scheduling vagaries.)
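A rough sketch of that exercise, assuming gensim 4.x and that you already have a tokenized corpus named sentences (a list of token lists); the example word pairs are arbitrary:

from gensim.models import Word2Vec

# `sentences`: your tokenized corpus (list of lists of tokens), assumed to exist
common = dict(sg=1, negative=5, window=5, min_count=5, epochs=5, seed=42)
m100 = Word2Vec(sentences, vector_size=100, **common)
m200 = Word2Vec(sentences, vector_size=200, **common)

for a, b in [("war", "conflict"), ("doctor", "nurse")]:   # arbitrary example pairs
    print(a, b, m100.wv.similarity(a, b), m200.wv.similarity(a, b))
# The neighbor *rankings* tend to agree far more across the two models
# than the raw cosine values, whose typical range shifts with dimensionality.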
If your parameters are identical, & the corpora very-alike in every measurable internal proportion, these effects might be minimized, but never eliminated.
For example, even if people's intended word meanings were perfectly identical, one year's training data might include more discussion of 'war' or 'politics' or some medical topic than another. In that case, the iterative, interleaved tug-of-war of training updates will mean words from the overrepresented domain have far more push-pull influence on the final word positions, essentially warping subregions of the final space toward finer distinctions in some places, and thus coarser distinctions in the less-updated zones.
That is, you shouldn't expect any global-per-model scaling factor (as you've implied might apply) to correct for any model-to-model differences. The influences of different data & training runs are far more subtle, and might affect different 'neighborhoods' of words differently.
Instead, when comparing different models, a more stable basis for comparison is the relative rankings or relative proportions of words with respect to their closeness to others. Did words move into, or out of, each other's top-N neighbors? Did A move closer to B than C did to D? Etc.
Even there, you might want to be careful about differences in the full vocabulary: if A & B were each other's closest neighbors in year 1, but 5 other words squeeze between them in year 2, did any word's meaning really change? Or might it simply be that those other words weren't even suitably represented in year 1 to receive any position, or previously had somewhat 'noisier' positions nearby? (As words get rarer, their positions from run to run become more idiosyncratic, based on their few usage examples and the influences of those other sources of run-to-run 'noise'.)
Limiting all such analyses to very well-represented words will minimize the risk of misinterpreting noise in the models as something meaningful. Re-running models more than once, with the same parameters, slightly different ones, or slightly different training-data subsets, and seeing which comparisons hold up across such changes, can also help distinguish robust changes from methodological artifacts such as run-to-run jitter or other sampling effects.
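As a sketch of that ranking-based comparison, assuming gensim 4.x KeyedVectors saved per year (the file names, the top-25 cutoff, and the 5,000-word frequency filter are all hypothetical choices):

from gensim.models import KeyedVectors

kv_1990 = KeyedVectors.load("vectors_1990.kv")   # hypothetical per-year files
kv_1994 = KeyedVectors.load("vectors_1994.kv")

def topn_overlap(word, kv_a, kv_b, n=25):
    """Fraction of word's top-n neighbors shared by the two models."""
    na = {w for w, _ in kv_a.most_similar(word, topn=n)}
    nb = {w for w, _ in kv_b.most_similar(word, topn=n)}
    return len(na & nb) / n

# restrict to well-represented words present in both years
shared = [w for w in kv_1990.index_to_key[:5000] if w in kv_1994.key_to_index]
drifted = sorted(shared, key=lambda w: topn_overlap(w, kv_1990, kv_1994))
print(drifted[:20])   # candidates whose neighborhoods changed the most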
A few previous answers on similar questions about comparing word-vectors across different source corpora may have other useful ideas or caveats for you:
how calculate distance between 2 node2vec model
Word embeddings for the same word from two different texts
How to compare cosine similarities across three pretrained models?

Exclude variables in regression based on causality-criteria in SAS

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great; but just because the p-value of a variable is ≤ 0.05, that doesn't necessarily mean the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This would probably be a spurious regression with the causality pointing the wrong way, but SAS is only a computer and doesn't know which way the causality should go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: is it possible to run a regression and tell SAS that it must do backward/forward elimination, but also exclude variables that don't meet some rules I set? E.g., if deaths go up as the number of doctors increases, exclude the variable number of doctors? And what would that look like?
I really hope, that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more deaths? Obviously you need to apply expert opinion to each situation, but the above scenario is just as plausible.
You also mention 'share of docs' which isn't the actual number if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of exclusion rules you want to apply, that may be possible. But you first have to define all those rules and have a level of certainty about them.
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backward elimination in the glmselect procedure, e.g.:
proc glmselect data=...;
    model ... / selection=forward;
    ...
run;
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.

How should we set the number of the neurons in the hidden layer in neural network? [closed]

In neural network theory, setting the size of the hidden layers seems to be a really important issue. Are there any criteria for how to choose the number of neurons in a hidden layer?
Yes - this is a really important issue. Basically there are two ways to do that:
Try different topologies and choose the best: because the numbers of neurons and layers are discrete parameters, you cannot differentiate your loss function with respect to them in order to use gradient-descent methods. So the easiest way is simply to set up different topologies and compare them, either with cross-validation or by splitting your data into training / validation / test parts. You can also use grid / random search schemes to do that; libraries like scikit-learn have appropriate modules for this (see the sketch after this list).
Dropout: the training technique called dropout can also help. In this case you set up a relatively big number of nodes in your layers and try to adjust a dropout parameter for each layer. In this scenario, e.g. assuming a two-layer network with 100 nodes in the hidden layer and dropout_parameter = 0.6, you are learning a mixture of models, where every model is a neural network of size ~40 (approximately 60 nodes are turned off at a time). This can also be considered a way of figuring out the best topology for your task.
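A minimal sketch of the first approach using scikit-learn (the dataset and the candidate topologies are arbitrary examples):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
param_grid = {"hidden_layer_sizes": [(25,), (50,), (100,), (50, 50)]}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best topology by cross-validated accuracy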
There are also a few other algorithms for creating and pruning hidden layer neurons on the fly. The one I'm most familiar with is Cascade Correlation, which gets pretty good performance for many applications, despite the fact that the hidden layer starts with a single neuron and adds others as needed.
For further reading see:
• The original paper by Scott E. Fahlman and Christian Lebiere, The Cascade-Correlation Learning Architecture.
• Gábor Balázs' Cascade Correlation Neural Networks: A Survey.
• Lutz Prechelt's Investigation of the CasCor Family of Learning Algorithms.
There are many other such algorithms for dynamically constructing the hidden layer, which can be found scattered across the Internet in various .pdf research papers. Some sleuthing may be worthwhile to avoid reinventing the wheel and may turn up just the right method for the problem you're trying to solve. Neural net research is spread out across many varied disciplines so there's no telling what else is out there; keeping track of all the new algorithms is a daunting prospect. I hope that helps though.
You have to set the number of neurons in the hidden layer so that the resulting number of weights isn't more than the number of your training examples; there is no general rule of thumb for the number of neurons itself.
For example, with the MNIST dataset you have about 60,000 training examples. A 784-30-10 network has 784*30 + 30*10 = 23,820 weights, which is fewer than the number of training examples; but a 784-100-10 network has 784*100 + 100*10 = 79,400 weights, which exceeds the number of training examples and is very likely to overfit.
In short, make sure you are not overfitting, and you will have a good chance of getting a good result.

frisbee trajectory [closed]

This is my first post. I'm the lead programmer on a FIRST robotics team, and this year's competition is about throwing Frisbees. I was wondering if there was some sort of "grand unified equation" for the trajectory that takes into account the air resistance, initial velocity, initial height, initial angle, etc. Basically, I would like to acquire data from an ultrasonic rangefinder, the encoders that determine the speed of our motors, the angle of our launcher, the rotational force (should be pretty constant. We'll determine this on our own) and the gravitational constant, and plug it into an equation in real time as we're lining up shots to verify/guesstimate whether or not we'll be close. If anyone has ever heard of such a thing, or knows where to find it, I would really appreciate it! (FYI, I have already done some research, and all I can find are a bunch of small equations for each aspect, such as rotation and whatnot. It'll ultimately be programmed in C++). Thanks!
I'm a mechanical engineer who writes software for a living. Before moving to tech startups, I worked for Lockheed Martin writing simulation software for rockets. I've got some chops in this area.
My professional instinct is that there is no such thing as a "grand unified equation". In fact, this is a hard enough problem that there may not be very good theoretical models for it at all: for instance, one of your equations will have to be the lift generated by the frisbee, which will depend on its cross-section, speed, angle of attack, and assumptions about the properties of the air. Unless you're going to put your frisbee in a wind tunnel, this equation will be, at best, an approximation.
It gets worse in the real world: will you be launching the frisbee where there is wind? Then you can kiss your models goodbye, because as casual frisbee players know, the wind is a huge disturbance. Your models can be fine, but the real world can be brutal to them.
The way this complexity is handled in the real world is that almost all systems have feedback: a pilot can correct for the wind, or a rocket's computer removes disturbances from differences in air density. Unless you put a microcontroller with control surfaces on the frisbee, you're just not going to get far with your open loop prediction - which I'm sure is the trap they're setting for you by making it a frisbee competition.
There is a solid engineering way to approach the problem: give Newton the boot and characterize the physics empirically yourselves.
This is the empirical modeling process: launch a frisbee across a matrix of pitch and roll angles, launch speeds, frisbee spin speeds, etc... and backfit a model to your results. This can be as easy as linear interpolation of the results of your table, so that any combination of input variables can generate a prediction.
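A sketch of that lookup-table idea in Python (the same interpolation is straightforward to port to C++); the axes, grid values, and the placeholder "measurements" below are stand-ins for your own test-launch data:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

angles = np.array([20.0, 30.0, 40.0])   # launch angle, degrees
speeds = np.array([8.0, 10.0, 12.0])    # launch speed, m/s
spins  = np.array([5.0, 10.0])          # spin rate, rev/s

# placeholder standing in for the measured throw distance of each test launch
measured_range = np.random.default_rng(0).uniform(5.0, 25.0, size=(3, 3, 2))

predict_range = RegularGridInterpolator((angles, speeds, spins), measured_range)
print(predict_range([[33.0, 10.5, 7.0]]))   # linear interpolation between measurements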
It's not guess-and-check, because you populate your tables ahead of time, so you can make some sort of prediction about the results. You will get much better information faster than with the idealized models, though you will have to keep going to fetch your frisbee :)