Why bootstrap returns empty results and values for standard deviation, CI, and others are missing - Stata

I am trying to use Stata's bootstrap command to estimate the standard deviation of the median of the hours variable from the nlsw88 dataset.
I can't get an answer for hours, but when I change the variable in my code from hours to wage it works. The only difference between these two variables is that hours is integer-valued while wage has decimals. I even changed the storage type of hours from byte to float, but nothing changed.
sysuse nlsw88
bootstrap r(p50), nodots: summarize hours , detail

You are bootstrapping to estimate the median (p50), but in this dataset there is, in practice, no variance in the median.
You are not setting a sample size, so bootstrap draws samples as large as the original dataset, and 48.75% of the observations have the value 40, which is also close to the mean. All of this makes it very likely that the median is 40 in every bootstrap sample, and that there is therefore no variance between samples, which is why the standard error and confidence interval come back missing.
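You can check that concentration directly; a quick sketch (run it yourself to see the exact percentages):
sysuse nlsw88, clear
tabulate hours           // the row for 40 carries close to half the sample
summarize hours, detail  // confirms the median is 40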
You would have to pick a very small sample size to have any real chance of variation in the median between the randomly drawn samples. For example, sample only 5 observations in each bootstrap round:
sysuse nlsw88
bootstrap r(p50) , size(5) nodots: summarize hours , detail
This explains why you are getting missing values (which is your question), but it does not answer what you should do instead. Ask yourself whether you really meant to use the median, and whether bootstrap is the right method to use here.
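One way to see what is going on is to save the replicate-level results and look at them; a minimal sketch (reps(200) and the tempfile are arbitrary choices):
sysuse nlsw88, clear
tempfile reps
bootstrap med=r(p50), reps(200) nodots saving(`reps'): summarize hours, detail
use `reps', clear
summarize med   // if every replicate median is 40, the SD is zero and bootstrap reports missing results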


Google Sheet Not Multiplying in IF Formula

I am trying to calculate a price based on rates. If the number is $20,000 or below, there is a flat rate of $700. If the number is between $20,000.01 and $50,000, the rate is 3.5% of the number. The rates continue to decrease as the numbers go up. I can get Google Sheets to populate the box with $700 if it is below $20,000, but I can't seem to make it do the multiplication for me. The cell just shows C4*.035
I want it to multiply the number shown in the C4 square by the percentage listed.
Here is the code as it currently sits:
=if(AND(C4<=20000),"700",IF(AND(C4>=20000.01,C4<=50000),"C4*.035", IF(AND(C4>=50000.01,C4<=100000),"C4*.0325", IF(AND(C4>=100000.01),"C4*.03"))))
Note, I know nothing about coding so I apologize if it is sloppy or doesn't make sense. I tried to copy and format based on an example that was semi similar to mine.
try:
=IF(C4<=20000, 700,
IF(AND(C4>=20000.01, C4<=50000), C4*0.035,
IF(AND(C4>=50000.01, C4<=100000), C4*0.0325,
IF(AND(C4>=100000.01), C4*0.03))))
As BigBen noticed in his comment, there is a mistake in your formula: you should not put quotation marks around the calculation, because that makes it be read as a literal text string instead of being evaluated.
Actually, a cleaner solution is to use the IFS function for this task. IFS checks its conditions in order and returns the result paired with the first condition that is true, so each later test only needs an upper bound:
=ifs(C4<=20000,700,
C4<=50000,C4*0.035,
C4<=100000,C4*0.0325,
C4>100000,C4*0.03)

using a list of regressors and storing the values of betas

I have a list of circumstance and effort variables.
I want to generate a matrix, betas, containing the estimated coefficients. I am going to run the loop 10 times, because I am in fact going to bootstrap my observations.
So far I have tried:
local circumstances height weight
local effort training diet
foreach i in 1 10 {
reg outcome `circumstances' `effort'
* store in column i the values of betas of circumstances
* store in column i the values of betas of effort
}
Does anyone know what the code should look like in order to store those values?
Thank you
The pseudocode would first store in "column 1" the first lot of betas and then overwrite them (column 1) with the second lot of betas. Then it would do the same again for column 10 with the first lot of betas and the second lot of betas. That is a long way from anything that makes sense. Nothing in your pseudocode takes bootstrap samples from the dataset, although perhaps you are intending to add code for that later.
Stata doesn't really work with any idea of column numbers, although the idea makes sense to Mata.
Unless there are very specific reasons -- which you would need to spell out -- there is no need to write your own code ab initio for bootstrapping, as the whole point of bootstrap is to do that for you.
Here is complete code for a reproducible example of bootstrapping a silly regression:
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price] , reps(1000) seed(2803) : regress mpg weight price
See also the help for bootstrap to learn about its other options, including saving().
10 repetitions would be regarded as an absurdly small number of bootstrap samples.
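As an illustration of saving(), here is a sketch that keeps each replicate's coefficients in a file for later inspection (the filename boot_betas is arbitrary):
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price], reps(1000) seed(2803) saving(boot_betas, replace): regress mpg weight price
use boot_betas, clear       // one row per bootstrap replicate
summarize b_weight b_price  // the SDs here are the bootstrap standard errors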

Correct values for SsaSpikeEstimator's pvalueHistoryLength

When creating a SsaSpikeEstimator instance via the DetectSpikeBySsa method, there is a parameter called pvalueHistoryLength. Could anybody please help me understand, for a given time series with X points, what the optimal value for this parameter is?
I ran into a similar issue. When I read the paper https://arxiv.org/pdf/1206.6910.pdf, I noticed this paragraph:
Also, simulations and theory (Golyandina, 2010) show that it is better to choose window length L smaller than half of the time series length N. One of the recommended values is N/3.
Maybe that's why, in the ML.Net Power Anomaly example, the value is chosen to be 30 for the 90-point dataset (90 / 3 = 30).

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression output, as on the first page of the PDF at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . For example, if I didn't already know the value 0.062, how could I work it out from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less the number of estimated parameters), while the second, abs(_b[_cons]/_se[_cons]), works out to 1.92 here: the absolute value of the coefficient of interest divided by its standard error, i.e., the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored degrees of freedom with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
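Plugging the numbers in directly reproduces the value from the linked table (a quick check using 38 degrees of freedom and a t-stat of 1.92):
display 2*ttail(38, 1.92)   // 0.06238771, which rounds to the 0.062 in the table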
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see that the p-value falls between .06 and .07, note that it is closer to .06, and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Individual score contributions in ML estimation

I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own estimation command, you can create new variables containing these scores using the ml score command. Official Stata commands and most finished user-written commands will often have score as an option for predict, which does the same thing with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.
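As a minimal sketch of the predict route, using logit on the auto dataset purely for illustration (your own model and variables will differ):
sysuse auto, clear
logit foreign mpg weight
predict double sc, score               // d(lnL)/d(xb), one value per observation
generate double score_mpg = sc * mpg   // chain rule: observation-level score for _b[mpg]
summarize score_mpg                    // the sum is numerically zero at the MLE, matching e(gradient)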