How to do proportionate stratified sampling without replacement in Stata?

I want to select my sample in Stata 13 based on three stratum variables with 12 strata in total (size: two strata; sector: three strata; intangible intensity: two strata). The selection should be proportional and without replacement.
However, I can only find disproportionate selection commands that select, for instance, x% of each stratum.
Can anyone help me out with this problem?

Thank you for this discussion. I think I know where my problem was.
The command "gsample" can select strata based on different variables. Therefore, I thought I had to define three different stratum variables. But the solution should be more simple.
There are 12 strata in total (the large firms with high intensity in sector 1, the small firms with high intensity in sector 1, and so on) with each firm in the sample falling in to one of the strata.
All I have to do is creating a variable "strataident" with values from 1 to 12 identifying the different strata. I do this for the population dataset, so the number of firms falling into each stratum is representative for the population. The following code will provide me a stratified random sample that is representative for the population.
gsample 10, percent strata(strataident) wor
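For completeness, the identifier itself can be built from the three stratum variables in one line (a sketch; size, sector, and intensity are assumed to be the names of the stratum variables):
* combine the three stratum variables into one identifier taking values 1-12
egen strataident = group(size sector intensity)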
This command works as well and is even simpler, since gsample accepts the stratum variables directly (see the example in answer 1):
gsample 10, percent wor strata(size sector intensity)

The problem is that strata may "overlap", so you may have to rebalance the sample after the initial draw.
Now the question is how this can be implemented. The final sample should match the population proportions as closely as possible.
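A minimal sketch of one way to check the result, assuming the population dataset is in memory and strataident exists: tabulate the stratum shares before and after drawing and compare the percent columns.
* population shares per stratum
tabulate strataident
* draw a 10% proportionate sample without replacement (sampled obs stay in memory)
gsample 10, percent wor strata(strataident)
* sample shares per stratum; these should be close to the population shares
tabulate strataident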


Generating group mean for a continuous variable in Stata

I have the percent change of a variable over 20 years and want the average percent change for each 3-year window. So, suppose I have data from 2000-2020: I want the average of 2000, 2001, 2002; then 2001, 2002, 2003; and so on, in groups of 3, up to 2018, 2019, 2020, in Stata.
Please help me with the code.
This is just a running mean or moving average. (For some reason, running average and moving mean aren't expressions I ever hear.) So you need to tsset or xtset your data and then look at help tssmooth ma.
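A minimal sketch, assuming a yearly time variable year and a percent-change variable pctchange (both names are assumptions):
* declare the time structure
tsset year
* 3-year trailing moving average: two lags plus the current observation
tssmooth ma avg3 = pctchange, window(2 1 0)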

Independent variable to find seasonality effect?

I'm not sure if it's right to ask this here, but any help is greatly appreciated. I'm working in SAS Forecast Studio.
This is my time series dataset (quarterly data):
Date e.g. 1-Jan-80, 1-Apr-80, 1-Jul-80
DateQ e.g. 1980Q1, 1980Q2, 1980Q3
Year e.g. 1980, 1981, 1982
GDP (dependent variable) e.g. 2650.1
T e.g. 1, 2, 3
Which of these variables should I use as the independent variable in a linear regression to evaluate whether there is a seasonal effect, or should I create a new quarterly variable?
Seasonal effects should not be identified using a simple linear regression on the time variable when analyzing time-series data. But, to answer your question: use date with the intnx() function to convert it to quarters.
data want;
    format quarter yyq.;
    set have;
    * first day of the quarter containing date;
    quarter = intnx('quarter', date, 0, 'B');
run;
Seasonal effects can be identified a number of ways:
1. Graphing it
If a time series has a seasonal effect, it will tend to be clear. Simply looking at a graph of the data will let you know whether it is seasonal at your chosen interval.
In sashelp.air, it's very clear that there is a 12-month season.
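For reference, the graph itself takes one step (a sketch using sashelp.air):
proc sgplot data=sashelp.air;
    series x=date y=air;  /* the 12-month pattern is visible immediately */
run;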
2. Spectral Density Analysis
proc timeseries will give you a spectral density analysis to help identify significant seasons within the data; peaks indicate possible cycles or seasons. You will need to filter down to a reasonable range of periods, since the density can increase sharply beyond a certain point without being representative of the true season.
Forecast Studio and Time Series Studio will do this for you and can give you similar output to the below.
proc timeseries data=sashelp.air
                outspectra=outspectra;
    id date interval=month;
    var air;
    spectra;
run;
proc sgplot data=outspectra;
    where period between 1 and 24;
    scatter x=period y=p;
    series x=period y=p;
run;
We can see a strong indicator for a seasonality of 12. We also see some potential 3-month and 6-month cycles that could be tested within a model for significance.
3. ACF/PACF/IACF plots
Your ACF/PACF/IACF plots in Forecast Studio will also help you identify clear seasons.
The classic decaying suspension-bridge look in the ACF is indicative of a seasonal effect: note how the ACF rises again around lag 12 and then decays. Additionally, the significant negative spikes at lag 12 in the PACF and IACF plots are further indicators of a significant seasonal effect at 12.
Model Building and Testing
Tools like the seasonal augmented Dickey-Fuller test available in Forecast Studio can help you check whether you have captured the seasonality and achieved stationarity after differencing.
The selection boxes in the Series view allow you to quickly add simple or seasonal differencing. Selecting (1) for simple differencing will add one simple difference, i.e.:
y = y - lag(y)
Selecting (1) for seasonal differencing will add one seasonal difference. Note that when you create a project in Forecast Studio, the season is automatically diagnosed and assumed; this assumption should be checked against the diagnostics above for the best guess at the true season. In our case, we have assumed the season is 12. This would be equivalent to:
y = y - lag12(y)
We can then use stationarity tests to ensure we've achieved stationarity. In our case, we'll add 1 simple and seasonal difference.
Notice how our white noise plot has improved and our spikes at 12 have decreased to non-significance. Additionally, our stationarity tests are looking good and significant - that is, there is no unit root present.
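Outside Forecast Studio, the same differencing and stationarity check can be sketched with PROC ARIMA (sashelp.air again; the (1,12) differencing orders and the number of ADF lags are choices, not requirements):
proc arima data=sashelp.air;
    /* one simple and one seasonal (lag-12) difference, plus ADF tests */
    identify var=air(1,12) stationarity=(adf=2);
run;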
Adding Seasonal or Cyclical Effects
Your model choice will dictate how seasonal or cyclical effects are added. Differencing in an ARIMA model will take care of seasonality. Dummy variables can be used for additional cyclical effects in the ARIMA model. For example:
data want;
    set have;
    * 0/1 indicators for quarters 1-3 (quarter 4 is the reference level);
    q1 = (qtr(date) = 1);
    q2 = (qtr(date) = 2);
    q3 = (qtr(date) = 3);
run;
Unobserved components models (UCMs) can take care of all of these by adding both seasonal and cyclical components. Holt-Winters ESMs take care of trend and seasonality without requiring dummy variables. Your modeling goals and the performance of each type of model will dictate which one you choose.
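As an illustration of the ESM route, a minimal Holt-Winters sketch on sashelp.air (the output dataset name and the 12-period lead are assumptions):
proc esm data=sashelp.air outfor=forecasts lead=12;
    id date interval=month;
    /* additive Winters: level, trend, and seasonal smoothing */
    forecast air / model=addwinters;
run;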

Add stars to p<.05 in correlation matrix in Stata

I'm hoping to add one star for p<.05 and two stars for p<.001 in a correlation matrix in Stata. This is the code that I'm currently using. It still generates a correlation matrix, but no stars appear in the places where they should. Thanks for your help!
asdoc corr RELATIONSHIP anxiety BEH_SIM SIM_VALUES sptconf NEG_EFFICACY spteffort SPTEFFORT_OTHER COOP_MOTIV COMP_MOTIV, star(0.5), replace
First, you need to use pwcorr rather than corr to be able to add stars to your correlation matrix. Second, you should not have the second comma right after the star option.
For example, the code below will output a correlation matrix with one star if significant at the 10% level, two stars if significant at the 5% level, and three stars if significant at the 1% level.
asdoc pwcorr var1 var2 var3, star(all) replace
I do not believe you can map specific numbers of stars to the significance levels you would like using asdoc. You can specify a custom significance level by using star(.05) rather than star(all) as above, but this puts one star by every correlation coefficient significant at the 5% level, and I do not think you can specify more than one level at a time.
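Under that constraint, a corrected version of your original command would look like this (one star at the 5% level only; a sketch, variable list unchanged):
asdoc pwcorr RELATIONSHIP anxiety BEH_SIM SIM_VALUES sptconf NEG_EFFICACY spteffort SPTEFFORT_OTHER COOP_MOTIV COMP_MOTIV, star(.05) replace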
The author of asdoc is Professor Attaullah Shah. He is very helpful and responsive, so you might ask him; if this is not currently possible, he may add your suggestion to a future asdoc update. Here is a link to his website: https://fintechprofessor.com/2019/06/01/export-correlation-table-to-word-with-stars-and-significance-level-using-asdoc/

Large z-score values

We were working on large telecom datasets. When we standardized the data, we got very large z-scores, varying from -0.xxx up to 300 or 400!
These attributes have, for example, a min of 0 and a max of about 4,000,000.
Yes, some variables have outliers. Will clustering give good results without dealing with the outliers?
The results of proc fastclus with 8 clusters are very unbalanced: one cluster (the seventh) has 1,600,000 observations, and there is also one with a single observation.
What's our problem?
Your variables are likely very skewed.
The use of z-standardization on such variables is questionable; you should probably look into Box-Cox transformations, too.
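A minimal sketch of that idea, assuming a heavily skewed nonnegative variable named usage (the name is an assumption); the log transform is the simplest member of the Box-Cox family:
data transformed;
    set have;
    /* compresses the long right tail; the +1 keeps zero values defined */
    log_usage = log(usage + 1);
run;
proc fastclus would then be run on the transformed (and re-standardized) variables rather than the raw ones.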

SAS sequential regression (in Quandt's log likelihood method)

I am coding in SAS Enterprise Guide 4.2.
I am trying to calculate Quandt's log-likelihood ratio, but it is not important to understand that to understand my question.
The ratio is based on sequential regressions,
namely regressions from 1 to t0, where 1 <= t0 <= T and T is the sample size.
Illustration:
First perform regression on the first observation
Then perform regression on the first two observations
Then perform regression on the first 3 observations
...and so on
It also performs a "forward regression" from t0+1 to T.
Illustration:
First perform regression on the last T-1 observations
Then perform regression on the last T-2 observations
Then perform regression on the last T-3 observations
...and so on
The regression is an Ordinary Least Squares regression.
After each regression is performed, the squared residuals are summed.
So this is what I need.
For each observation t0 I want to:
do an OLS regression from 1 to t0 and sum up the square of the residuals
do an OLS regression from t0+1 to T and sum up the square of the residuals
The data consists of one group variable, one dependent variable and one independent variable.
The calculations should be performed grouped by the group variable (but that shouldn't be too difficult).
I have been able to do part of this task myself, but my solution is horribly inefficient, and since the data consist of over 1,000,000,000 observations, efficiency is very important.
I have also noticed that the procedure autoreg calculates the CUSUM statistic, which is also based on sequential regressions, so I suspect this functionality could be available in SAS, but I haven't been able to find it.
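For reference, here is roughly what that autoreg functionality looks like (a sketch; y, x, have, and the output names are assumptions, and PROC AUTOREG's OUTPUT statement can emit recursive residuals and CUSUM series):
proc autoreg data=have;
    by group;
    model y = x;
    /* recursive residuals and the CUSUM statistic per observation */
    output out=seq recres=rr cusum=cs;
run;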
And the part I am struggling with most right now is the summation.
Simple example of the summation I want to do:
Input:
col1 col2
1 2
2 5
5 4
7 6
Output:
col3
2 =1*2
15 =1*5+2*5
32 =1*4+2*4+5*4
90 =1*6+2*6+5*6+7*6
Has anyone encountered a similar problem or have any ideas on how to solve it in an efficient way?
All help is welcome and feel free to ask me to clarify something if it is unclear.
As far as the summation goes, the below should work (though your input dataset must be sorted by group first).
Since the summation you're asking for is basically col2 multiplied by the cumulative sum of col1 within each group, you can use a retain statement to keep track of the sum of col1, and by-group processing to reset the cumulative sum each time the data step encounters a new group.
data output;
    retain cusum;                        /* carries the running total of col1 across rows */
    set input;
    by group;
    if first.group then cusum = col1;    /* reset the running total at each new group */
    else cusum = cusum + col1;
    col3 = cusum * col2;                 /* col2 times the cumulative sum of col1 */
    drop cusum;
run;
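A quick check against the four-row example above (group is added as a constant, since the step expects data sorted by group; a sketch):
data input;
    input group col1 col2;
datalines;
1 1 2
1 2 5
1 5 4
1 7 6
;
run;
Running the data step above on this input yields col3 = 2, 15, 32, and 90, matching the desired output.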