I am working on a research project that requires me to run a linear regression of stock returns (for thousands of companies) against the market return for every single day between 1993 and 2014.
The data would be similar to (This is dummy data):
| Ticker | Time | Stock Return | Market Return |
|----------|----------|--------------|---------------|
| Facebook | 12:00:01 | 1% | 1.5% |
| Facebook | 12:00:02 | 1.5% | 2% |
| ... | | | |
| Apple | 12:00:01 | -0.5% | 1.5% |
| Apple | 12:00:03 | -0.3% | 2% |
The data volume is huge: around 1.5 GB of data for each day, and there are 21 years of data that I need to analyze and run regressions on.
Regression formula is something similar to
Stock_Return = beta * Market_Return + alpha
where beta and alpha are two coefficients we are estimating. The coefficients are different for every company and every day.
Now, my question is: how do I output the beta and alpha for each company and each day into a data structure?
I was reading the SAS regression documentation, but the output appears to be text rather than a data structure.
The code from documentation:
proc reg;
model y=x;
run;
The output from the documentation (a printed results listing, not reproduced here):
There is no way that I can read over every beta for every company on every single day. There are tens of thousands of them.
Therefore, I was wondering if there is a way to output and extract the betas into a data structure?
I have a background in OOP languages (Python and Java), so SAS can be really confusing sometimes ...
SAS is in many ways similar to an object-oriented programming language, though it also has features of functional languages and 4GLs.
In this case, there is an object: the output delivery system object (ODS). Every procedure in SAS 9 that produces printed output produces it via the output delivery system, and you can generally obtain that output via ODS OUTPUT if you know the name of the object.
You can use ODS TRACE to see the names of the output produced by a particular proc.
data stocks;
set sashelp.stocks;
run;
ods trace on;
proc reg data=stocks;
by stock;
model close=open;
run;
ods trace off;
Note the names in the log. Then, whatever output you want, you just wrap the proc with ODS OUTPUT statements.
So if I want parameter estimates, I can grab them:
ods output ParameterEstimates=stockParams;
proc reg data=stocks;
by stock;
model close=open;
run;
ods output close;
You can have as many ODS OUTPUT statements as you want, if you want multiple datasets output.
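For readers coming from Python, as the asker is, it may help to see the shape of the result. The ParameterEstimates dataset holds the coefficients for each BY group; the sketch below does the equivalent by-group fit in plain Python (not SAS) and stores alpha and beta keyed by (ticker, day). The tickers, dates, and returns are made up for illustration, and `fit_line` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Made-up observations: (ticker, day, stock_return, market_return), in percent.
rows = [
    ("AAA", "2014-01-02", 3.0, 1.0),
    ("AAA", "2014-01-02", 5.0, 2.0),
    ("AAA", "2014-01-02", 7.0, 3.0),
    ("BBB", "2014-01-02", -0.5, 1.0),
    ("BBB", "2014-01-02", 0.0, 2.0),
    ("BBB", "2014-01-02", 0.5, 3.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Group the observations the way a BY statement would.
groups = defaultdict(lambda: ([], []))
for ticker, day, stock_ret, market_ret in rows:
    xs, ys = groups[(ticker, day)]
    xs.append(market_ret)
    ys.append(stock_ret)

# One (alpha, beta) pair per (ticker, day) group.
estimates = {key: fit_line(xs, ys) for key, (xs, ys) in groups.items()}

for (ticker, day), (alpha, beta) in sorted(estimates.items()):
    print(ticker, day, round(alpha, 4), round(beta, 4))
```

In SAS the same grouping would come from listing both the company and the day on the BY statement, with the input sorted accordingly.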
I have a dataset in Stata and would like to create a descriptive statistics table. The current problem is that my variables are both numerical and categorical. For the numerical variables, I know I can easily create a table with the mean, standard deviation and so on. My problem is with the categorical variables. For example, education has 5 levels, and I would like to show the proportion of observations at each level of the education variable. This is just part of it. I want to create an overall table that has descriptive statistics for other variables too, like gender, age, income, level of education and so on.
I like to use the user-contributed command table1 for this purpose. Type ssc install table1 to access the package.
sysuse auto
table1, vars(price contn \ rep78 cat)
+------------------------------------------------+
| Factor              Level   Value              |
|------------------------------------------------|
| N                           74                 |
|------------------------------------------------|
| Price, mean (SD)            6,165.3 (2,949.5)  |
|------------------------------------------------|
| Repair record 1978  1       2 (3%)             |
|                     2       8 (12%)            |
|                     3       30 (43%)           |
|                     4       18 (26%)           |
|                     5       11 (16%)           |
+------------------------------------------------+
Type help table1 for additional options.
asdocx has a comprehensive template for creating table1. The template can summarize different types of variables such as continuous and binary / categorical variables in a single table. Table1 template allows different statistics with categorical / factor variables, continuous variables, and binary variables. The allowed statistics are given below:
mean Mean of the variable
sd Standard deviation of the variables
ci 95% Confidence interval
n Counts
N Counts
frequency Counts
percentage Count as Percentage of total *
% Count as percentage of total
The statistics presented in the above table can be selectively used with categorical, binary, and continuous variables. The default statistics for each type of variables are given below:
(1) Binary variables : Count (Percentages)
(2) Categorical variables : Count (Percentages)
(3) Continuous variables : Mean (95% confidence interval)
The table1 template also supports survey weights. I have posted several examples on this page.
Hello, I would like to draw a SERIES plot with SGPLOT where the Y axis is the ratio of two values, expressed as a percentage.
For example I have:
|Month|Chicken_sold|Total_sold|
|-----|------------|----------|
|01   |5           |10        |
|02   |6           |13        |
|03   |4           |11        |
|04   |9           |9         |
I want a graph that has Month on the x axis and a calculated field, (Chicken_sold/Total_sold*100), on the y axis.
My code is something like this:
PROC SGPLOT DATA=Farm;
SERIES x=Month y=(Chicken_sold/Total_sold*100);
RUN;
Create your calculation within your dataset first.
data want;
set farm;
percent = Chicken_sold/Total_sold*100;
run;
proc sgplot data=want;
series x = month y = percent;
run;
Note that in CAS actions on Viya, a calculated variable like this is valid. This is done with the computedVars and computedVarsProgram parameters.
There are many other SAS PROCs that also let you run programs or functions within them, but SGPLOT is not one of them. Generally SGPLOT is designed around prepared data.
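The same "compute first, then plot" pattern, sketched in Python rather than SAS purely to show the intermediate table that the plotting step would receive. The rows are the example data from the question; the names `farm` and `want` mirror the dataset names used above.

```python
# Example rows from the question.
farm = [
    {"Month": "01", "Chicken_sold": 5, "Total_sold": 10},
    {"Month": "02", "Chicken_sold": 6, "Total_sold": 13},
    {"Month": "03", "Chicken_sold": 4, "Total_sold": 11},
    {"Month": "04", "Chicken_sold": 9, "Total_sold": 9},
]

# Add the derived column before any plotting happens.
want = [
    {**row, "percent": row["Chicken_sold"] / row["Total_sold"] * 100}
    for row in farm
]

for row in want:
    print(row["Month"], round(row["percent"], 1))
```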
I am running the following SAS code in SAS Enterprise Guide 6.1 to get some summary stats on null/not null for all the variables in a table. This is producing the desired info via the 'results' tab, which creates a separate table for each result showing null/not null frequencies and percentages.
What I'd like to do is put the results into an output dataset with all the variables and stats in a single table.
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt . ='Missing' other='Not Missing';
run;
proc freq data=mydatatable;
format _CHAR_ $missfmt.;
tables _CHAR_ / out=work.out1 missing missprint nocum;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / out=work.out2 missing missprint nocum;
run;
out1 and out2 are being generated into tables like this:
FieldName | Count | Percent
Not Missing | Not Missing | Not Missing
But are only populated with one variable each, and the frequency counts are not being shown.
The table I'm trying to create as output would be:
field | Missing | Not Missing | % Missing
FieldName1 | 100 | 100 | 50
FieldName2 | 3 | 97 | 3
Output options on the TABLES statement, such as OUT=, apply only to the last table requested. _CHAR_ expands to all character variables, but each variable gets its own one-way table, so the output dataset contains only the last one.
You can get this one of two ways. Either use PROC TABULATE, which more readily deals with lists of variables, or use ODS OUTPUT to grab the PROC FREQ output. Either way, it will likely take some work to get the output into exactly the structure you want.
ods output onewayfreqs=myfreqs; *use `ODS TRACE` to find this name if you do not know it;
proc freq data=sashelp.class;
tables _character_;
tables _numeric_;
run;
ods output close;
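As a rough picture of the reshaping that remains after the frequencies are captured, here is a minimal sketch in Python (not SAS) that builds the one-row-per-field table from the question. The records are invented for illustration, and `is_missing` is a hypothetical helper that mimics the two formats above (blank strings and null numerics count as missing).

```python
# Invented records; "" stands in for a blank character value,
# None for a missing numeric value.
records = [
    {"name": "Alfred", "sex": "M", "age": 14},
    {"name": "", "sex": "F", "age": None},
    {"name": "Carol", "sex": "", "age": 13},
    {"name": "", "sex": "F", "age": 12},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


summary = []
for field in records[0]:
    missing = sum(is_missing(r[field]) for r in records)
    summary.append(
        {
            "field": field,
            "Missing": missing,
            "Not Missing": len(records) - missing,
            "% Missing": 100 * missing / len(records),
        }
    )

for row in summary:
    print(row)
```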
I am trying to calculate the 95% binomial Wilson confidence interval for the proportion of people completing treatment by year (dataset is line-listed for each person).
I want to store the results into a matrix so that I can use the putexcel command to export the results to an existing Excel spreadsheet without changing the formatting of the sheet. I have created a binary variable dscomplete_binary which is 0 for a person if treatment was not completed, and 1 if treatment was completed.
I have tried the following:
bysort year: ci dscomplete_binary, binomial wilson level(95)
This gives output of each year with the 95% confidence intervals. Previously I used statsby to collapse the dataset to store the results in variables but this clears the dataset from the memory and so I have to constantly re-open it.
Is there a way to run the command and store the results in a tabular format so that the data is stored in a similar way to this:
year mean LowerCI UpperCI
r1 2005 .7031588 .69229454 .71379805
r2 2006 .75532377 .74504232 .7653212
r3 2007 .78125924 .77125096 .79094833
r4 2008 .80014324 .79059798 .80935836
r5 2009 .81860977 .80955398 .82732689
r6 2010 .82641232 .81723672 .83522016
r7 2011 .81854123 .80955547 .82719356
r8 2012 .83497983 .82621944 .8433823
r9 2013 .85411799 .84527379 .86253893
r10 2014 .84461939 .83499599 .85377985
I have tried the following commands, which give different estimates to the binomial Wilson option:
svyset id2
bysort year: eststo: ci dscomplete_binary, binomial wilson level(95)
I think the postfile family of commands will help you here. This won't save your data into a matrix, but will save the results of the ci command into a new data set, which you name and whose structure you set. After the analysis is complete, you can load the data saved by postfile and export to Excel in the manner of your choosing.
For postfile, you analyze the data in a loop instead of using by or bysort.
Assuming the years in your data run 2005-2014, here is sample code:
/*make sure no postfile is open, in case a previous run did not close the file*/
cap postclose ci_results
/*create the postfile that will store results*/
postfile ci_results year mean lowerCI upperCI using ci_results.dta, replace
/*loop through years*/
forval y = 2005/2014 {
    ci dscomplete_binary if year==`y', binomial wilson level(95)
    /*store saved results from ci to postfile. Make sure the post statement names the postfile and lists the results in the same order stated in the postfile command.*/
    post ci_results (`y') (r(mean)) (r(lb)) (r(ub))
}
/*close the postfile once you've looped through all the cases of interest*/
postclose ci_results
use ci_results.dta, clear
Once you load the ci_results.dta data into memory, you can apply any Excel exporting command you like.
This is a development of the suggestion already made to use statsby. The objections to it are quite puzzling, as it is easy to get back to the original dataset. There is some machine time in re-loading a dataset, but how much personal time has been spent in pursuit of an alternative?
Absent a dataset which we can use, I've provided a reproducible example.
If you wish to do this repeatedly, you'll write a more elaborate program to do it, which is what this forum is all about.
I leave how to export results to Excel as a matter for those so inclined: no details of what is wanted are provided in any case.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. statsby mean=r(mean) ub=r(ub) lb=r(lb), by(rep78) : ci foreign, binomial wilson level(95)
(running ci on estimation sample)
command: ci foreign, binomial wilson
mean: r(mean)
ub: r(ub)
lb: r(lb)
by: rep78
Statsby groups
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
. list
     +----------------------------------------+
     | rep78       mean         ub         lb |
     |----------------------------------------|
  1. |     1          0   .6576198          0 |
  2. |     2          0   .3244076          0 |
  3. |     3         .1   .2562108   .0345999 |
  4. |     4         .5   .7096898   .2903102 |
  5. |     5   .8181818   .9486323   .5230194 |
     +----------------------------------------+
. restore
. describe
The describe results will show that we are back where we started.
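For anyone who wants to check such numbers by hand, the bounds in the listing are consistent with the standard Wilson score interval without continuity correction. A minimal Python sketch follows; `wilson_interval` is a hypothetical helper name, and the counts used (9 successes out of 18 for rep78 == 4, 0 out of 2 for rep78 == 1) are inferred from the means and group sizes shown above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Return (proportion, lower, upper) for the Wilson score interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point spill at the edges.
    return p, max(0.0, center - half), min(1.0, center + half)


mean4, lb4, ub4 = wilson_interval(9, 18)   # rep78 == 4 in the listing
mean1, lb1, ub1 = wilson_interval(0, 2)    # rep78 == 1 in the listing
print(mean4, lb4, ub4)
print(mean1, lb1, ub1)
```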
I have a dataset that results from the joins between a few results from a proc univariate.
After some more joins, I have a final dataset with a variable called "Measure", which has the name of certain measures, like 'mean' and 'standard deviation', for example, and other variables each with values for these measures, representing a month in a certain year.
I'd like to sort these measures in a particular order and, for now, I'm doing a proc transpose, then a retain to establish the order I want, and then another transpose. The problem is that this is a really naive solution, and I feel it takes longer than it should.
Is there a simpler/more effective way to do this sort?
An example of what I want to do, with random values:
What I have:
Measures | 2013/01 | 2013/02 | 2013/03
Mean | 10 | 9 | 11
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
What I want:
Measures | 2013/01 | 2013/02 | 2013/03
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
Mean | 10 | 9 | 11
I hope I was clear enough.
Thanks in advance
A couple of straightforward solutions. First, you could simply add a variable that you sort by and then drop. You don't need to transpose; just do it in the data step or PROC SQL after the join: if measures='Mean' then sortorder=3; else if measures='Median' then sortorder=2; ... Then sort by sortorder and drop it in the PROC SORT step.
Second, if you're using entirely numeric values, you can use PROC MEANS to do the sorting for you, with a custom format that defines the order (using NOTSORTED and order=data on the class statement) and idgroup functionality in PROC MEANS to do the sorting and output the right values. This is overkill in most cases, but if the dataset is huge it might be appropriate.
Third, if you're doing the joins in SQL, you can ORDER BY an expression that maps the variable to the order you want. I can explain that in more detail if you find that the most useful.
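The first suggestion (a helper sort key that is used and then discarded) can be sketched in Python rather than SAS, using the example rows from the question. The `sort_order` mapping plays the role of the temporary sortorder variable; its values are simply the desired positions.

```python
# Example rows from the question.
rows = [
    {"Measures": "Mean", "2013/01": 10, "2013/02": 9, "2013/03": 11},
    {"Measures": "Std Devi.", "2013/01": 1, "2013/02": 1, "2013/03": 1},
    {"Measures": "Median", "2013/01": 3, "2013/02": 5, "2013/03": 4},
]

# Desired position of each measure; acts like the temporary sortorder variable.
sort_order = {"Std Devi.": 1, "Median": 2, "Mean": 3}

# Sort on the mapped key; the key itself never becomes a column.
ordered = sorted(rows, key=lambda row: sort_order[row["Measures"]])

for row in ordered:
    print(row)
```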