sas proc tabulate (freq) - sas

I have the following question:
Some sample data:
data;
input id article sex count;
datalines;
1 139 1 2
2 139 2 2
3 146 2 1
4 146 2 2
5 146 1 0
6 111 2 10
6 111 1 1
;
run;
Now, I have this code:
proc tabulate;
freq count;
class article sex;
table article, sex /misstext='0';
run;
Is there any difference compared to the following code?
proc tabulate;
var count;
class article sex;
table article, sex*count;
run;
Or does does it exactly do the same thing? Which one is recommendable?

Take notice of the output produced by the run of the two tabulate variations.
For the data set at hand the results are the same, presented differently.
The first has sex class cells that are an implicit frequency (N) computation that is count weighted, also implicitly formatted as an integer. The implicits are default behavior in absence of other statements and options.
The second has sex class cells that are the computed sum of count, formatted with default 2 decimal places.
If the data set had additional var variables used in table, the statistical computations to perform, and the role of weighting, would be dependent on the nature of presentation you are making and the audience consuming it. You might want or not want 'count' frequency weighting affecting the statistical computations.
Ask 5 people for a recommendation, you might get 6!
From online documentation, compare the details of the FREQ statement to the WEIGHT statement:
FREQ variable;
Required Argument
variable
specifies a numeric variable whose value represents the frequency of the observation. If you use the FREQ statement, then the procedure assumes that each observation represents n observations, where n is the value of variable. If n is not an integer, then SAS truncates it. If n is less than 1 or is missing, then the procedure does not use that observation to calculate statistics.
The sum of the frequency variable represents the total number of observations.
and
WEIGHT variable;
Required Argument
variable
specifies a numeric variable whose values weight the values of the analysis variables. The values of the variable do not have to be integers. PROC TABULATE responds to weight values in accordance with the following table.
0 : Counts the observation in the total number of observations
<0 : Converts the value to zero and counts the observation in the total number of observations
.missing : Excludes the observation
To exclude observations that contain negative and zero weights from the analysis, use EXCLNPWGT. Note that most SAS/STAT procedures, such as PROC GLM, exclude negative and zero weights by default.
Note: Prior to Version 7 of SAS, the procedure did not exclude the observations with missing weights from the count of observations.
Restrictions
To compute weighted quantiles, use QMETHOD=OS in the PROC statement.
PROC TABULATE will not compute MODE when a weight variable is active. Instead, try using PROC UNIVARIATE when MODE needs to be computed and a weight variable is active.
Interaction
If you use the WEIGHT= option in a VAR statement to specify a weight variable, then PROC TABULATE uses this variable instead to weight those VAR statement variables.
Tip
When you use the WEIGHT statement, consider which value of the VARDEF= option is appropriate. See the discussion of VARDEF=divisor and the calculation of weighted statistics in the Keywords and Formulas section of this document.

Related

In SAS: How to produce odds ratios (OR) in the results of PROC GENMOD

Possibly related to this question: How can I print odds ratios as part of the results of a GENMOD procedure?
I am dealing with a wide dataset containing; a main exposure variable, a categorical variable Type (four levels), as several continuous and binary variables as confounding factors.
Additional info: The dataset contains multiple imputations.
I am using the following code:
Proc genmod;
Class ID Type (ref=first)
Model class1= Type;
estimate 'black' TYPE 0 1 1/exp;
estimate 'white' TYPE 1 0 1/exp;
estimate 'red' TYPE 0 1 0/exp;
Repeated ID;
By imputation;
Run;
I expected the results table to contain, among others, the beta for the exponential of every level of the categorical variable Type ( bar that variable's reference group). The actual results table lacks beta values, nor does the table have confidence intervals printed.
What syntax should I use to tell SAS to produce those numbers in the results? I have looked through SAS documentation, but I have yet found an answer.

How to compute the effect size for Friedman's test using sas?

I was looking for a way to compute the effect size for Friedman's test using sas, but could not find any reference. I wanted to see if there is any difference between the groups and what its size was.
Here is my code:
proc freq data=mydata;
tables id*b*y / cmh2 scores=rank noprint;
run;
These are the results:
The FREQ Procedure
Summary Statistics for b by y
Controlling for id
Cochran-Mantel-Haenszel Statistics (Based on Rank Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 230.7145 <.0001
2 Row Mean Scores Differ 1 230.7145 <.0001
This question is correlated with the one posted on Cross Validated, that is concerned with the general statistical formula to compute the effect size for Friedman's test. Here, I would like to find out how to get the effect size in sas.

calculate market weighted return using SAS

I have four variables Name, Date, MarketCap and Return. Name is the company name. Date is the time stamp. MarketCap shows the size of the company. Return is its return at day Date.
I want to create an additional variable MarketReturn which is the value weighted return of the market at each point time. For each day t, MarketCap weighted return = sum [ return(i)* (MarketCap(i)/Total(MarketCap) ] (return(i) is company i's return at day t).
The way I do this is very inefficient. I guess there must be some function can easily achieve this traget in SAS, So I want to ask if anyone can improve my code please.
step1: sort data by date
step2: calculate total market value at each day TotalMV = sum(MarketCap).
step3: calculate the weight for each company (weight = MarketCap/TotalMV)
step4: create a new variable 'Contribution' = Return * weight for each company
step5: sum up Contribution at each day. Sum(Contribution)
Weighted averages are supported in a number of SAS PROCs. One of the more common, all-around useful ones is PROC SUMMARY:
PROC SUMMARY NWAY DATA = my_data_set ;
CLASS Date ;
VAR Return / WEIGHT = MarketCap ;
OUTPUT
OUT = my_result_set
MEAN (Return) = MarketReturn
;
RUN;
The NWAY piece tells the PROC that the observations should be grouped only by what is stated in the CLASS statement - it shouldn't also provide an ungrouped grand total, etc.
The CLASS Date piece tells the PROC to group the observations by date. You do not need to pre-sort the data when you use CLASS. You do have to pre-sort if you say BY Date instead. The only rationale for using BY is if your dataset is very large and naturally ordered, you can gain some performance. Stick to CLASS in most cases.
VAR Return / WEIGHT = MarketCap tells the proc that any weighted calculations on Return should use MarketCap as the weight.
Lastly, the OUTPUT statement specifies the data set to write the results to (using the OUT option), and specifies the calculation of a mean on Return that will be written as MarketReturn.
There are many, many more things you can do with PROC SUMMARY. The documentation for PROC SUMMARY is sparse, but only because it is the nearly identical sibling of PROC MEANS, and SAS did not want to produce reams of mostly identical documentation for both. Here is the link to the SAS 9.4 PROC MEANS documentation. The main difference between the two PROCS is that SUMMARY only outputs to a dataset, while MEANS by default outputs to the screen. Try PROC MEANS if you want to see the result pop up on the screen right away.
The MEAN keyword in the OUTPUT statement comes from SAS's list of statistical keywords, a helpful reference for which is here.

Macro that outputs table with testing results of SAS table

Problem
I'm not a very experienced SAS user, but unfortunately the lab where I can access data is restricted to SAS. Also, I don't currently have access to the data since it is only available in the lab, so I've created simulated data for testing.
I need to create a macro that gets the values and dimensions from a PROC MEANS table and performs some tests that check whether or not the top two values from the data make up 90% of the results.
As an example, assume I have panel data that lists firms revenue, costs, and profits. I've created a table that lists n, sum, mean, median, and std. Now I need to check whether or not the top two firms make up 90% of the results and if so, flag if it's profit, revenue, or costs that makes up 90%.
I'm not sure how to get started
Here are the steps :
Read the data
Read the PROC MEAN table created, get dimensions, and variables.
Get top two firms in each variable and perform check
Create new table that lists variable, value from read table, largest and second largest, and flag.
Then print table
Simulated data :
https://www.dropbox.com/s/ypmri8s6i8irn8a/dataset.csv?dl=0
PROC MEANS Table
proc import datafile="/folders/myfolders/dataset.csv"
out=dt
dbms=csv
replace;
getnames=yes;
run;
TITLE "Macro Project Sample";
PROC MEANS n sum mean median std;
VAR V1 V2 V3;
RUN;
Desired Results :
Value Largest Sec. Largest Flag
V1 463138.09 9888.09 9847.13
V2 148.92 1.99 1.99
V3 11503375 9999900 1000000 Y
At the moment I can't open your simulated dataset but I can give you some advices, hope they will help.
You can add the n extreme values of given variables using the 'output out=' statement with the option IDGROUP.
Here an example using charity dataset ( run this to create it http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p1oii7oi6k9gfxn19hxiiszb70ms.htm)
proc means data=Charity;
var MoneyRaised HoursVolunteered;
output out=try sum=
IDGROUP ( MAX (Moneyraised HoursVolunteered) OUT[2] (moneyraised hoursvolunteered)=max1 max2);
run;
data var1 (keep=name1 _freq_ moneyraised max1_1 max1_2 rename=(moneyraised=value max1_1=largest max1_2=seclargest name1=name))
var2 (keep=name2 _freq_ HoursVolunteered max2_1 max2_2 rename=(HoursVolunteered=value max2_1=largest max2_2=seclargest name2=name));
length name1 name2 $4;
set try ;
name1='VAR1';
name2='VAR2';
run;
data finalmerge;
length flag $1;
set var1 var2;
if largest+seclargest > value*0.9 then flag='Y';
run;
in the proc means I choose to variables moneyraised and hoursvolunteered, you will choose your var1 var2 var3 and make your changes in all the program.
The IDgroup will output the max value for both variables, as you see in the parentheses, but with out[2], obviously largest and second largest.
You must rename them, I choose to rename max1 and max 2, then sas will add an _1 and _2 to the first and the second max values automatically.
All the output will be on the same line, so I do a datastep referencing 2 datasets in output (data var1 var2) keeping the variables needed and renaming them for the next merge, I also choose a naming system as you see.
Finally I'll merge the 2 datasets created and add the flag.
Here are some initial steps and pointers in a non macro approach which restructures the data in such a manner that no array processing is required. This approach should be good for teaching you a bit about manipulating data in SAS but will not be as fast a single pass approach (like the macros you originally posted) as it transposes and sorts the data.
First create some nice looking dummy data.
/* Create some dummy data with three variables to assess */
data have;
do firm = 1 to 3;
revenue = rand("uniform");
costs = rand("uniform");
profits = rand("uniform");
output;
end;
run;
Transpose the data so all the values are in one column (with the variable names in another).
/* Move from wide to deep table */
proc transpose
data = have
out = trans
name = Variable;
by firm;
var revenue costs profits;
run;
Sort the data so each variable is in a contiguous group of rows and the highest values are at the end of each Variable group.
/* Sort by Variable and then value
so the biggest values are at the end of each Variable group */
proc sort data = trans;
by Variable COL1;
run;
Because of the structure of this data, you could go down through each observation in turn, creating a running total, which when you get to the final observation in a Variable group would be the Variable total. In this observation you also have the largest value (the second largest was in the previous observation).
At this point you can create a data step that:
Is aware when it is in the first and last values of each variable group
by statement to make the data step aware of your groups
first.Variable temporary variable so you can initialise your total variable to 0
last.Variable temporary variable so you can output only the last line of each group
Sums up the values in each group
retain statement so SAS doesn't empty your total with each new observation
sum() function or + operator to create your total
Creates and populates new variables for the largest and second largest values in each group
lag() function or retain statement to keep the previous value (the second largest)
Creates your flag
Outputs your new variables at the end of each group
output statement to request an observation be stored
keep statement to select which variables you want
The macros you posted originally looked like they were meant to perform the analysis you are describing but with some extras (only positive values contributed to the Total, an arbitrary number of values could be included rather than just the top 2, the total was multiplied by another variable k1198, negative values where caught in the second largest, extra flags and values were calculated).

SAS: How do I point to a specific observation of a value?

I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable where I sum the value of one observation of Number to each other observations, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How to I get the value of third observation of Number and add this to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
run;
data want;
retain adder;
drop adder;
if _n_=1 then do;
adder = 3;
set have point=adder;
adder = number;
end;
set have;
number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and adds applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer will solve your explicit problem, it's probably not something useful in a real world scenario; usually in the real world you'd have a match key or some other element other than 'row number' to combine with, and if you did then you could do it much more efficiently. You also likely could rearrange your data structure in a way that made this operation more convenient.
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
set have;
_t = 3;
set have(rename=number=number3 keep=number) point=_t ;
number2=number+number3;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
a= {1 2, 2 5, 3 6, 4 1}; *create initial matrix;
b = a[,2] + a[3,2]; *create a new matrix which is the 2nd column of a added
elementwise to the value in the third row second column;
c = a||b; *append new matrix to a - could be done in same step of course;
print b c;
quit;
To do this with the First observation, it's a lot easier.
data want;
set have;
retain _firstpoint; *prevents _firstpoint from being set to missing each iteration;
if _n_ = 1 then _firstpoint=number; *on the first iteration (usually first row) set to number's value;
number = number - _firstpoint; *now subtract that from number to get relative value;
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs on the other hand may not behave this way, though many do at some level). SAS, like SQl and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend that it is, based on the current sort. The POINT= random access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do rather than some related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
if n = 1 then set have(where=(ID=3) rename=number=number3);
set have;
number2=number+number3;
run;
That in the first iteration of the data step (_N_=1) takes the row from HAVE where Id=3, and then takes the lines from have in order (really it does this:)
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables that are in a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify the value, it will stay for each iteration.