I have the following dataset:
Date Occupation Total_Employed
1/1/2005 Teacher 45
1/1/2005 Economist 76
1/1/2005 Artist 14
2/1/2005 Doctor 26
2/1/2005 Economist 14
2/1/2005 Mathematician 10
and so on until November 2014
What I am trying to do is to calculate a column of percentage of employed by occupation such that my data will look like this:
Date Occupation Total_Employed Percent_Emp_by_Occupation
1/1/2005 Teacher 45 33.33
1/1/2005 Economist 76 56.30
1/1/2005 Artist 14 10.37
2/1/2005 Doctor 26 52.00
2/1/2005 Economist 14 28.00
2/1/2005 Mathematician 10 20.00
where Percent_Emp_by_Occupation is each occupation's Total_Employed divided by the sum of Total_Employed over all occupations for the same date (month and year), expressed as a percentage:
Example for Teacher: (45/135)*100, where 135 is the sum of 45+76+14
I know I can get a table via proc tabulate, but I was wondering if there is any way of getting it through another procedure, especially since I want this as a separate dataset.
What is the best way to go about doing this? Thanks in advance.
Extract month and year from the date and create a key:
data ds;
set ds;
month=month(date);
year=year(date);
key=catx("_",month,year);
run;
Roll up the total at month level:
Proc sql;
create table month_total as
select key,sum(total_employed) as monthly_total
from ds
group by key;
quit;
Update the original data with the monthly total:
Proc sql;
create table ds as
select a.*,b.monthly_total
from ds as a left join month_total as b
on a.key=b.key;
quit;
This would lead to the following data set:
Date Occupation Total_Employed monthly_total
1/1/2005 Teacher 45 135
1/1/2005 Economist 76 135
1/1/2005 Artist 14 135
Finally calculate the percentage as:
data ds;
set ds;
percentage=100*total_employed/monthly_total; /* multiply by 100 to get 33.33 rather than 0.3333 */
run;
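For what it's worth, the key, roll-up, join and divide steps above can probably be collapsed into a single query, because PROC SQL remerges a group-level sum back onto the detail rows. This is only a minimal sketch: it assumes the question's sample rows, groups on date directly (the dates are all first-of-month, so that is the same as the month/year key), and uses an output name, ds_pct, that is made up for the illustration.

```sas
/* Sample rows copied from the question; ds follows the naming used above */
data ds;
    input date :mmddyy10. occupation :$20. total_employed;
    format date mmddyy10.;
    datalines;
1/1/2005 Teacher 45
1/1/2005 Economist 76
1/1/2005 Artist 14
2/1/2005 Doctor 26
2/1/2005 Economist 14
2/1/2005 Mathematician 10
;
run;

proc sql;
    /* ds_pct is a hypothetical output name */
    create table ds_pct as
    select date,
           occupation,
           total_employed,
           /* sum() is evaluated per date and remerged onto every row */
           100 * total_employed / sum(total_employed)
               as percent_emp_by_occupation format=6.2
    from ds
    group by date;
quit;
```

SAS writes a note to the log saying the query required remerging summary statistics with the original data; that is expected here and is what makes the one-step version work.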
Here you go:
proc sql;
create table occ2 as
select
a.*,
a.total_employed/b.employed_by_date as percentage_employed_by_date format=percent7.1
from
occ a
join
(select
date,
sum(total_employed) as employed_by_date
from occ
group by date) b
on
a.date = b.date
;
quit;
Produces a table like so:
One last thought: you can create all of the totals you need for this calculation in one pass of the data. I looked at a prior question you asked about this data and assumed that you used proc means to summarize your initial data by date and occupation. You can calculate the totals by date in the same procedure. I don't have your data, so I'll illustrate the concept with the sashelp.class data set that comes with every SAS installation.
In this example, I want to get the total number of students by sex and age, but I also want to get the total students by sex because I will calculate the percentage of students by sex later. Here's how to summarize the data and get counts for 2 different levels of summary.
proc summary data=sashelp.class;
class sex age;
types sex sex*age;
var height;
output out=summary (drop=_freq_) n=count;
run;
The types statement identifies the levels of summary of my class variables. In this case, I want counts of just sex, as well as the counts of sex by age. Here's what the output looks like.
The _TYPE_ variable identifies the level of summary. The total count of sex is _TYPE_=2 while the count of sex by age is _TYPE_=3.
Then a simple SQL query to calculate the percentages within sex.
proc sql;
create table summary2 as
select
a.sex,
a.age,
a.count,
a.count/b.count as percent_of_sex format=percent7.1
from
summary (where=(_type_=3)) a /* sex * age */
join
summary (where=(_type_=2)) b /* sex */
on
a.sex = b.sex
;
quit;
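Applied to the employment data, the same one-pass pattern might look like the sketch below. It assumes the question's sample rows sit in a data set called occ (the name used in the earlier answer); totals, occ_pct and the employed column are names invented for this illustration.

```sas
/* Sample rows from the question; occ is the assumed data set name */
data occ;
    input date :mmddyy10. occupation :$20. total_employed;
    format date mmddyy10.;
    datalines;
1/1/2005 Teacher 45
1/1/2005 Economist 76
1/1/2005 Artist 14
2/1/2005 Doctor 26
2/1/2005 Economist 14
2/1/2005 Mathematician 10
;
run;

/* One pass: totals by date (_TYPE_=2) and by date*occupation (_TYPE_=3) */
proc summary data=occ;
    class date occupation;
    types date date*occupation;
    var total_employed;
    output out=totals (drop=_freq_) sum=employed;
run;

proc sql;
    create table occ_pct as
    select a.date,
           a.occupation,
           a.employed as total_employed,
           a.employed / b.employed as percent_emp format=percent7.1
    from totals (where=(_type_=3)) a   /* date * occupation */
         join
         totals (where=(_type_=2)) b   /* date */
    on a.date = b.date;
quit;
```

Because date is the first CLASS variable, the date-only rows carry _TYPE_=2 here, mirroring the role of sex in the sashelp.class example.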
The answer is to look back at the questions you have asked in the last few days about this same data and study those answers. Your answer is there.
While you are reviewing those answers, take time to thank them and give someone a check for helping you out.
Related
I have a simple dataset that I would like in a single output table. I would like the variables 'age' and 'sex' stacked against a third variable, 'q16'. An example of the expected/needed output is attached below. I also need to weight the table values using the field 'weight'.
Have tried various versions of proc tabulate, freq, report, but have not come up with a solution. What I'm hoping to get out of this post is a fresh look on my problem and see if the community has any other solutions that I can try.
data survey;
infile datalines dsd;
input age : $20. sex : $10. q16 : $20. weight;
datalines;
18 to 29,Male,VERY GOOD, 0.3984
46 to 64,Male,POOR, 1.6694
18 to 29,Female,POOR, 0.9696
46 to 64,Female,POOR, 0.6078
65 and over,Female,EXCELLENT, 1.0301
65 and over,Female,POOR, 0.7763
;
needed layout
As you can see in the attached image, it's two variables stacked vertically, but I need those two by a third variable called 'q16'. At this point, I'm not looking for design as much as replicating the table in the image with weighted values.
The TABULATE procedure can produce all the numbers; however, it has no feature for reporting multiple statistics in a single cell corresponding to a dimensional crossing. Each number gets its own cell.
A variable that has statistics computed has to be specified in the VAR statement, and counts or percents are for CLASS variables.
For example:
data have;
do person = 1 to 3218 + 1991;
length status $5;
status = ifc (ranuni(123) < 1991/(1991+3218), 'Dead', 'Alive');
if ranuni(123) < 0.001 then age = .; else age = floor(28+35*ranuni(123));
length gender $6;
if status = 'Dead'
then gender = ifc(ranuni(123) < 896 /(896+1095), 'Female', 'Male');
else gender = ifc(ranuni(123) < 1977/(1977+1241), 'Female', 'Male');
output;
end;
run;
proc tabulate data=have;
class status gender;
var age;
table
age * (N NMISS MEAN STD MIN MAX MEDIAN QRANGE)
gender * (N COLPCTN)
,
status;
run;
To get the exact table in your image you could compute the results via one or more statistics procedures and produce the output table via data _null_ and the ODSOUT component object.
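As a rough starting point for the weighted numbers themselves, one option is to treat weight as an analysis variable and ask TABULATE for its sum and column percentage. This is a sketch, not the exact layout in the image: it assumes the survey data set from the question, and the labels 'Weighted N' and 'Col %' are made up here.

```sas
/* Data step repeated from the question so the sketch runs on its own */
data survey;
    infile datalines dsd;
    input age : $20. sex : $10. q16 : $20. weight;
    datalines;
18 to 29,Male,VERY GOOD, 0.3984
46 to 64,Male,POOR, 1.6694
18 to 29,Female,POOR, 0.9696
46 to 64,Female,POOR, 0.6078
65 and over,Female,EXCELLENT, 1.0301
65 and over,Female,POOR, 0.7763
;
run;

proc tabulate data=survey;
    class age sex q16;
    var weight;                       /* summing the weights gives weighted counts */
    table (age sex),                  /* age and sex stacked in the row dimension */
          q16 * weight=' ' * (sum='Weighted N'*f=8.2
                              colpctsum='Col %'*f=6.1);
run;
```

Each statistic still lands in its own cell, as described above, so combining the count and the percentage into one cell would need a later reporting step.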
I am looking for some validation. This may be trivial for most, but I am by no means an expert at statistics. I am trying to select patients in the top 1% based on a score within each drug and location. The data would look something like this (on a much larger scale):
Patient drug place score
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
The code snippet I have written to calculate those top 1% patients is as follows:
proc sql;
create table example as
select *,
score/avg(score) as test_measure
from prior_table
group by drug, place
having test_measure>.99;
quit;
Does this achieve what I am trying to do, or am I going about it all wrong? Sorry if this is really trivial to most.
Thanks
There are multiple ways to calculate and estimate a percentile. A simple way is to use PROC SUMMARY:
proc summary data=have;
var score;
output out=pct p99=p99;
run;
This will create a data set named pct with a variable p99 containing the 99th percentile of score over the whole table (it is not split by drug and place).
Then filter your table for values >= p99:
proc sql noprint;
create table want as
select a.*
from have as a
where a.score >= (select p99 from pct);
quit;
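Since the question asks for the top 1% within each drug and place, the same idea can probably be extended with a CLASS statement and a join on the grouping columns. A minimal sketch, assuming the question's sample rows in a data set called have; pct_by_group and want_by_group are names made up for this illustration.

```sas
/* Sample rows from the question */
data have;
    input patient $ drug $ place $ score;
    datalines;
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
;
run;

/* NWAY keeps only the drug*place combinations, one p99 per group */
proc summary data=have nway;
    class drug place;
    var score;
    output out=pct_by_group (drop=_type_ _freq_) p99=p99;
run;

proc sql;
    create table want_by_group as
    select a.*
    from have as a
         inner join pct_by_group as b
         on a.drug = b.drug and a.place = b.place
    where a.score >= b.p99;
quit;
```

On groups this small the 99th percentile is just the largest score in the group, so the sketch only becomes meaningful on data at the scale described in the question.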
I have the following datasets:
Date Primary_Occupation Jobs
1/1/2005 Math 23
1/1/2005 Science 7
1/1/2005 Food 10
1/1/2006 Math 10
1/1/2006 Sales 64
1/1/2006 Transportation 21
All the way until 11/1/2015
I am trying to tabulate the percentage of jobs by Primary_Occupation and over time.
I saw that proc univariate has a bunch of percentile options, but none of them seems to be the solution for what I am looking to do.
Here's a template for you to get started. It creates a table with frequencies and percentages. In this example, the output table "summary" contains summary stats for this class of students by sex and age.
proc freq data=sashelp.class;
table sex*age / out=summary;
run;
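For the jobs data specifically, a variation on that template could weight each row by its job count and request row percentages in the output data set. This is a sketch under assumptions: the data set name jobs and the output name job_pct are invented here, and only the six sample rows from the question are used.

```sas
/* Sample rows from the question; jobs is an assumed data set name */
data jobs;
    input date :mmddyy10. primary_occupation :$20. jobs;
    format date mmddyy10.;
    datalines;
1/1/2005 Math 23
1/1/2005 Science 7
1/1/2005 Food 10
1/1/2006 Math 10
1/1/2006 Sales 64
1/1/2006 Transportation 21
;
run;

proc freq data=jobs noprint;
    weight jobs;                       /* each row counts as "jobs" observations */
    tables date*primary_occupation / out=job_pct outpct;
run;
```

With OUTPCT, the output data set carries PCT_ROW alongside COUNT and PERCENT; because date is the row variable, PCT_ROW is the share of that date's jobs held by each occupation.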
In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-sql solution here. This runs in memory, which can cause problems with very large datasets, but you should be ok with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address $ & age; /* & lets address contain single blanks */
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
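If memory ever does become a concern, a sort followed by a DATA step with BY-group processing is another non-SQL route. A minimal sketch, assuming the same have data set; sorted and want_ds are names made up for this illustration.

```sas
data have;
    input name $ id address $ & age;   /* two blanks end the address value */
    datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;

/* Bring identical rows next to each other */
proc sort data=have out=sorted;
    by _all_;
run;

data want_ds;
    set sorted;
    by name id address age;
    if first.age then frequency = 0;   /* a new combination of all four values starts */
    frequency + 1;                     /* sum statement: value is retained across rows */
    if last.age then output;           /* keep one row per combination */
run;
```

The BY statement has to list every variable, with the last one used for the first./last. flags, so with 8 variables the list gets longer but the logic stays the same.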
I would like to replicate the output of PROC MEANS using PROC TABULATE. The reason for this is that I would like to have a profit percentage (or margin) as one of the variables in the PROC MEANS output, but would like to suppress the calculation for one or more of the statistics, i.e. there will be a '-' or similar in the 'margin' row under 'N' and 'SUM'.
Here is the sample data:
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
data want;
set have;
margin = winnings / stake;
run;
proc print data=want; run;
I have been calculating statistics with PROC MEANS (like below), but the value for the SUM statistics for the 'margin' variable means nothing: I would like to suppress this value. I have therefore been attempting to replicate this table using PROC TABULATE to have more control of the output, but have been unsuccessful so far.
proc means data=want N sum mean median stddev min max maxdec=2 order=freq STACKODS;
var stake winnings margin;
run;
proc tabulate data=want;
var stake winnings margin;
table stake * (N Sum mean Median StdDev Min Max);
run;
I would appreciate any help on this.
In principle, you can't create this type of output as a default part of the TABULATE procedure; in essence, you are asking for two different table definitions. Anything you do with the SAS syntax will basically amount to adding more dimensions to the table, but it won't fix your core problem.
You can use this code to get the tables you want, but they're still different tables:
PROC TABULATE DATA=want NOSEPS;
VAR stake winnings margin;
TABLE (stake winnings),(N SUM MEAN MEDIAN STDDEV MIN MAX);
TABLE (margin),(N MEAN MEDIAN STDDEV MIN MAX);
RUN;
There are some guides out there on hacking ODS to do what you want, namely creating "stacked tables" where several child tables are assembled into a single table. Check out here for an example. If you Google "SAS stack tables" you'll find more examples.
I've done this in HTML by creating a new tagset - basically, a special ODS destination that removes spaces between tables, etc. I don't have the code that I used anymore, unfortunately; I moved to R to do automated reporting.
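A lower-tech alternative that stays with PROC MEANS is to capture its stacked output as a data set, blank out the cells that make no sense, and print the result. This is a sketch under assumptions rather than a tested recipe: it relies on the STACKODS output table being named Summary with columns Variable, N, Sum and so on, and the data set name stats is made up here.

```sas
/* A few rows from the question, enough to run the sketch */
data want;
    input username $ betdate :datetime. stake winnings;
    margin = winnings / stake;
    format betdate datetime.;
    datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
;
run;

/* STACKODS gives one row per analysis variable; save that table as a data set */
ods output summary=stats;
proc means data=want n sum mean median stddev min max stackods;
    var stake winnings margin;
run;

/* Blank out the statistics that are meaningless for a ratio */
data stats;
    set stats;
    if lowcase(variable) = 'margin' then call missing(n, sum);
run;

options missing='-';                  /* display missing numerics as a dash */
proc print data=stats noobs;
    format n 8. sum mean median stddev min max 8.2;
run;
options missing='.';                  /* restore the default */
```

The result is a single table with a dash under N and SUM in the margin row, at the cost of leaving the PROC MEANS styling behind.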