I have the following proc tabulate code:
proc tabulate data=want;
class TERM CAMPUS GENDER ;
var count ;
table GENDER ALL, (CAMPUS all)*TERM*(count='#Enrl '*f=best8.*sum=' ' count=''*colpctsum='% Tot Enrl ' ) / rts=20;
run;
My result is as follows:
                  Campus
                  East Campus
                  Term
          Spring 2014          Spring 2015          Difference
          #Enrl   %Tot_Enrl    #Enrl   %Tot_Enrl    #Enrl   %Tot_Enrl
Gender
Female    8462    52.86        8429    52.36        -33     -37.08
Male      7478    46.71        7608    47.26        130     146.07
None      68      0.42         60      0.37         -8      -8.89
All       16008   100.00       16907   100.00       89      100
I need to add a % sign to the values in the '%Tot_Enrl' columns.
Also, can I remove the 'Campus' and 'Term' titles? I currently get both a 'Campus' title and an 'East Campus' title, and I only want to keep the values, not the 'Campus' and 'Term' labels. Is that possible?
You need a few more =' ' label overrides to get rid of Campus and Term. To get the percent sign, you need a picture format; the SAS documentation shows how.
data want;
format term $15. campus $15.;
input term $ & campus $ & gender $ count;
datalines;
Spring 2014  East Campus  Female 8462
Spring 2014  East Campus  Male 7478
Spring 2014  East Campus  None 68
Spring 2015  East Campus  Female 8429
Spring 2015  East Campus  Male 7608
Spring 2015  East Campus  None 60
Difference  East Campus  Female -33
Difference  East Campus  Male 130
Difference  East Campus  None -8
;;;;
run;
proc format;
picture mypct(round) low-high='000,009.90%' (mult=100);
run;
proc tabulate data=want;
class TERM CAMPUS GENDER ;
var count ;
table GENDER ALL, (CAMPUS=' ' all)*TERM=' '*(count='#Enrl '*f=best8.*sum=' ' count=''*colpctsum='% Tot Enrl '*f=mypct. ) / rts=20;
run;
The sort order of the columns ends up wrong here (I'm guessing TERM is a formatted numeric in your real data), but this gets the idea across.
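One thing to watch with the picture format above: picture formats work on the absolute value, so negative percentages such as those in the Difference column would lose their minus sign. A minimal sketch of a variant that keeps the sign is below. The format name mypctneg is made up for illustration, and it assumes the values are on the same 0-100 scale that colpctsum produces.

```sas
proc format;
  /* mypctneg is a hypothetical name; use it as f=mypctneg. in the table statement */
  picture mypctneg (round)
    low -< 0 = '000,009.99%' (prefix='-' mult=100)  /* negative values: add the minus sign back */
    0 - high = '000,009.99%' (mult=100);            /* zero and positive values */
run;
```

The mult=100 is needed because the picture has two decimal places: the value is multiplied by 100 and its digits are then placed into the digit selectors from the right.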
Related
This is some example data. The real data is more complex, with other fields, about 40,000 observations, and up to 180 values per id (I know that I will get 360 columns in the transposed table, but that's OK):
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
I want to transpose it so that I have lastname, followed by alternating firstname and value columns for every row matching that lastname.
data want:
Lastname  Firstname1  Value1  Firstname2  Value2  Firstname3  Value3  Firstname4  Value4
miller    george      47      george      45      henry       44      peter       45
smith     peter       42      frank       46
I tried a bit with proc transpose, but I was not able to build the table exactly as described above. I need the want table in exactly this layout (the real data is more complex and has other fields), so please no answers that propose a want table with a different layout.
proc summary has a very useful option on its output statement for this, idgroup. You need to specify how many values you have per lastname, so I've included a step to calculate the maximum number.
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
/* get frequency count of lastnames */
proc freq data=have noprint order=freq;
table lastname / out=name_freq;
run;
/* store maximum into a macro variable (first record will be the highest) */
data _null_;
set name_freq (obs=1);
call symputx('max_num',count); /* symputx converts the number cleanly and strips blanks */
run;
%put &max_num.;
/* transpose data using proc summary */
proc summary data=have nway;
class lastname;
output out=want (drop=_:)
idgroup(out[&max_num.] (firstname value)=) / autoname;
run;
I want to sum over a specific variable in my dataset without losing all the other columns. I have tried the following code:
proc summary data=work.test nway missing;
class var_1 var_2 ; *groups;
var salary;
id _character_ _numeric_; * keeps all variables;
output out=test2(drop=_:) sum= ;
run;
But it does not seem to sum properly: for the "salary" column I'm just left with the last value in each group (var_1 and var_2). If I remove
id _character_ _numeric_;
it works fine, but I lose all the other columns.
Example:
data:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
desired output:
John Sales 66 M
Mary Acctng 21 F
I think this does what you want. You still get warnings about name conflicts and variables being dropped, but at least the ones you want are kept. The ID statement is deprecated in favor of the newer and better IDGROUP option of the OUTPUT statement.
You could add the AUTONAME option to the output statement if you wanted PROC SUMMARY to automatically rename the conflicting variables.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;;;;
run;
proc print;
run;
proc summary nway missing;
class name dept;
var salary;
output out=test2(drop=_:) sum= idgroup(out(_all_)=);
run;
proc print;
run;
Try this:
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql;
create table salary2 as
select *,
monotonic() as n,
sum(salary) as sum_salary
from salary
group by name
having max(n)=n;
quit;
I wasn't aware that SAS did this, but the problem appears to be that the id statement takes precedence over the var statement. Because every variable is listed in the id statement, the output shows only the maximum value for each variable, including Salary.
One option is to pull a list of the variables not included in the class or var statements from dictionary.columns, then use that list in the id statement. Just be aware that proc summary runs in memory, and I have run into out-of-memory problems in the past when many variables were included in the id statement.
data salary;
input name $ dept $ Salary Sex $;
datalines;
John Sales 23 M
John Sales 43 M
Mary Acctng 21 F
;
proc sql noprint;
select name into :cols separated by ' '
from dictionary.columns
where libname='WORK'
and
memname='SALARY'
and
name not in ('name','Salary');
quit;
%put &cols.;
proc summary data=salary nway missing;
class name;
var salary;
id &cols.;
output out=want (drop=_:) sum=;
run;
In a SAS dataset, some observations appear multiple times. I am trying to add a column with the frequency of each observation and keep each observation only once. I have to do this for a dataset with many rows and around 8 variables.
name   id  address   age
jack   2   chicago   50
peter  4   new york  45
jack   2   chicago   50
This would have to become:
name   id  address   age  frequency
jack   2   chicago   50   2
peter  4   new york  45   1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-SQL solution here. It runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
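If memory does become a concern, the same result can be sketched with a sort followed by BY-group processing in a data step. This is only an illustrative alternative, not part of the answer above; the dataset names have_sorted and want_ds are made up, and it assumes the have dataset created earlier.

```sas
/* sort so that identical rows sit next to each other */
proc sort data=have out=have_sorted;
  by name id address age;
run;

data want_ds;
  set have_sorted;
  by name id address age;
  /* first.age is 1 whenever age or any BY variable listed before it changes */
  if first.age then frequency = 0;
  frequency + 1;            /* sum statement: the counter is retained across rows */
  if last.age then output;  /* keep one row per distinct combination */
run;
```

The last variable in the BY statement is the one to test with first. and last., because its flags change whenever any of the variables listed before it change.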
Following up on the question about outputting the trimmed mean from proc univariate to a table:
SAS: PROC UNIVARIATE: Output trimmed mean to dataset
I would like to output a trimmed mean from proc univariate by group. However, ods output does not seem to work with noprint, and there are just too many group ids for the printed output to be workable. Is there any way around this problem?
proc univariate data = Table1 idout trim=1;
var DaysBtwPay;
by id;
ods output trimmedmeans = trimMean2 (keep = id Mean stdMean);
run;
Doh! It appears you can use ods _all_ close; to suppress the HTML output instead of going to the trouble of writing your own data step routine.
%let trimmed = 1;
proc sort data=sashelp.shoes out=have;
by region;
run;
ods _all_ close;
PROC UNIVARIATE DATA=have trimmed=&trimmed. ;
VAR returns;
by region;
ods output TrimmedMeans=trimmedMeansUni ;
run;
I can't think of any way around this issue other than to write your own data step to calculate the trimmed mean.
This can be done in 2 steps.
Step-1:
In this step we want to know how many observations there are in each by-group, and the simple average of your measurement variable. In the next step, the simple average is returned when calculating the trimmed mean is infeasible. For example, if you want to exclude the 5 most extreme obs at each end but you only have 7 observations in a by-group, then proc univariate returns a missing value.
Notice the order by clause - this ordering is used in excluding extreme obs.
proc sql;
create table inputForTmeans as
select
a.region /*your by-group var*/
,a.returns /*your measurement variable of interest*/
,b.count
,b.simpleAvgReturns
from sashelp.shoes as a
inner join (select region, count(*) as count, mean(returns) as simpleAvgReturns
from sashelp.shoes
group by region) as b
on a.region = b.region
order by
a.region
,a.returns;
quit;
Step-2:
%let trimmed = 1; /*no. of extreme obs to exclude from mean calculation*/
data trimmedMean;
set inputForTmeans;
row_count+1; /*counter variable to number each obs in a by-group*/
by region returns;
if first.region then do;
row_count=1;
returnsSum=.;
end;
if &trimmed.<row_count <=(count - &trimmed.) then returnsSum+returns;
/***************************************************************************/
if last.region then do;
trimmedMeanreturns = coalesce(returnsSum/(count - 2*&trimmed.), simpleAvgReturns) ;
N = row_count;
trimmedRowCount = 2*&trimmed.;
output;
end;
keep region trimmedMeanreturns N count trimmedRowCount ;
/***************************************************************************/
run;
Output:
%let trimmed = 1;
region                       DataStep        ProcUnivariate
Africa                       1183.962963     1183.963
Asia                         662             662
Canada                       3089.1428571    3089.143
Central America/Caribbean    3561.1333333    3561.133
Eastern Europe               2665.137931     2665.138
Middle East                  6794.3181818    6794.318
Pacific                      1538.5813953    1538.581
South America                1824.1153846    1824.115
United States                4462.4210526    4462.421
Western Europe               2538.05         2538.05
%let trimmed = 14;
region                       DataStep        ProcUnivariate
Africa                       897.92857143    897.9
Asia                         778.21428571    .
Canada                       1098.1111111    1098.1
Central America/Caribbean    2289.25         2289.3
Eastern Europe               2559.6666667    2559.7
Middle East                  8620            .
Pacific                      895.88235294    895.9
South America                1538.5769231    1538.6
United States                4010.4166667    4010.4
Western Europe               1968.5882353    1968.6
Output from the datastep for trimmed=1:
Count: No. of rows in the by-group
N: Ignore this column - same as Count
trimmedRowCount: no. of extreme rows excluded. If trimmedRowCount = Count then trimmedMeanreturns is the SimpleAverage
Region                       count   trimmedMeanreturns   N    trimmedRowCount
Africa                       56      1183.963             56   2
Asia                         14      662                  14   2
Canada                       37      3089.143             37   2
Central America/Caribbean    32      3561.133             32   2
Eastern Europe               31      2665.138             31   2
Middle East                  24      6794.318             24   2
Pacific                      45      1538.581             45   2
South America                54      1824.115             54   2
United States                40      4462.421             40   2
Western Europe               62      2538.05              62   2
I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to work out which states make up the top third of all enrollments. So the top third should include a total cumulative percentage of at least .33333...
I have tried various ways of specifying cumulative percentages between .33333 and .40000, but with no success in handling the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA            440233
TEXAS                 318921
NEW YORK              224867
FLORIDA               181517
ILLINOIS              162664
PENNSYLVANIA          155958
OHIO                  141083
MICHIGAN              124051
NEW JERSEY            117131
GEORGIA               104351
NORTH CAROLINA        102466
VIRGINIA              93154
MASSACHUSETTS         80688
INDIANA               75784
WASHINGTON            73764
MISSOURI              73083
MARYLAND              73029
WISCONSIN             72443
TENNESSEE             71702
ARIZONA               69662
MINNESOTA             66470
COLORADO              58274
ALABAMA               54453
LOUISIANA             50344
KENTUCKY              49595
CONNECTICUT           47113
SOUTH CAROLINA        46155
OKLAHOMA              43428
OREGON                42039
IOWA                  38229
UTAH                  36476
KANSAS                36469
MISSISSIPPI           33085
ARKANSAS              32533
NEVADA                27545
NEBRASKA              24571
NEW MEXICO            22485
WEST VIRGINIA         21149
IDAHO                 20596
NEW HAMPSHIRE         19121
MAINE                 18213
HAWAII                16304
RHODE ISLAND          13802
DELAWARE              12025
MONTANA               11661
SOUTH DAKOTA          11111
VERMONT               10082
ALASKA                9770
NORTH DAKOTA          9614
WYOMING               7457
DIST OF COLUMBIA      6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value of cum_percent, the states that make up the top third of enrollment are California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to create a flag variable for selecting states.
Thanks...
You can easily compute the percentages using PROC FREQ with a WEIGHT statement and then select the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
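Note that order=data keeps the rows in their input order, so the LAG approach depends on the states already being listed from highest to lowest enrollment. A sketch that makes this explicit by sorting first is below. It assumes the state_counts dataset from the question; the dataset names state_sorted, state_pct and top3rd_sorted and the variable prev_cum are made up for illustration.

```sas
/* make the descending order explicit instead of relying on input order */
proc sort data=state_counts out=state_sorted;
  by descending enrollment;
run;

proc freq data=state_sorted noprint order=data;
  tables state / out=state_pct;   /* output contains state, COUNT and PERCENT (0-100 scale) */
  weight enrollment;
run;

data top3rd_sorted;
  set state_pct;
  cum_percent + percent;
  prev_cum = lag(cum_percent);    /* cumulative percent before this state; missing on row 1 */
  /* flag states until the running total has reached one third;  */
  /* a missing prev_cum compares as less than any number, so row 1 is flagged */
  top_third = (prev_cum < 100/3);
  drop prev_cum;
run;
```

Here top_third is 1 for the flagged states and 0 for the rest, rather than 1 and missing.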
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. first. and last. variables only exist for variables named in a BY statement. You still get the right values in CUM_PERCENT because of the cum_percent + percent_total; sum statement.
I am not aware of a PROC that will do this for you.