proc surveyselect alloc option reads my allocation dataset wrongly? - sas

Ok, so I have a dataset that I have to sample based on another dataset's proportions, and I already have an allocation dataset with 2 columns: strata and alloc. When I run the following code:
proc surveyselect data=have out=want outall method = srs sampsize=10000 seed=1994;
strata strata/alloc = alloc;
id name;
run;
I get this error:
ERROR: The sum of the _ALLOC_ proportions in the data set ALLOC must equal 1.
I checked my allocation dataset and the proportions do sum to 1. I'm not sure whether the issue is with my dataset or my code. I've already sorted both the have dataset and the allocation dataset by strata. I've used the same (or a similar) script to randomly sample from many different datasets before, so I'm not sure why it isn't working for this one.
Any ideas? Thanks!
Edit: For more info, I'm using SAS Enterprise Guide 7.1.
For reference, the alloc table is as follows (I can't give real strata names, but I've checked and they are identical to the strata in my have dataset):
_alloc_ | strata
0.3636363636 | strata1
0.0909090909 | strata2
0.0909090909 | strata3
0.0909090909 | strata4
0.1818181818 | strata5
0.0909090909 | strata6
0.0909090909 | strata7
I am also perplexed. As I mentioned, this code worked on other datasets, just not this one. In case it is relevant, I created the alloc dataset in R and imported it into SAS.

Do a distinct count on all the strata values in your dataset:
proc sql noprint;
create table check as
select distinct strata
from have
;
quit;
If there are any extra groups that do not exist in the alloc dataset, or vice versa, your error message will appear. In the example code below, alloc has 7 strata but have has 6 strata.
data alloc;
infile datalines dlm='|';
input _alloc_ strata$;
datalines;
0.3636363636 | strata1
0.0909090909 | strata2
0.0909090909 | strata3
0.0909090909 | strata4
0.1818181818 | strata5
0.0909090909 | strata6
0.0909090909 | strata7
;
run;
/* Only have 6 strata instead of 7 in the data */
data have;
do strata = 'strata1', 'strata2', 'strata3', 'strata4', 'strata5', 'strata6';
do i = 1 to 100;
name = 'name';
output;
end;
end;
run;
proc surveyselect data=have
out=want
outall
method = srs
sampsize=10
seed=1994
;
strata strata / alloc = alloc;
id name;
run;
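To see exactly which strata differ, the two lists can be compared directly. This is only a sketch: it assumes the datasets are named have and alloc as above, and the output table names in_have_only and in_alloc_only are made up for illustration. If the allocation table was built outside SAS, differences in case or leading blanks would also show up as mismatches here.

```sas
proc sql;
  /* Strata that occur in HAVE but have no row in ALLOC */
  create table in_have_only as
  select distinct strata from have
  except
  select distinct strata from alloc;

  /* Strata that occur in ALLOC but never in HAVE */
  create table in_alloc_only as
  select distinct strata from alloc
  except
  select distinct strata from have;

  /* Sanity check: the proportions themselves should total 1 */
  select sum(_alloc_) as total_alloc format=12.10
  from alloc;
quit;
```

If both tables come back empty and the total is 1, the mismatch explanation does not apply and the cause lies elsewhere.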

Related

Value labels to be created using data from another data set

I have two data sets. The first data set has airport codes (JFK, LGA, EWR) in a variable 'airport'. The second data set lists all major airports in the world. It has two variables: 'faa', holding the FAA code (like JFK, LGA, EWR), and 'name', holding the actual name of the airport (John F. Kennedy, LaGuardia etc.).
My requirement is to create value labels in the first data set, so that the actual name of the airport is shown instead of the airport code. I know I can use custom formats to achieve this. But can I write SAS code that reads the unique airport codes, gets the names from the other data set and creates the value labels automatically?
PS: Otherwise, the only option I see is to use MS Excel to get the unique list of FAA codes in dataset 1, use VLOOKUP to get the names of the airports, and then create one custom format by listing each unique FAA code and its airport name.
I think "value label" is SPSS terminology. Looks like you want to create a format. Just use your lookup table to create an input dataset for PROC FORMAT.
So if your second table looks like this:
data table2;
length FAA $4 Name $40 ;
input FAA Name $40. ;
cards;
JFK John F. Kennedy (NYC)
LGA Laguardia (NYC)
EWR Newark (NJ)
;
You can use this code to convert it into a dataset that PROC FORMAT can use to create a format.
data fmt ;
fmtname='$FAA';
hlo=' ';
set table2 (rename=(faa=start name=label));
run;
proc format cntlin=fmt lib=work.formats;
run;
Now you can use that format with your other data.
proc freq data=table1 ;
tables airport ;
format airport $faa. ;
run;
First, consider whether a format is really what you need. For example, you could simply do a left join to retrieve the airport name from table2 (the FAA-Name table).
Anyway, I believe the following macro does the trick:
Create auxiliary tables:
data have1;
input airport $;
datalines;
a
d
e
;
run;
data have2;
input faa $ name $;
datalines;
a aaaa
b bbbb
c cccc
d dddd
;
run;
Macro to create Format:
%macro create_format;
*count number of faa;
proc sql noprint;
select count(distinct faa) into :n
from have2;
quit;
*create macro variables for each faa and name;
proc sql noprint;
select faa, name
into:faa1-:faa%left(&n),:name1-:name%left(&n)
from have2;
quit;
*create format;
proc format;
value $airport
%do i=1 %to &n;
"&&faa&i" = "&&name&i"
%end;
other = "Unknown FAA code";
run;
%mend create_format;
%create_format;
Apply format:
data want;
set have1;
format airport $airport.;
run;
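For comparison, the left join mentioned at the start of this answer could look like the sketch below. It assumes the have1 and have2 tables created above; the output name want_join is made up for illustration.

```sas
proc sql;
  create table want_join as
  select h.airport,
         l.name              /* missing when the code is not in the lookup table */
  from have1 as h
  left join have2 as l
    on h.airport = l.faa;
quit;
```

With the sample data, airport e has no match in have2, so its name is left missing rather than labelled "Unknown FAA code".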

Average number of rows per variable in SAS

I have the following dataset :
data test;
input business_ID $;
datalines;
'busi1'
'busi1'
'busi1'
'busi2'
'busi3'
'busi3'
;
run;
proc freq data = test ;
table business_ID;
run;
I would like the average number of lines per business, that is, the total number of observations divided by the number of distinct businesses.
In my example: 6 observations, 3 businesses -> 6/3 = 2 lines per business.
I was thinking about using a proc freq or a proc means step, but so far I only get the number of lines (the frequency) per business and do not know how to get to my goal.
Any idea?
You could use PROC FREQ to get the counts and then run PROC MEANS on the output.
proc freq data=test ;
tables business_id / noprint out=counts ;
run;
proc means data=counts;
var count;
run;
Or you could count them directly with PROC SQL code.
proc sql ;
select count(*)/count(distinct business_id) as mean_count
from test
;
quit;

SAS-use of lead function

Suppose the dataset has three columns
Date Region Price
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
I try to get the lead price by date and region through following code.
data want;
set have;
by _ric date_l_;
do until (eof);
set have(firstobs=2 keep=price rename=(price=lagprice)) end=eof;
end;
if last.date_l_ then call missing(lagprice);
run;
However, WANT ends up with only one observation. I then created new_date=date and tried another approach:
data want;
set have nobs=nobs;
do _i = _n_ to nobs until (new_date ne Date);
if eof1=0 then
set have (firstobs=2 keep=price rename=(price=leadprice)) end=eof1;
else leadprice=.;
end;
run;
This code runs very slowly, so I suspect it is not the right approach either. Could anyone give some suggestions? Thanks
Try sorting by the variables you want the lead price within, then read the dataset twice in the same data step:
data test;
length Date Region $12 Price 8 ;
input Date $ Region $ Price ;
datalines;
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
;
run;
** sort by vars you want lead price for **;
proc sort data = test;
by DATE REGION;
run;
** set together twice -- once for lead price and once for all variables **;
data lead_price;
set test;
by DATE REGION;
set test (firstobs = 2 keep = PRICE rename = (PRICE = LEAD_PRICE))
test (obs = 1 drop = _ALL_);
if last.DATE or last.REGION then do;
LEAD_PRICE = .;
end;
run;
You can use proc expand to generate leads on numeric variables by group. Try the following method instead:
Step 1: Sort by Region, Date
proc sort data=have;
by Region Date;
run;
Step 2: Create a new ID variable to denote observation numbers
Because you have multiple values per date per region, we need to generate a new ID variable so that proc expand uses lead by observation number rather than by date.
data have2;
set have;
_ID_ = _N_;
run;
Step 3: Run proc expand by region with the lead transformation
lead will do exactly as it sounds. You can lead by as many values as you'd like, as long as the data supports it. In this case, we are leading by one observation.
proc expand data=have2
out=want;
by Region;
id _ID_;
convert Price = Lead_Price / transform=(lead 1) ;
run;

creating output from proc freq in SAS

I am running the following SAS code in SAS Enterprise Guide 6.1 to get some summary stats on null/not null for all the variables in a table. This is producing the desired info via the 'results' tab, which creates a separate table for each result showing null/not null frequencies and percentages.
What I'd like to do is put the results into an output dataset with all the variables and stats in a single table.
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt . ='Missing' other='Not Missing';
run;
proc freq data=mydatatable;
format _CHAR_ $missfmt.;
tables _CHAR_ / out=work.out1 missing missprint nocum;
format _NUMERIC_ missfmt.;
tables _NUMERIC_ / out=work.out2 missing missprint nocum;
run;
out1 and out2 are being generated into tables like this:
FieldName | Count | Percent
Not Missing | Not Missing | Not Missing
But are only populated with one variable each, and the frequency counts are not being shown.
The table I'm trying to create as output would be:
field | Missing | Not Missing | % Missing
FieldName1 | 100 | 100 | 50
FieldName2 | 3 | 97 | 3
The output options on a TABLES statement apply only to the last table requested. _CHAR_ expands to all character variables, but each variable is its own one-way table, so only the last one is written to the output dataset.
You can get what you want in one of two ways: either use PROC TABULATE, which deals more readily with lists of variables, or use ODS OUTPUT to capture the PROC FREQ output. Either way, the result will likely need some extra work to get into exactly the structure you want.
ods output onewayfreqs=myfreqs; *use `ODS TRACE` to find this name if you do not know it;
proc freq data=sashelp.class;
tables _character_;
tables _numeric_;
run;
ods output close;
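As a rough sketch of that extra work, the captured table can be stacked and transposed into one row per variable. This assumes the $missfmt and missfmt formats from the question already exist, that the OneWayFreqs table stores each formatted level in a character column named F_ followed by the variable name, and that its Table column holds text like "Table Age". The names long and wide are made up for illustration.

```sas
ods output onewayfreqs=myfreqs;
proc freq data=mydatatable;
  format _character_ $missfmt.;
  format _numeric_ missfmt.;
  tables _all_ / missing nocum;
run;

/* Stack the per-variable results into one long table */
data long;
  length field $32 status $11;
  set myfreqs;
  field  = scan(Table, 2, ' ');       /* variable name taken from "Table <name>" */
  status = strip(coalescec(of F_:));  /* formatted level: Missing or Not Missing */
  keep field status Frequency;
run;

/* One row per variable, one column per status */
proc transpose data=long out=wide;
  by field notsorted;
  id status;      /* column naming for "Not Missing" depends on VALIDVARNAME */
  var Frequency;
run;
```

A variable with no missing values produces no Missing row, so that cell comes out missing rather than 0. The percentage column can then be computed from the two counts in a further step.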

Rolling up data in SAS

Here is my data :
data example;
input id sports_name :$16.;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I used the proc freq procedure to get the cross-tabulated frequency table, put that in a different data set and then transposed that data. Now I have missing values in some cases and the count of the sport in the rest.
Is there any better way to do this?
Thanks!!
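You need a way to create something from nothing, that is, rows for the id/sport combinations that never occur in the data; the COMPLETETYPES option in PROC SUMMARY does that. You could also have used the SPARSE option in PROC FREQ. Keep in mind that SAS variable names cannot be longer than 32 characters.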
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
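If an id can have more than one row for the same sport, _freq_ will hold a count rather than a strict 1/0 flag. A small sketch of converting it before transposing, assuming the freq dataset produced above (the names flags, present and wide_flags are made up for illustration):

```sas
data flags;
  set freq;
  present = (_freq_ > 0);   /* 1 if the id has at least one row for the sport, else 0 */
run;

proc transpose data=flags out=wide_flags(drop=_name_);
  by id;                    /* freq is already ordered by id and sports_name */
  var present;
  id sports_name;
run;
```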
Same idea here: generate a list of all possible combinations using SQL instead of PROC SUMMARY, and then transpose the result.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
left join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name
order by a.id, a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;