Combine information into a department vector - sas

I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 0011
The positions in dept_vect are ordered alphabetically by department. So a '1' in the first position means that the id is found in department A, and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in that department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works, but there are a couple of problems. For col4 to appear, at least one id has to be found in all departments, although that could be fixed by creating a dummy id that is found in all departments. The main problem is that this code is not robust. Is there a way to code this so that it would work for any number of departments?

Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=have out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
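One caveat with the PROC TRANSPOSE approach above: the new columns are created in the order in which the dept values are first encountered (A, D, B, C for this data), so the concatenated string is not necessarily in alphabetical order. The following is a minimal sketch of one way to force the order, assuming the department values are valid SAS name suffixes. The macro variable deptlist and the dataset names wide_sorted and want_sorted are illustrative only, not part of the original answer.

```sas
data have;
  input id dept $;
  count = 1;
  datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;

proc transpose data=have out=wide_sorted(drop=_name_) prefix=dept;
  by id;
  id dept;
  var count;
run;

/* Build an alphabetically ordered list of the transposed variable names. */
proc sql noprint;
  select distinct cats('dept', dept)
    into :deptlist separated by ' '
    from have
    order by 1;
quit;

data want_sorted;
  set wide_sorted;
  /* The array follows the order of the list, not the column order in wide_sorted. */
  array _d(*) &deptlist;
  do i = 1 to dim(_d);
    if missing(_d(i)) then _d(i) = 0;
  end;
  dept_vect = cats(of _d(*));
  keep id dept_vect;
run;
```

With this ordering, id 2 (department C only) gives 0010 and id 5 (departments C and D) gives 0011.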

Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;

Related

(SAS) how to name the column with the value of other variable

Given a dataset 'temp' that looks like this:
index code1 code2 code3
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
I want to make a new dataset like this:
index P1 P2 P3 P4 n1
A 1 1 1 0 0
B 1 0 1 1 0
C 0 1 0 1 1
My code is here...
%macro freq;
%do i = 1 %to 3;
%do j = 1 %to 5;
if substr(code&i.,1,1) = "P" then
if input(substr(code&i.,2,1),1.) = &j. then p&j. = 1;
if substr(code&i.,1,1) = "N" then
if input(substr(code&i.,2,1),1.) = &j. then n&j. = 1;
%end;
%end;
%mend;
But it's not cool :(
How can I create a new column whose name is the value of variables(code1, code2,...)?
Is there any other simple way?
How about
data have;
input (index code1 code2 code3)($);
datalines;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data temp;
set have;
array c code:;
do over c;
v = c;
d = 1;
output;
end;
run;
proc transpose data = temp out = want(drop = _:);
by index;
id v;
var d;
run;
You can achieve this without a macro by using ARRAY and the VNAME function in a DATA step.
data want;
set have;
/* Initialize flag variables. */
length P1-P4 3 N1 3;
/* Define arrays. */
array code [*] code1-code3;
array flags [*] P1-P4 N1;
/* Loop over the arrays. */
do i = 1 to dim(flags);
flags[i] = 0;
do j = 1 to dim(code);
if vname(flags[i]) = code[j] then flags[i] = 1;
end;
end;
keep index P1-P4 N1;
run;
The simplest way to convert values into variable names is via PROC TRANSPOSE. So first convert your wide dataset into a tall dataset. You could use PROC TRANSPOSE to do that, but to make your target dataset PROC TRANSPOSE will need some numeric variable to transpose. So why not use a data step to make the tall dataset and include a numeric variable that is set to 1.
The PROC TRANSPOSE step will give you a dataset with either a 1 or a missing value for the new variables. You can use PROC STDIZE to change the missing values into zeros.
data have;
input index $ (code1-code3) (:$32.) ;
cards;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data tall;
set have ;
array code code1-code3;
length _name_ $32 dummy 8;
retain dummy 1;
do column=1 to dim(code);
_name_=code[column];
if not missing(_name_) then output;
end;
run;
proc transpose data=tall out=want(drop=_name_);
by index ;
id _name_;
var dummy;
run;
proc stdize reponly missing=0 data=want ;
var _numeric_;
run;
One more alternative:
proc transpose data=have out=long;
by index;
var code:;
run;
data long2;
set long;
value = 1;
run;
proc transpose data=long2 out=wide;
by index;
id col1;
var value;
run;
/* Convert missing to zeroes */
data want;
set wide;
array vars _NUMERIC_;
do over vars;
if(vars = .) then vars = 0;
end;
drop _NAME_;
run;
Output:
index P1 P2 P3 P4 N1
A 1 1 1 0 0
B 1 0 1 1 0
C 0 1 0 1 1

How do I select the first 5 observations with regard to duplicates? [closed]

I have a large dataset containing over 80,000,000 rows sorted by "name" and "income" (with duplicates for both name and income). For the first name I would like to have the 5 lowest incomes. For the second name I would also like the 5 lowest incomes, but incomes already drawn for the first name are disqualified from selection. And so on, until the last name (if there are any incomes left at that point).
You first want to rank income within names. So:
proc rank data=yourdata out=temp ties=low;
by name;
var income;
ranks incomerank;
run;
Then you want to filter the 5 lowest incomes by name, so:
proc sql;
create table want as
select distinct *
from temp
where incomerank < 6;
quit;
You will need to sort and track incomes.
Use an array to sort and track the lowest five incomes of a name.
Use a hash to track whether an income has already been output, which makes it ineligible for output under later names.
Example:
An insertion sort of the eligible low-valued incomes is used; it will be fast because only 5 items are involved.
data have;
call streaminit(1234);
do name = 1 to 1e6;
do seq = 1 to rand('integer', 20);
income = rand('integer', 20000, 1000000);
output;
end;
end;
run;
data
want (label='Lowest 5 incomes (first occurring over all names) of each name')
want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
array X(5) _temporary_;
if _n_ = 1 then do;
if 0 then set have;
declare hash incomes();
incomes.defineKey('income');
incomes.defineDone();
end;
_maxmin5 = 1e15;
x(1) = 1e15;
x(2) = 1e15;
x(3) = 1e15;
x(4) = 1e15;
x(5) = 1e15;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if incomes.check() = 0 then continue;
* insert sort - lowest five not observed previously;
if income > _maxmin5 then continue;
do _i_ = 1 to 5;
if income < x(_i_) then do;
do _j_ = 5 to _i_+1 by -1;
x(_j_) = x(_j_-1);
end;
x(_i_) = income;
_maxmin5 = x(5);
incomes.add();
leave;
end;
end;
end;
_outflag = 0;
do _n_ = 1 to _n_;
set have;
if income in x then do;
_outflag = 1;
OUTPUT want;
end;
end;
if not _outflag then
OUTPUT want_barren;
drop _:;
run;
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
address = cats('Address_', _N_);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedata(all : 'y');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Here is my interpretation of your problem and a solution.
Suppose a simplified version of your data looks like this and you want the 2 lowest incomes for each name. For simplicity, I use a numeric variable n as the name, but a character variable will work as well.
data have;
input n income;
datalines;
1 100
1 200
1 300
2 400
2 100
2 500
3 600
3 200
3 500
;
From this data, my guess is that your logic goes like this:
Start with n = 1.
Output the 2 observations with the lowest income (100 and 200).
Go to the next name (n = 2).
Output the 2 observations with the lowest income that have not already been output (400 and 500). 100 has already been output in the n = 1 group.
...And so on...
This gives the desired result below:
data want;
input n income;
datalines;
1 100
1 200
2 400
2 500
3 600
;
Try out the solution below and verify that you get the result as posted above.
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 2 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Finally, let us create representative example data with 80 million observations. I change the if c = 2 then leave; statement to if c = 5 then leave; to go back to your actual problem.
The code below runs in about 45 seconds on my system and processes the data in a single pass. Let me know if it works for you :-)
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;

SAS for following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution, which is very clumsy. I simply sort copies of the data set D1 three times (by ID and ATR1, for example) and remove duplicates. I then merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;

Sum Vertically for a By Condition

I checked out this previous post (LINK) for a potential solution, but it is still not working. I want to sum across rows using the ID as the common identifier. The num variable is constant within each id. The id and comp variables are the two I want to use to create a pct variable, which = sum of [comp = 1] / num.
Have:
id Comp Num
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
Want:
id tot pct
1 2 100
2 3 0.666666667
3 1 100
Currently have:
proc sort data=have;
by id;
run;
data want;
retain tot 0;
set have;
by id;
if first.id then do;
tot = 0;
end;
if comp in (1) then tot + 1;
else tot + 0;
if last.id;
pct = tot / num;
keep id tot pct;
output;
run;
I use SQL for things like this. You can do it in a Data Step, but the SQL is more compact.
data have;
input id Comp Num;
datalines;
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
;
run;
proc sql noprint;
create table want as
select id,
sum(comp) as tot,
sum(comp)/count(id) as pct
from have
group by id;
quit;
Hi, there is a much more elegant solution to your problem :)
proc sort data = have;
by id;
run;
data want;
do _n_ = 1 by 1 until (last.id);
set have ;
by id ;
tot = sum (tot, comp) ;
end ;
pct = tot / num ;
run;
I hope it is clear. I use SQL too because I am new and the DOW loop is rather complicated, but in your case it's pretty straightforward.

Using SAS, is it possible to get a frequency table where no data exist?

This is a follow-up to my previous post on SO.
I am trying to produce a frequency table of demographics, including race, sex, and ethnicity. One table is a crosstab of race by sex for Hispanic participants in a study. However, there are no Hispanic participants thus far. So, the table will be all zeroes, but we still have to report it.
This can be done in R, but so far, I have found no solution for SAS. Example data is below.
data race;
input race eth sex ;
cards;
1 2 1
1 2 1
1 2 2
2 2 1
2 2 2
2 2 1
3 2 2
3 2 2
3 2 1
4 2 2
4 2 1
4 2 2
;
run;
data class;
do race = 1,2,3,4,5,6,7;
do eth = 1,2,3;
do sex = 1,2;
output;
end;
end;
end;
run;
proc format;
value frace 1 = "American Indian / AK Native"
2 = "Asian"
3 = "Black or African American"
4 = "Native Hawaiian or Other PI"
5 = "White"
6 = "More than one race"
7 = "Unknown or not reported" ;
value feth 1 = "Hispanic or Latino"
2 = "Not Hispanic or Latino"
3 = "Unknown or Not reported" ;
value fsex 1 = "Male"
2 = "Female" ;
run;
***** ethnicity by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table eth, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex, for Hispanic only ;
***** log indicates that a logical page with only missing values has been deleted ;
***** Thanks SAS, you're a big help... ;
proc tabulate data = race missing classdata=class ;
where eth = 1 ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
I understand that the code really can't work because I'm selecting where eth is equal to 1 (there are no cases satisfying the condition...). Specifying the command to be run by eth doesn't work either.
Any guidance is greatly appreciated...
I think the easiest way is to create a row in the data that has the missing value. You could look at the following paper for suggestions as to how to do this on a larger scale:
http://www.nesug.org/Proceedings/nesug11/pf/pf02.pdf
PROC FREQ has the SPARSE option, which gives you all possible combinations of all variables in the table (including missing ones), but it doesn't look like that gives you exactly what you need.
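To illustrate what SPARSE does and does not cover, here is a minimal sketch on a made-up dataset (demo and counts are illustrative names, not from the question). SPARSE adds zero-count rows only for combinations of levels that each occur somewhere in the data, so it cannot produce rows for a level that never appears at all.

```sas
data demo;
  input race sex;
  datalines;
1 1
1 2
2 1
;
run;

/* SPARSE writes every combination of the observed levels to OUT=, */
/* including race=2, sex=2, which has a count of 0.                */
proc freq data=demo noprint;
  tables race*sex / sparse out=counts(keep=race sex count);
run;

proc print data=counts noobs;
run;
```

Here counts has four rows, one of them with count 0. A race level that is absent from demo (3, say) would still be missing from the output.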
Looks like our good friends at Westat have worked with this issue. A description of their solution is shown here.
The code is shown below for convenience, but please cite the original when referencing it.
PROC FORMAT;
value ethnicf
1 = 'Hispanic or Latino'
2 = 'Not Hispanic or Latino'
3 = 'Unknown (Individuals Not Reporting Ethnicity)';
value racef
1 = 'American Indian or Alaska Native'
2 = 'Asian'
3 = 'Native Hawaiian or Other Pacific Islander'
4 = 'Black or African American'
5 = 'White'
6 = 'More Than One Race'
7 = 'Unknown or Not Reported';
value gndrf
1 = 'Male'
2 = 'Female'
3 = 'Unknown or Not Reported';
RUN;
DATA shelldata;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
do ethcat = 1 to 2;
do ethlbl = 1 to 3;
do racelbl = 1 to 7;
do gender = 1 to 3;
output;
end;
end;
end;
end;
RUN;
DATA test;
input pt $ 1-3 ethlbl gender racelbl ;
cards;
x1 2 1 5
x2 2 1 5
x3 2 1 5
x4 2 1 5
x5 2 1 5
x6 2 2 2
x7 2 2 2
x8 2 2 5
x9 2 2 4
x10 2 2 4
;
RUN;
DATA enroll;
set test;
if ethlbl = 1 then ethcat = 1;
else ethcat = 2;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
label ethlbl = 'Ethnic Category'
racelbl = 'Racial Categories'
gender = 'Sex/Gender';
RUN;
%MACRO TAB_WHERE;
/* PROC SQL step creates a macro variable whose */
/* value will be the number of observations */
/* meeting WHERE clause criteria. */
PROC SQL noprint;
select count(*)
into :numobs
from enroll
where ethcat=1;
QUIT;
/* PROC FORMAT step to display all numeric values as zero. */
PROC FORMAT;
value allzero low-high=' 0';
RUN;
/* Conditionally execute steps when no observations met criteria. */
%if &numobs=0 %then
%do;
%let fmt = allzero.; /* Print all cell values as zeroes */
%let str = ; /*No Cases in Subset - WHERE cannot be used */
%end;
%else
%do;
%let fmt = 8.0;
%let str = where ethcat = 1;
%end;
PROC TABULATE data=enroll classdata=shelldata missing format=&fmt;
&str;
format racelbl racef. gender gndrf.;
class racelbl gender;
classlev racelbl gender;
keyword n pctn all;
tables (racelbl all='Racial Categories: Total of Hispanic or Latinos'),
gender='Sex/Gender'*N=' ' all='Total'*n='' / printmiss misstext='0'
box=[LABEL=' '];
title1 font=arial color=darkblue h=1.5 'Inclusion Enrollment Report';
title2 ' ';
title3 font=arial color=darkblue h=1' PART B. HISPANIC ENROLLMENT REPORT:
Number of Hispanic or Latinos Enrolled to Date (Cumulative)';
RUN;
%MEND TAB_WHERE;
%TAB_WHERE
I found this paper to be very informative:
Oh No, a Zero Row: 5 Ways to Summarize Absolutely Nothing
The preloadfmt option in proc means (Method 5) is my favorite. Once you create the necessary formats it's not necessary to add dummy data. It's odd that they haven't yet added this option to proc freq.
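As a rough sketch of that approach (not the paper's exact code), PRELOADFMT on the CLASS statement combined with COMPLETETYPES makes PROC SUMMARY emit every formatted level, with a frequency of 0 for levels absent from the data. The small race dataset and the output names counts and n are assumptions made for illustration. Whether this also covers a WHERE subset that leaves no observations at all is worth testing on your SAS release.

```sas
proc format;
  value frace 1 = "American Indian / AK Native"
              2 = "Asian"
              3 = "Black or African American"
              4 = "Native Hawaiian or Other PI"
              5 = "White"
              6 = "More than one race"
              7 = "Unknown or not reported";
  value fsex  1 = "Male"
              2 = "Female";
run;

data race;
  input race eth sex;
  datalines;
1 2 1
2 2 2
4 2 1
;
run;

/* COMPLETETYPES + PRELOADFMT: one output row per race*sex combination */
/* defined in the formats, even if it never occurs in the data.        */
proc summary data=race nway completetypes;
  class race sex / preloadfmt;
  format race frace. sex fsex.;
  output out=counts(drop=_type_ rename=(_freq_=n));
run;

proc print data=counts noobs;
run;
```

The output should contain 7 x 2 = 14 rows, three of them with n = 1 and the rest with n = 0.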