I have a dataset that I want to tranpose from long to wide. I have:
**ID **Question** Answer**
1 Referral to a
1 Referral to b
1 Referral to d
2 Referral to a
2 Referral to c
4 Referral to a
6 Referral to a
6 Referral to c
6 Referral to d
What I want the tranposed dataset to look like:
**ID **Referral to**
1 a, b, d
2 a, c
4 a
6 a, c, d
I've tried to transpose the data, but the resulting dataset only contains 1 of the responses from the answer column, not all of them.
Code I've been using:
proc transpose data=test out=test2 let;
by ID;
id Question;
var Answer; run;
The dataset has hundreds of thousands of rows with dozens of variables that are exactly the same as the 'Referral to' example. How can make it so the tranposed wide dataset contains all of the answers to the Question in the same cell and not just one? Any help would be appreciated.
Thank you.
Here's two methods you can use in this case.
The first uses a data step approach, which is a single step. The second is more dynamic and uses a PROC TRANSPOSE + CATX() after the fact to create the combined variable. Note the use of PREFIX option in the transpose procedure to make the variables easier to identify and concatenate.
*create sample data for demonstration;
data have;
infile cards dlm='09'x;
input OrgID Product $ States $;
cards;
1 football DC
1 football VA
1 football MD
2 football CA
3 football NV
3 football CA
;
run;
*Sort - required for both options;
proc sort data=have;
by orgID;
run;
**********************************************************************;
*Use RETAIN and BY group processing to combine the information;
**********************************************************************;
data want_option1;
set have;
by orgID;
length combined $100.;
retain combined;
if first.orgID then
combined=states;
else
combined=catx(', ', combined, states);
if last.orgID then
output;
run;
**********************************************************************;
*Transpose it to a wide format and then combine into a single field;
**********************************************************************;
proc transpose data=have out=wide prefix=state_;
by orgID;
var states;
run;
data want_option2;
set wide;
length combined $100.;
combined=catx(', ', of state_:);
run;
Related
I have a sas datebase with something like this:
id birthday Date1 Date2
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
And I want the data in this form:
id Date Datetype
1 12/4/01 birthday
1 12/4/13 1
1 12/3/14 2
2 12/3/01 birthday
2 12/6/13 1
2 12/2/14 2
3 12/9/01 birthday
3 12/4/03 1
3 12/9/14 2
4 12/8/13 birthday
4 12/3/14 1
4 12/10/16 2
Thanks by ur help, i'm on my second week using sas <3
Edit: thanks by remain me that i was not finding a sorting method.
Good day. The following should be what you are after. I did not come up with an easy way to rename the columns as they are not in beginning data.
/*Data generation for ease of testing*/
data begin;
input id birthday $ Date1 $ Date2 $;
cards;
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
; run;
/*The trick here is to use date: The colon means everything beginning with date, comparae with sql 'date%'*/
proc transpose data= begin out=trans;
by id;
var birthday date: ;
run;
/*Cleanup. Renaming the columns as you wanted.*/
data trans;
set trans;
rename _NAME_= Datetype COL1= Date;
run;
See more from Kent University site
Two steps
Pivot the data using Proc TRANSPOSE.
Change the names of the output columns and their labels with PROC DATASETS
Sample code
proc transpose
data=have
out=want
( keep=id _label_ col1)
;
by id;
var birthday date1 date2;
label birthday='birthday' date1='1' date2='2' ; * Trick to force values seen in pivot;
run;
proc datasets noprint lib=work;
modify want;
rename
_label_ = Datetype
col1 = Date
;
label
Datetype = 'Datetype'
;
run;
The column order in the TRANSPOSE output table is:
id variables
copy variables
_name_ and _label_
data based column names
The sample 'want' shows the data named columns before the _label_ / _name_ columns. The only way to change the underlying column order is to rewrite the data set. You can change how that order is perceived when viewed is by using an additional data view, or an output Proc that allows you to specify the specific order desired.
I have a dataset as following
AGE GENDER
11 F
12 M
13
15
now I want to create a dataset as following
Basically I want to have the variable names in another column.
or may be in one column like
VAR Value
AGE 11
AGE 12
AGE 13
AGE 15
GENDER F
GENDER M
I have tried normal proc transpose, but looks like it doesnt give the desired result.
This is not a strictly speaking a transpose. Transpose implies that you want to transform some columns into rows or vice-versa, which is not the case here. That sample data transposed would look like:
VAR VALUE1 VALUE2 VALUE3 VALUE4
----------------------------------
AGE 11 12 13 14
GENDER F M
What you're trying to do here instead is have all your variables in the same column and add a 'label' column.
You could have your desired result with a data step:
data have;
infile datalines missover
;
input age $ gender $;
datalines;
11 F
12 M
13
15
;
run;
data want;
length var $6;
set have(keep=age rename=(age=value) in=a)
have(keep=gender rename=(gender=value) where=(value is not missing) in=b);
if b then var='GENDER';
else if a then var='AGE';
run;
Note the where= dataset option on the second part of the set statement since your desired result does not include the missing values that you have for gender in your sample data.
Alternatively, you could do it with two proc transpose:
proc transpose data=have out=temp name=VAR;
var age gender;
run;
proc transpose data=temp out=want(drop=_name_ rename=(col1=VALUE) where=(VALUE is not missing));
var col1 col2 col3 col4;
by var;
run;
One solution is to introduce a new unique row identifier and use that in a BY statement. This will let TRANSPOSE pivot the data values in each row.
data have;
rownum + 1; * new variable for pivoting by row via BY statement;
input AGE GENDER $;
datalines;
11 F
12 M
13 .
15 .
run;
proc transpose data=have out=want(drop=_name_ rename=(col1=value) where=(value ne ''));
by rownum;
var age gender;
run;
In Proc TRANPOSE the default new column names are prefixed with COL and indexed by the number of occurrences of a value 1..n in the incoming rows. The artificial rownum and BY statement ensure the pivoted data has only one data column. Note: the prefix can be specified with option PREFIX=, and additionally the pivoted data column names can come from the data itself if you use the ID statement.
Mixed data types can be a problem because the new column will use character representation of underlying data values. So dates will come out as numbers and numeric that were initially formatted will lose their format.
If you are trying to make a JSON transmission I would recommend researching the JSON library engine or the JSON package of Proc DS2.
If you are looking to create a report with the data in this transposed shape I would recommend Proc TABULATE.
I have a list of financial advisors and I need to pull 4 samples per advisor but catch is in those 4 samples I need to force 2 mortgages, 1 loan, 1 credit card lets say.
Is there a way in the Survey select statement to set the specific number of samples to pull per stratum? I know you can stratify on 1 category and set it as a equal number. I was hoping I could use a mapping of employee names + the number of samples left to pull for each category and have survey select utilize that to pull in a dynamic way.
I'm using this as an example but this only stratifies on employee first and gives me 4 per employee. I would need to further stratify on Product type and set that to a specific sample size per product.
proc surveyselect data=work.Emp_Table_Final
method=srs n=4 out=work.testsample SELECTALL;
strata Employee_No;
run;
Thanks i know it might sound complicated, but if i know its possible then i can google the rest
Yes, you can have a dataset be the target of the n option. That dataset must:
Contain the strata variables as well as a variable SAMPSIZE or _NSIZE_ with the number to select
Have the same type and length as the strata variables
Be sorted by the strata variables
Have an entry for every strata variable value
See the documentation for more details.
data sample_counts;
length sex $1;
input sex $ _NSIZE_;
datalines;
F 5
M 3
;;;;
run;
proc sort data=sashelp.class out=class;
by sex;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex;
run;
For two variables it's the same, you just need two variables in the sample_counts. Of course it makes it a lot more complicated, and you may want to produce this in an automated fashion.
proc sort data=sashelp.class out=class;
by sex age;
run;
data sample_counts;
length sex $1;
input sex $ age _NSIZE_;
datalines;
F 11 1
F 12 1
F 13 1
F 14 1
F 15 1
M 11 1
M 12 1
M 13 1
M 14 1
M 15 1
M 16 0
;;;;
run;
/* or do it in an automated way*/
data sample_counts;
set class;
by sex age; *your strata;
if first.age then do; *do this once per stratum level;
if age le 15 then _NSIZE_ = 1; *whatever your logic is for defining _NSIZE_;
else _NSIZE_=0;
output;
end;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex age;
run;
Here is my data :
data example;
input id sports_name;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I used proc freq procedure to get the cross tabulated frequency table and put that in a different data set and then transposed that data. Now i have missing values in some cases and count of the sports in rest of the cases.
Is there any better way to do this?
Thanks!!
You need a way to create something from nothing. You could have also used the SPARSE option in PROC FREQ. SAS names cannot have length greater than 32.
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
Same theory here, generate a list of all possible combinations using SQL instead of Proc Summary and then transposing the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;
Hell everyone --
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
It seems there is no permutation issue, you could try this:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
retain comb;
length comb $4;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.