I am new to SAS and am trying to handle some customer data, but I'm not really sure how to do this.
What I have:
data transactions;
input ID $ Week Segment $ Average Freq;
datalines;
1 1 Sports 500 2
1 1 PC 400 3
1 2 Sports 350 3
1 2 PC 550 3
2 1 Sports 650 2
2 1 PC 700 3
2 2 Sports 720 3
2 2 PC 250 3
;
run;
What I want:
data transactions2;
input ID Week1_Sports_Average Week1_PC_Average Week1_Sports_Freq
Week1_PC_Freq
Week2_Sports_Average Week2_PC_Average Week2_Sports_Freq Week2_PC_Freq;
datalines;
1 500 400 2 3 350 550 3 3
2 650 700 2 3 720 250 3 3
;
run;
The only thing I got so far is this:
Data transactions3;
SET transactions;
if week=1 and Segment="Sports" then DO;
Week1_Sports_Freq=Freq;
Week1_Sports_Average=Average;
END;
else DO;
Week1_Sports_Freq=0;
Week1_Sports_Average=0;
END;
run;
This will be way too much work as I have a lot of weeks and more variables than just freq/avg.
Really hoping for some tips, as I'm stuck.
You can use PROC TRANSPOSE to create that structure. But you need to use it twice since your original dataset is not fully normalized.
The first PROC TRANSPOSE will get the AVERAGE and FREQ readings onto separate rows.
proc transpose data=transactions out=tall ;
by id week segment notsorted;
var average freq ;
run;
If you don't mind having the variables named slightly differently than in your proposed solution you can just use another proc transpose to create one observation per ID.
proc transpose data=tall out=want delim=_;
by id;
id segment _name_ week ;
var col1 ;
run;
If you want the exact names you had before, you could add a data step that first creates a variable you can use in the ID statement of PROC TRANSPOSE.
data tall ;
set tall ;
length new_name $32 ;
new_name = catx('_',cats('WEEK',week),segment,_name_);
run;
proc transpose data=tall out=want ;
by id;
id new_name;
var col1 ;
run;
Note that a numbered series of variables is easier to work with in SAS if the number appears at the end of the name, because you can then use a variable list. So instead of WEEK1_AVERAGE, WEEK2_AVERAGE, ... you would use WEEK_AVERAGE_1, WEEK_AVERAGE_2, ... and could then write a variable list like WEEK_AVERAGE_1 - WEEK_AVERAGE_5 in your SAS code.
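As a minimal sketch of the variable-list point above (the data set WIDE, the TOTALS output and the values are made up for illustration, not taken from the question), a numbered range list lets a single function call cover the whole series:

```sas
/* Hypothetical wide data set with the week number as a suffix */
data wide;
  input id week_average_1-week_average_3;
  datalines;
1 500 350 420
2 650 720 610
;
run;

data totals;
  set wide;
  /* The numbered range list expands to week_average_1, _2 and _3 */
  total_average = sum(of week_average_1-week_average_3);
  mean_average  = mean(of week_average_1-week_average_3);
run;

proc print data=totals;
run;
```

With the number in the middle of the name (WEEK1_AVERAGE, WEEK2_AVERAGE, ...) this kind of range list is not available, so every variable would have to be typed out.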
I just performed the Fisher test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in SAS.
data my_table;
input var1 var2 @@;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
The test somehow ignores that the table consists of the 4 variables but when I print the table it looks fine. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data but what?
You do NOT have two variables that each have two categories.
Read the data in this way instead.
data have ;
do var1=1,2 ; do var2=1,2;
input count @@;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;
Assume I have two clusters of clients, and for each client I have a value: how much the client is worth to me. I now want to create a summary table with rows that sub-total all the clients' values for each cluster.
Example input table clients:
client cluster value
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
Expected output wanted:
cluster client value comment
1 1 3 kEuro
1 3 4 kEuro
1 4 1 kEuro
1 . 8 sum for cluster 1
2 2 1 kEuro
2 5 5 kEuro
2 . 6 sum for cluster 2
I am now looking for the most efficient way of achieving this. What I came up with so far is the code below, which does not fully solve it: Instead of sum for cluster x I only see sum f.
proc sql;
create table part_1 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 1;
quit;
proc sql;
create table part_2 as
select cluster, client, value
, "kEuro" as comment
from clients
where cluster = 2;
quit;
/* The summation over the clusters */
proc means noprint data=part_1;
output out=psum_1 (drop=_type_ _freq_) sum=;
run;
proc means noprint data=part_2;
output out=psum_2 (drop=_type_ _freq_) sum=;
run;
/* The concatenation stage 1 */
data puzzle_1;
set part_1 psum_1 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 1';
end;
run;
data puzzle_2;
set part_2 psum_2 (in=in2);
if in2 then do;
client = .;
comment = 'sum for cluster 2';
end;
run;
/* The concatenation stage 2 */
data wanted;
set puzzle_1 puzzle_2;
run;
Is there a better way to approach this problem, ideally something which I could loop over, if I have more than two clusters?
You can use PROC MEANS/SUMMARY to generate the summary rows. Then use a SET with a BY to add the summary rows to the detail rows.
data have;
input client cluster value ;
cards;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
proc sort data=have; by cluster client; run;
proc means data=have noprint nway ;
by cluster;
var value;
output out=sum sum= ;
run;
data want;
set have sum ;
by cluster;
length comment $20 ;
if last.cluster then comment=catx(' ','sum for cluster',cluster);
else comment='kEuro';
run;
proc print ;
var cluster client value comment;
run;
Typically you would not add summary rows to the data set containing the details being aggregated. Instead, consider using a reporting tool such as Proc REPORT, Proc TABULATE or Proc SUMMARY.
Example:
Show details and summary row using Proc REPORT
data have;
input client cluster value;
datalines;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
ods html file='report.html';
proc report data=have;
columns cluster client value;
define cluster / order;
define client / display;
define value / sum;
break after cluster / summarize style=[fontstyle=italic background=lightgray];
run;
ods html close;
If you absolutely feel the need to make your original data more difficult to use further downstream:
add a CLASS cluster statement to your Proc MEANS
SORT original data by cluster
stack original data with MEANS output using SET original means_output; by cluster;
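The three steps above could be sketched roughly as follows. It assumes the same HAVE data as in the example; MEANS_OUTPUT and STACKED are names chosen here for illustration:

```sas
data have;
  input client cluster value;
  datalines;
1 1 3
2 2 1
3 1 4
4 1 1
5 2 5
;
run;

/* Step 1: one summary row per cluster, using CLASS instead of BY */
proc means data=have noprint nway;
  class cluster;
  var value;
  output out=means_output(drop=_type_ _freq_) sum=;
run;

/* Step 2: the detail rows must be sorted for the BY statement below */
proc sort data=have;
  by cluster;
run;

/* Step 3: interleave, so each cluster's summary row follows its detail rows */
data stacked;
  set have means_output;
  by cluster;
run;
```

CLIENT is missing on the summary rows because MEANS_OUTPUT does not contain that variable.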
I have a set of pre and post scores, with values that can be 1 or 2, e.g.:
Pre Post
1 2
1 1
2 2
2 1
1 2
2 1
etc.
I need to create a 2x2 table that lists the frequencies, with percentages ONLY in the total row/column:
        1          2          Total
1       14         60         74 / 30%
2       38         12         50 / 20%
Total   52 / 21%   72 / 29%   248
It doesn't need to be formatted specifically with the / between the n and percent, they can be on different lines. I just need to make sure the total percentages (no cumulative percentages) are in the table.
I think that I should use proc tabulate to get this, but I'm new to SAS and haven't been able to figure it out. Any help would be greatly appreciated.
Code I've tried:
proc tabulate data=.bilirubin order=data;
class pre ;
var post ;
table pre , post*( n colpctsum);
run;
You could make your own report. For example you could use PROC SUMMARY to get frequencies. Add a data step to calculate the percent and generate a character variable with the text you want to display. Then use PROC REPORT to display it.
proc summary data=have ;
class pre post ;
output out=summary ;
run;
proc format ;
value total .='Total (%)';
run;
data results ;
set summary ;
length display $20 ;
if _type_=0 then n=_freq_;
retain n;
if _type_ in (0,3) then display = put(_freq_,comma9.);
else display = catx(' ',put(_freq_,comma9.),cats('(',put(_freq_/n,percent8.2),')'));
run;
proc report missing data=results ;
column pre display,post n ;
define pre / group ;
define post / across ;
define n / noprint ;
define display / display ' ';
format pre post total.;
run;
I have a list of financial advisors and I need to pull 4 samples per advisor, but the catch is that those 4 samples must include, say, 2 mortgages, 1 loan and 1 credit card.
Is there a way in PROC SURVEYSELECT to set the specific number of samples to pull per stratum? I know you can stratify on 1 category and set it to an equal number. I was hoping I could use a mapping of employee names plus the number of samples left to pull for each category, and have SURVEYSELECT use that to pull the samples dynamically.
I'm using this as an example but this only stratifies on employee first and gives me 4 per employee. I would need to further stratify on Product type and set that to a specific sample size per product.
proc surveyselect data=work.Emp_Table_Final
method=srs n=4 out=work.testsample SELECTALL;
strata Employee_No;
run;
Thanks. I know it might sound complicated, but if I know it's possible then I can google the rest.
Yes, the n option can point to a dataset instead of a number. That dataset must:
Contain the strata variables as well as a variable named SampleSize or _NSIZE_ with the number to select
Define the strata variables with the same type and length as in the input dataset
Be sorted by the strata variables
Have an entry for every stratum (every combination of strata variable values) in the input dataset
See the documentation for more details.
data sample_counts;
length sex $1;
input sex $ _NSIZE_;
datalines;
F 5
M 3
;;;;
run;
proc sort data=sashelp.class out=class;
by sex;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex;
run;
For two variables it's the same, you just need two variables in the sample_counts. Of course it makes it a lot more complicated, and you may want to produce this in an automated fashion.
proc sort data=sashelp.class out=class;
by sex age;
run;
data sample_counts;
length sex $1;
input sex $ age _NSIZE_;
datalines;
F 11 1
F 12 1
F 13 1
F 14 1
F 15 1
M 11 1
M 12 1
M 13 1
M 14 1
M 15 1
M 16 0
;;;;
run;
/* or do it in an automated way*/
data sample_counts;
set class;
by sex age; *your strata;
if first.age then do; *do this once per stratum level;
if age le 15 then _NSIZE_ = 1; *whatever your logic is for defining _NSIZE_;
else _NSIZE_=0;
output;
end;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex age;
run;
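Applied to the advisor question, the same idea might look like the sketch below. Only Employee_No comes from the question; the toy data, the variable Product_Type and its values, Account_Id and the seed are assumptions made for illustration:

```sas
/* Toy data; Product_Type, its values and Account_Id are assumed names */
data emp_table;
  length Employee_No $4 Product_Type $12;
  input Employee_No $ Product_Type $ Account_Id;
  datalines;
E001 Mortgage 1
E001 Mortgage 2
E001 Mortgage 3
E001 Loan 4
E001 Loan 5
E001 CreditCard 6
E002 Mortgage 7
E002 Mortgage 8
E002 Loan 9
E002 CreditCard 10
E002 CreditCard 11
;
run;

proc sort data=emp_table out=emp_sorted;
  by Employee_No Product_Type;
run;

/* One row per employee/product stratum with the number to draw */
data sample_counts;
  set emp_sorted(keep=Employee_No Product_Type);
  by Employee_No Product_Type;
  if first.Product_Type;
  select (Product_Type);
    when ('Mortgage')   _NSIZE_ = 2;
    when ('Loan')       _NSIZE_ = 1;
    when ('CreditCard') _NSIZE_ = 1;
    otherwise delete; /* assumes no other product types are present */
  end;
run;

proc surveyselect data=emp_sorted n=sample_counts method=srs
                  seed=12345 selectall out=testsample;
  strata Employee_No Product_Type;
run;
```

Because SAMPLE_COUNTS is derived from the sorted input, it is already in strata order and has one entry per stratum. If the real data contain other product types, they would need to be filtered out of the input as well, since every stratum in the input needs a matching row.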
Here is my data :
data example;
input id sports_name :$16.;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I used PROC FREQ to get the cross-tabulated frequency table, put that in a different data set, and then transposed that data. Now I have missing values in some cases and the count of the sports in the rest.
Is there any better way to do this?
Thanks!!
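You need a way to create something from nothing: the id/sport combinations that never occur in the data have to be generated. PROC SUMMARY with the COMPLETETYPES option does that here; you could also have used the SPARSE option in PROC FREQ. Note that SAS names cannot be longer than 32 characters, so the sport names have to fit within that limit to be used as variable names.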
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
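For comparison, the SPARSE alternative mentioned above might look like this sketch. The names EXAMPLE2, FREQ_SPARSE and WIDE_SPARSE are chosen here for illustration:

```sas
data example2;
  input id sports_name :$16.;
  datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;

/* SPARSE adds the zero-count id/sport combinations to the OUT= data set */
proc freq data=example2 noprint;
  tables id*sports_name / sparse out=freq_sparse(drop=percent);
run;

/* COUNT becomes one column per sport; _LABEL_ is created because COUNT has a label */
proc transpose data=freq_sparse out=wide_sparse(drop=_name_ _label_);
  by id;
  var count;
  id sports_name;
run;

proc print data=wide_sparse;
run;
```

COUNT is only a 1/0 flag if each id lists a sport at most once. With repeated rows it would hold the number of occurrences, and a data step would be needed to turn it into a flag.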
Same theory here: generate a list of all possible combinations using SQL instead of Proc Summary, and then transpose the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name
order by a.id, a_x.sports_name; /* PROC TRANSPOSE with BY id needs sorted input */
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;