I have a dataset that looks something like this:
IDnum State Product Consumption
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
... ... ... ...
And i would like to create a new dataset, where i have one row for each IDnum, and a new dummy variable for each different product (in my real dataset i have close to 1000 products), along with it's associated consumption. It would look like something in these lines
IDnum State Prod.A Cons.A Prod.B Cons.B Prod.C Cons.C Prod.D Cons.D
123 MI yes 30 yes 20 yes 45 no -
456 NJ yes 15 no - no - yes 10
789 MI no - yes 60 no - no -
... ... ... ... ... ... ... ... ... ...
Some variables like "State" doesn't change within the same IDnum, but each row in the original bank are equivalent to one purchase, hence the change in the "product" and "consumption" variables for the same IDnum. I would like that my new dataset showed all the consumption habits of each costumer in one single row, but so far i have failed.
Any help would be greatly apreciated.
Without yes/no variables, it's really easy:
data input;
length State $2 Product $1;
input IDnum State Product Consumption;
cards;
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
;
run;
proc transpose data=input out=output(drop=_NAME_) prefix=Cons_;
var Consumption;
id Product;
by IDnum State;
run;
Adding the yes/no fields:
proc sql;/* from column names or alternatively
create it from source data directly if not taking too long */
create table work.products as
select scan(name, 2, '_') as product length=1
from dictionary.columns
where libname='WORK' and memname='OUTPUT'
and upcase(name) like 'CONS_%';
quit;
filename vars temp;/* write a temp file containing variable definitions
in desired order */
data _null_;
set work.products end=last;
file vars;
length str $40;
if _N_ = 1 then put 'LENGTH ';
str = catt('Prod_', product, ' $3');
put str;
str = catt('Cons_', product, ' 8');
put str;
if last then put ';';
run;
options source2;
data output2;
length IdNum 8 State $2;
%include vars;
set output;
array prod{*} Prod_:;
array cons{*} Cons_:;
drop i;
do i=1 to dim(prod);
if coalesce(cons(i), 0) ne 0 then prod(i)='yes';
else prod(i)='no';
end;
run;
Related
I tried searching but couldn't exactly find what I was looking for. I have a dataset with multiple rows per ID. I'd like to add a variable called maxdec and show a 1 for each row that has the max dec for each ID.
Sample Dataset:
ID DEC
123 1
123 2
123 2
123 2
456 2
456 3
456 3
Desired Output:
ID DEC MAXDEC
123 1 .
123 2 1
123 2 1
123 2 1
456 2 .
456 2 .
456 3 1
It is easier to define it with 1 or 0 instead of 1 or missing.
proc sql;
create table want as
select id,dec, dec=max(dec) as maxdec
from have
group by id
;
quit;
proc sort data=have;
by id;
proc summary data=have;
class id;
var dec;
output out=max_info max=max_value;
run;
data want;
merge have
max_info (keep=id max_value)
;
by id;
if dec=max_value then maxdec=1;
run;
The proc summary calculates the maximum value of DEC for each ID, and outputs as variable MAX_VALUE in dataset MAX_INFO. The subsequent data step assigns MAXDEC=1 if the current value of DEC is equal to MAX_VALUE for that ID.
Here is a DoW loop approach
data have;
input ID DEC;
datalines;
123 1
123 2
123 2
123 2
456 2
456 3
456 3
;
data want(drop = m);
do _N_ = 1 by 1 until (last.id);
set have;
by id;
m = max(maxdex, dec);
end;
do _N_ = 1 to _N_;
set have;
maxdex = ifn(dec = m, 1, .);
output;
end;
run;
I am trying to count number of times all the values appear in the entire dataset. So I want a table/output with values - # of times it appears in the dataset. I have used proc sql, proc freq without any luck.
data Data1;
input xx yy zz;
datalines;
123 456 234
456 123 345
234 345 123
;
run;
Want a table output with 123 - 3, 234 - 2, etc.
The easiest option (I think) is to create a dataset that puts all the values in a single column, then you can just run a proc freq off that.
data have;
input xx yy zz;
datalines;
123 456 456
456 123 234
234 234 123
;
run;
data single_column;
set have;
array vars{*} xx yy zz;
do i = 1 to dim(vars);
all_vals = vars{i};
output;
end;
keep all_vals;
run;
proc freq data=single_column;
table all_vals / out=want;
run;
I have a group of numbers, each labeled by a group letter, like
Group | x | y
A 135 12
B 281 32
C 221 2
A 201 4
B 294 4
C 950 ... etc
I am trying to run ttest on it, but ONLY on groups with prefix A or C
I cannot use "data = " statement.
So far I have
proc ttest where group = 'A', 'C'
var x y;
run;
But this doesnt work. Any help?
Here you go:
proc ttest data=dataname;
where Group="A" OR Group="C";
var x y;
run;
You can use OR but then you need to list the variable each time:
Where Group = 'A' OR Group = 'B';
Or you can use IN
Where Group in ('A', 'B');
Here's a worked example. Check the results of the check_where table. And look at the different results for the t-test, specifically the different p-values and N to show that you're using different data. Good Luck.
data have;
input Group $ x y;
cards;
A 135 12
B 281 32
C 221 2
A 201 4
B 294 4
C 950 8
;
run;
data check_where;
set have;
where group='A' or 'C';
run;
proc ttest data=have;
where group = 'A' or 'C';
var x y;
run;
proc ttest data=have;
where group in ('A', 'B');
var x y;
run;
proc ttest;
where group = 'A' or 'C';
var x y;
run;
Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I come up with a macro loop that involves PROC RANK and PROC APPEND like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i) out=tmp;
var VAR1-3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially you counts the number of lower values in the lookup dataset for each value in the data to be ranked.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 <= B.var1.
) as var1
from B;
quit;
Which can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It looks through the list of variable and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. <= scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below the length of time the query takes depends not just on the number of observations in the dataset, but how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if the dataset B contains a large number of unique values (in the ranking variables) the process will take significantly longer then the times shown. This is more likely to happen if your data is not integers as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41
Given the following dataset:.
obs var1 var2 var3
1 123 456 .
2 123 . 789
3 . 456 789
How does one go about to append all the variables into a single variable whilst ignoring the empty observations (denoted by ".")?
Desired output:.
obs var4
1 123
2 123
3 456
4 456
5 789
6 789
Data step:.
data have;
input
var1 var2 var3; cards;
123 456 .
123 . 789
. 456 789
;run;
Not sure why you read the numbers in as char, but if I change to num, it could be done like this:
data have;
input var1 var2 var3;
cards;
123 456 .
123 . 789
. 456 789
;run;
data want (keep=var4);
set have;
var4=var1;if var4 ne . then output;
var4=var2;if var4 ne . then output;
var4=var3;if var4 ne . then output;
run;
OK, let's assume you have a file vith the values in it, and you do not know how many variables are in each row. First I need to create a sample textfile:
filename x temp;
data _nulL_;
file x;
put "123 456 . ";
put "123 . 789 ";
put ". 456 789 ";
run;
Then I need to read the first line and count the number of variables:
data _null_;
infile x;
input;
call symputx("number_of_variables",put(countw(_infile_," ","c"),best.));
stop;
run;
%put &number_of_variables;
Now I can dynamically read the variables:
%macro doit();
data have;
infile x;
input
%do i=1 %to &number_of_variables;
var&i
%end;
;
run;
data want (keep=var%eval(&number_of_variables + 1));
set have;
%do i=1 %to &number_of_variables;
var%eval(&number_of_variables + 1)=var&i;
if var%eval(&number_of_variables + 1) ne . then output;
%end;
run;
%mend;
%doit;
You can use proc transpose to do this but there is a trick to doing so. You will need to append a unique identifier to each row, prior to doing the transpose.
I've taken #Stig's sample data and added the observation number to use as a unique identifier:
data have;
input var1 var2 var3;
x = _n_; * ADDING A UNIQUE IDENTIFIER TO EVERY ROW;
cards;
123 456 .
123 . 789
. 456 789
;run;
Then it's simply a case of running proc transpose:
proc transpose data=have out=xx;
by x;
run;
And finally, remove any results where col1 is missing, and add in the observation number:
data want;
obs = _n_;
set xx (keep=col1);
where col1 ne .;
run;
As the order is not important then you can do this in one step, using arrays. As the data step moves through each row, the array enables the variable values to be stored in memory, so you can loop through them. I've set it up so that each time a non-missing value is found, then output it to the new variable.
In creating the array, I've set it to var1--var3, the double dash means all variables between var1 and var3 inclusive. If your real variables are numbered the same way then you can use var1-var3, which means all sequential numbers between the two variables.
data have;
input var1 var2 var3;
datalines;
123 456 .
123 . 789
. 456 789
;
run;
data want;
set have;
array allnums var1--var3;
do i = 1 to dim(allnums);
if not missing(allnums{i}) then do;
var4 = allnums{i};
output;
end;
end;
drop var1--var3 i;
run;