I have data like below (Dataset name - Have)
ID NAME AMOUNT PREFER
ABC Test1 123 Pref1
ABC Test1 456 Pref1
ABC Test1 789 Pref1
ABC Test1 123 Pref2
ABC Test1 456 Pref2
ABC Test1 789 Pref2
and I want only the first group as output:
ID NAME AMOUNT PREFER
ABC Test1 123 Pref1
ABC Test1 456 Pref1
ABC Test1 789 Pref1
What I have tried so far is a simple data step like this:
Data want;
set have;
by ID PREFER;
if first.PREFER;
run;
This gives me only the first row of each PREFER group:
ID NAME AMOUNT PREFER
ABC Test1 123 Pref1
ABC Test1 123 Pref2
Please suggest something using a data step or PROC SQL.
It sounds as though you probably want something like this:
data have;
input ID $ NAME $ AMOUNT PREFER $;
cards;
ABC Test1 123 Pref1
ABC Test1 456 Pref1
ABC Test1 789 Pref1
ABC Test1 123 Pref2
ABC Test1 456 Pref2
ABC Test1 789 Pref2
;
run;
data want;
set have;
by id;
retain t_prefer;
if first.id then t_prefer = prefer;
if prefer = t_prefer;
drop t_prefer;
run;
The trick is to use a retain statement so that a copy of the value of prefer from the first row per id is carried over between iterations of the data step, and you can then output only rows with that value of prefer.
You could just keep track of which group number the current record is in.
data want;
set have;
by ID PREFER;
if first.id then group=0;
group+first.prefer;
if group=1;
run;
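Since the question also asks about PROC SQL, here is a minimal sketch of one possible SQL variant. It rests on an assumption that is not in the original answers: SQL has no notion of row order, so "first group" is taken to mean the smallest PREFER value within each ID. The table name want_sql is made up for illustration.

```sas
data have;
  input ID $ NAME $ AMOUNT PREFER $;
  datalines;
ABC Test1 123 Pref1
ABC Test1 456 Pref1
ABC Test1 789 Pref1
ABC Test1 123 Pref2
ABC Test1 456 Pref2
ABC Test1 789 Pref2
;
run;

proc sql;
  /* Assumption: "first" group = lowest PREFER value per ID.       */
  /* SAS remerges min(PREFER) back onto the detail rows, so the    */
  /* HAVING clause keeps only the rows that belong to that group.  */
  create table want_sql as
  select *
  from have
  group by ID
  having PREFER = min(PREFER);
quit;
```

The log will note that summary statistics were remerged with the original data, and the row order of the result is not guaranteed. If "first" has to mean the first group in input order, the data step versions above are the safer choice.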
I am trying to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried a multitude of approaches: some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 1
First, read up on how the IF statement works in a data step.
I've edited your code to make it work:
proc sort data=have;
by prod_id store cust;
run;
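data want(drop=cust week);
  set have;
  by prod_id store cust;
  /* Restart the counter on the first row of each prod_id/store group */
  if first.store then count = 0;
  /* The sum statement retains count and adds 1 once per customer */
  if last.cust then count + 1;
  /* Write one row per prod_id/store group */
  if last.store then output;
run;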
If you have any questions, ask.
One caveat when using the COUNT() aggregate function in SQL for this: it does not count missing values of the variable. The count_plus_missing column adds 1 for any group that contains a missing cust.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id ,store
;
quit;
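To make the missing-value behaviour concrete, here is a small sketch on made-up data (the demo and counts table names and the three rows are illustrative, not from the question). One of the three rows has a missing cust, entered as a single period.

```sas
data demo;
  input prod_id cust $ store $;
  datalines;
1 A ABC
1 . ABC
1 B ABC
;
run;

proc sql;
  /* count(distinct cust) skips the missing value;                 */
  /* max(missing(cust)) is 1 when the group has any missing cust.  */
  create table counts as
  select prod_id
       , store
       , count(distinct cust) as count
       , count(distinct cust) + max(missing(cust)) as count_plus_missing
  from demo
  group by prod_id, store;
quit;
```

Here count is 2 (A and B) and count_plus_missing is 3, because the missing customer is treated as one extra distinct value.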
I am trying to count the number of times each value appears in the entire dataset, so I want an output table of each value and the number of times it appears. I have tried proc sql and proc freq without any luck.
data Data1;
input xx yy zz;
datalines;
123 456 234
456 123 345
234 345 123
;
run;
I want an output table like 123 - 3, 234 - 2, etc.
The easiest option (I think) is to create a dataset that puts all the values in a single column, then you can just run a proc freq off that.
data have;
input xx yy zz;
datalines;
123 456 456
456 123 234
234 234 123
;
run;
data single_column;
set have;
array vars{*} xx yy zz;
do i = 1 to dim(vars);
all_vals = vars{i};
output;
end;
keep all_vals;
run;
proc freq data=single_column;
table all_vals / out=want;
run;
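The question mentions proc sql, so for comparison here is a rough sketch of the same stacking idea in SQL. It assumes the three columns are known in advance and are all numeric; the table name want_sql and the column name count are made up for illustration.

```sas
data have;
  input xx yy zz;
  datalines;
123 456 456
456 123 234
234 234 123
;
run;

proc sql;
  /* Stack the three columns into one with UNION ALL (which keeps */
  /* duplicates), then count the rows per distinct value.         */
  create table want_sql as
  select all_vals, count(*) as count
  from (select xx as all_vals from have
        union all
        select yy from have
        union all
        select zz from have)
  group by all_vals;
quit;
```

With many columns the array approach above scales better, because the SQL version needs one SELECT per column.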
I want to split row-level values (the loans associated with an account number) in a SAS table into separate rows.
Please find an example below.
Input
Account Number Loans
123 abc, def, ghi
456 jkl, mnopqr, stuv
789 w, xyz
Output
Account Numbers Loans
123 abc
123 def
123 ghi
456 jkl
456 mnopqr
456 stuv
789 w
789 xyz
Loans are separated by commas and they don't have a fixed length.
Use countw() to count the number of values on a line and scan() to pick them out.
Both take an optional last argument that specifies the delimiter, which in your case is a comma.
data Loans (keep= AccountNo Loan);
  infile datalines truncover;
  /* @n is the column pointer; the loan list starts in column 5 of the data lines below */
  input @1 AccountNo 3. @5 LoanList $250.;
  if length(LoanList) gt 240 then put 'WARNING: You might need to extend LoanList';
  label AccountNo = 'Account Number' Loan = 'Loans';
  do loanNo = 1 to countw(LoanList, ',');
    /* strip() removes the blank that follows each comma */
    Loan = strip(scan(LoanList, loanNo, ','));
    output;
  end;
datalines;
123 abc, def, ghi
456 jkl, mnopqr, stuv
789 w, xyz
;
proc print data=Loans label noobs;
run;
The reverse operation requires different techniques.
To enable by AccountNo processing, we must first construct a SAS dataset from the input and then read that back in with a set statement.
data Loans;
infile datalines truncover;
input @1 AccountNo 3. @5 Loan $25.;
datalines;
123 15-abc
123 15-def
123 15-ghi
456 99-jkl
456 99-mnopqr
456 99-stuv
789 77-w
789 77-xyz
;
data LoanLists;
set Loans;
by AccountNo;
Now make Loanlist long enough and override the default behaviour of SAS, which is to re-initialise all variables for every observation (= row of data).
format Loanlist $250.;
retain Loanlist;
Collect all loans for an account, separating them with a comma and a blank.
if first.AccountNo then Loanlist = Loan;
else Loanlist = catx(', ',Loanlist,Loan);
if length(LoanList) gt 240 then put 'WARNING: you might need to extend LoanList';
Keep only the full list per account.
if last.AccountNo;
drop Loan;
proc print;
run;
Given the following dataset:
obs var1 var2 var3
1 123 456 .
2 123 . 789
3 . 456 789
How does one go about appending all the variables into a single variable while ignoring the missing values (denoted by ".")?
Desired output:
obs var4
1 123
2 123
3 456
4 456
5 789
6 789
Data step:
data have;
  input var1 $ var2 $ var3 $;
cards;
123 456 .
123 . 789
. 456 789
;
run;
Not sure why you read the numbers in as char, but if I change to num, it could be done like this:
data have;
input var1 var2 var3;
cards;
123 456 .
123 . 789
. 456 789
;run;
data want (keep=var4);
set have;
var4=var1;if var4 ne . then output;
var4=var2;if var4 ne . then output;
var4=var3;if var4 ne . then output;
run;
OK, let's assume you have a file with the values in it, and you do not know how many variables are in each row. First I need to create a sample text file:
filename x temp;
data _null_;
file x;
put "123 456 . ";
put "123 . 789 ";
put ". 456 789 ";
run;
Then I need to read the first line and count the number of variables:
data _null_;
infile x;
input;
call symputx("number_of_variables",put(countw(_infile_," ","c"),best.));
stop;
run;
%put &number_of_variables;
Now I can dynamically read the variables:
%macro doit();
data have;
infile x;
input
%do i=1 %to &number_of_variables;
var&i
%end;
;
run;
data want (keep=var%eval(&number_of_variables + 1));
set have;
%do i=1 %to &number_of_variables;
var%eval(&number_of_variables + 1)=var&i;
if var%eval(&number_of_variables + 1) ne . then output;
%end;
run;
%mend;
%doit;
You can use proc transpose to do this but there is a trick to doing so. You will need to append a unique identifier to each row, prior to doing the transpose.
I've taken @Stig's sample data and added the observation number to use as a unique identifier:
data have;
input var1 var2 var3;
x = _n_; * ADDING A UNIQUE IDENTIFIER TO EVERY ROW;
cards;
123 456 .
123 . 789
. 456 789
;run;
Then it's simply a case of running proc transpose:
proc transpose data=have out=xx;
by x;
run;
And finally, remove any results where col1 is missing, and add in the observation number:
data want;
obs = _n_;
set xx (keep=col1);
where col1 ne .;
run;
As the order is not important, you can do this in one step using arrays. As the data step reads each row, the array lets you refer to that row's variables by index, so you can loop through them. I've set it up so that each non-missing value found is written to the new variable as its own row.
When creating the array, I've used var1--var3. The double dash selects all variables positioned between var1 and var3 in the dataset, inclusive. If your real variables share a prefix and are numbered consecutively, you can use var1-var3 instead, which selects var1, var2 and var3 by their numeric suffix.
data have;
input var1 var2 var3;
datalines;
123 456 .
123 . 789
. 456 789
;
run;
data want;
set have;
array allnums var1--var3;
do i = 1 to dim(allnums);
if not missing(allnums{i}) then do;
var4 = allnums{i};
output;
end;
end;
drop var1--var3 i;
run;
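To illustrate the difference between the two variable-list forms mentioned above, here is a small sketch on a made-up dataset (the names demo, dims and other are hypothetical) in which an unrelated variable sits between var1 and var3.

```sas
data demo;
  /* Creation order defines position: var1, other, var2, var3 */
  var1 = 1; other = 99; var2 = 2; var3 = 3;
run;

data dims (keep=n_positional n_numbered);
  set demo;
  /* Double dash: every variable positioned from var1 through var3 */
  array positional{*} var1--var3;
  /* Single dash: var1, var2, var3 by numeric suffix */
  array numbered{*} var1-var3;
  n_positional = dim(positional);
  n_numbered = dim(numbered);
  put n_positional= n_numbered=;
run;
```

The positional list picks up other as well, so it has four elements, while the numbered list has three. Note that a double-dash list used in an array must not span both numeric and character variables.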
I have a dataset that looks something like this:
IDnum State Product Consumption
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
... ... ... ...
I would like to create a new dataset with one row for each IDnum and a new dummy variable for each distinct product (in my real dataset I have close to 1000 products), along with its associated consumption. It would look something like this:
IDnum State Prod.A Cons.A Prod.B Cons.B Prod.C Cons.C Prod.D Cons.D
123 MI yes 30 yes 20 yes 45 no -
456 NJ yes 15 no - no - yes 10
789 MI no - yes 60 no - no -
... ... ... ... ... ... ... ... ... ...
Some variables, like "State", don't change within the same IDnum, but each row in the original dataset corresponds to one purchase, hence the change in the "Product" and "Consumption" variables for the same IDnum. I would like my new dataset to show all the consumption habits of each customer in a single row, but so far I have failed.
Any help would be greatly appreciated.
Without yes/no variables, it's really easy:
data input;
length State $2 Product $1;
input IDnum State Product Consumption;
cards;
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
;
run;
proc transpose data=input out=output(drop=_NAME_) prefix=Cons_;
var Consumption;
id Product;
by IDnum State;
run;
Adding the yes/no fields:
proc sql;/* from column names or alternatively
create it from source data directly if not taking too long */
create table work.products as
select scan(name, 2, '_') as product length=1
from dictionary.columns
where libname='WORK' and memname='OUTPUT'
and upcase(name) like 'CONS_%';
quit;
filename vars temp;/* write a temp file containing variable definitions
in desired order */
data _null_;
set work.products end=last;
file vars;
length str $40;
if _N_ = 1 then put 'LENGTH ';
str = catt('Prod_', product, ' $3');
put str;
str = catt('Cons_', product, ' 8');
put str;
if last then put ';';
run;
options source2;
data output2;
length IdNum 8 State $2;
%include vars;
set output;
array prod{*} Prod_:;
array cons{*} Cons_:;
drop i;
do i=1 to dim(prod);
if coalesce(cons(i), 0) ne 0 then prod(i)='yes';
else prod(i)='no';
end;
run;