I feel like I am making this more complicated than it should be. I have a sample dataset below with an ID column and a counter column (TCOUNT). The counter resets periodically, and I would like to create a dataset containing only the rows where the counter is at its maximum value just before it resets. My dataset also has thousands of IDs that I would need to do this for.
data test;
infile datalines delimiter=",";
informat ID $3.
TCOUNT 10.;
input ID $ TCOUNT;
datalines;
123,1
123,2
123,3
123,4
123,1
123,2
123,3
123,1
123,2
;
run;
My desired output in a new table would look like this:
ID TCOUNT
123 4
123 3
123 2
It might be easiest/clearest to first assign a label to each of the non-decreasing TCOUNT blocks of observations.
data groups;
set test;
by id ;
if first.id then group=0;
if first.id or tcount<lag(tcount) then group+1;
run;
Then it is a simple matter to find the last observation in each group.
data want;
set groups;
by id group;
if last.group;
run;
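One caveat about the first step: LAG returns the value stored the last time the function itself was executed, not necessarily the value from the previous observation, so it is safest to call it once per row, outside any condition. A slightly more defensive sketch of the grouping step, using a helper variable (prev_tcount is a name introduced here for illustration), might look like this:

```sas
data groups;
    set test;
    by id;
    /* Call LAG unconditionally so its queue is updated on every row */
    prev_tcount = lag(tcount);
    /* Start a new numbering for each ID */
    if first.id then group = 1;
    /* A drop in TCOUNT within an ID means the counter has reset */
    else if tcount < prev_tcount then group + 1;
    drop prev_tcount;
run;
```

With the sample data this should label the three runs as groups 1, 2 and 3, and the second step then works unchanged.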
I have a table of car registrations (https://i.stack.imgur.com/Qjnl6.png). I need to transpose it into one row per VIN with all the information about its registrations, so that I end up with something like this:
vin|company_1|start_date_1|end_date_1|company_2|start_date_2|end_date_2|...|company_n|start_date_n|end_date_n, where n is the maximum number of registrations. Please help with code or hints.
I tried proc transpose, but I got start_date and end_date in separate rows, so it doesn't work:
proc transpose data = test_vin name=VarName out= outdata;
by vin_number;
var company start_date_date9 end_date_date9;
run;
In the future, please do not post data as images.
You will need to do this manually with a data step. You'll need to first scan your dataset to find the maximum number of columns needed for an array to be sure that there are enough columns for each VIN.
We'll do this with a data step, arrays, and the retain statement. We'll keep adding values to each array until we reach the last row for a VIN. At that point we output the accumulated results; the arrays are reset at the first row of the next VIN.
Sample data:
data have;
input vin$ company$ start_date9:date9. end_date9:date9.;
format start_date9 end_date9 date9.;
datalines;
A company1 01JAN2020 05JAN2020
A company2 06JAN2020 10JAN2020
A company3 11JAN2020 15JAN2020
A company4 16JAN2020 19JAN2020
B company5 01FEB2020 02FEB2020
B company6 03FEB2020 10FEB2020
B company7 11FEB2020 20FEB2020
B company8 21FEB2020 28FEB2020
;
run;
Code:
data want;
set have;
by vin;
array company_[4] $;
array start_date_[4];
array end_date_[4];
/* Do not reset values at the start of each row */
retain company_: start_date_: end_date_:;
/* Reset the counter and values for each VIN */
if(first.vin) then do;
i = 1;
call missing(of company_:, of start_date_:, of end_date_:);
end;
else i+1;
/* Store each company and date */
company_[i] = company;
start_date_[i] = start_date9;
end_date_[i] = end_date9;
/* Only output one row per VIN */
if(last.VIN) then output;
format start_date_: end_date_: date9.;
keep vin company_: start_date_: end_date_:;
run;
Output:
vin company_1 company_2 company_3 company_4 start_date_1 ...
A company1 company2 company3 company4 01JAN2020 ...
B company5 company6 company7 company8 01FEB2020 ...
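The array size of 4 is hard-coded above to match the sample data. For real data, the scan for the maximum number of registrations could be sketched roughly as follows; max_n and n_regs are names made up for this example.

```sas
/* Count the registrations per VIN and store the largest count
   in a macro variable (max_n is a hypothetical name) */
proc sql noprint;
    select cats(max(n_regs)) into :max_n
    from (select count(*) as n_regs
          from have
          group by vin);
quit;

%put &=max_n;

/* The three array statements would then become:
   array company_[&max_n] $;
   array start_date_[&max_n];
   array end_date_[&max_n];
*/
```

Run this before the data step so the macro variable exists when the array statements are compiled.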
I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if the same value appears in two or more ID groups, I'd like to delete every observation with that value entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
I also swapped want and have, assuming you create what you want from what you have, and I assume have is sorted by ID and value.
Note that this will only flag 1 D above, not 3 Z.
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels), then you just need to keep track of whether any ID ever differs from the first one. The data must be sorted by VALUE for the BY statement to work.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First sort the data, then add a count variable set to 1. Next use PROC TRANSPOSE to flip the data from long to wide, then fill in the missing values with 0. This method is fully dynamic: it doesn't matter how many different injury types you have or how many records there are per person. There are other methods that are probably shorter, but I think this one is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
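One thing to watch for with real data: if the same Id/Injury pair can occur more than once, PROC TRANSPOSE will stop with an error because the ID value occurs twice within a BY group. Assuming repeated pairs should simply count once, a sketch that replaces the plain sort with a de-duplicating one (test_nodup is a name chosen here for illustration):

```sas
/* Sort and keep a single row per Id/Injury combination */
proc sort data=test out=test_nodup nodupkey;
    by id injury;
run;

/* Add the constant that becomes the 1 in the wide table */
data test2;
    set test_nodup;
    count = 1;
run;
```

The PROC TRANSPOSE and zero-fill steps can then be run as written.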
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!
I'm not very familiar with Do Loops in SAS and was hoping to get some help. I have data that looks like this:
Product A: 1
Product A: 2
Product A: 4
I'd like to transpose (easy) and flag that Product A: 3 is missing, but I need to do this iteratively to the i-th degree since the number of products is large.
If I run the transpose part in SAS, my first column will be 1, second column will be 2, and third column will be 4 - but I'd really like the third column to be missing and the fourth column to be 4.
Any thoughts? Thanks.
Get some sample data:
proc sort data=sashelp.iris out=sorted;
by species;
run;
Determine the largest column we will need to transpose to. Depending on your situation you may just want to hardcode this value using a %let max=somevalue; statement:
proc sql noprint;
select cats(max(sepallength)) into :max from sorted;
quit;
%put &=max;
Transpose the data using a data step:
data want;
set sorted;
by species;
retain _1-_&max;
array a[1:&max] _1-_&max;
if first.species then do;
do cnt = lbound(a) to hbound(a);
a[cnt] = .;
end;
end;
a[sepallength] = sepallength;
if last.species then do;
output;
end;
keep species _1-_&max;
run;
Notice we are defining an array of columns: _1,_2,_3,..._max. This happens in our array statement.
We then use by-group processing to populate these newly created columns for a single species at a time. For each species, on the first record, we clear the array. For each record of the species, we populate the appropriate element of the array. On the final record for the species, we output the array contents.
You need a way to tell SAS that you have 4 products and that the values are 1-4. In this example I create rows with a dummy (missing) ID that carry the needed information, then transpose using an ID statement to name the new variables after the value of product.
data product;
input id product @@;
cards;
1 1 1 2 1 4
2 2 2 3
;;;;
run;
proc print;
run;
data productspace;
if 0 then set product;
do product = 1 to 4;
output;
end;
stop;
run;
data productV / view=productV;
set productspace product;
run;
proc transpose data=productV out=wide(where=(not missing(id))) prefix=P;
by id;
var product;
id product;
run;
proc print;
run;
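The upper bound of 4 is hard-coded in the productspace step. If the highest product number actually present in the data is a good enough bound (an assumption; it will not cover product numbers beyond the largest one observed), it could be derived along these lines, with maxprod being a hypothetical macro variable name:

```sas
/* Highest product number present in the data */
proc sql noprint;
    select cats(max(product)) into :maxprod
    from product;
quit;

/* Dummy rows with a missing ID, one per product number 1..maxprod */
data productspace;
    if 0 then set product;
    do product = 1 to &maxprod;
        output;
    end;
    stop;
run;
```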
I am trying to convert a categorical variable (Product) to binary indicators and then want to know how many products each customer has.
The data is in the following format:
ID Product
C1 A
C1 B
C2 A
C3 B
C4 A
The code I am using for converting category to binary
IF PRODUCT="A" THEN PROD_A =1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B =1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
But when I count the number of products, it gives me 1 for every customer, whereas I am expecting 1 or 2.
I have tried
TOT_PROD = PROD_A + PROD_B;
but I get the same results
This is all inside one datastep, correct? If so you're processing only one line at a time. For each individual line the only possible values for PROD_A and PROD_B are one or zero. You need an aggregate function. For example, if your dataset is named PRODUCTS:
DATA X;
SET PRODUCTS;
IF PRODUCT="A" THEN PROD_A = 1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B = 1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
RUN;
(TOT_PROD will always be equal to 1 in X, but never mind for now).
Now sum them up:
proc sql;
create table prod_totals as
select product, sum(tot_prod) as total_products
from x
group by product;
quit;
More simply just skip the data step:
proc sql;
create table prod_totals as
select product, count(*) as total_products
from products
group by product;
quit;
Or use PROC SUMMARY or PROC MEANS instead of PROC SQL.
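If the goal is the number of distinct products per customer rather than the number of rows per product, the grouping would be by ID instead. A minimal sketch, assuming the PRODUCTS dataset has the ID and PRODUCT columns from the question (cust_totals is a name made up here):

```sas
proc sql;
    create table cust_totals as
    select id,
           /* A comparison evaluates to 1 or 0, so MAX flags "any row matched" */
           max(product = "A") as prod_a,
           max(product = "B") as prod_b,
           /* Number of different products held by the customer */
           count(distinct product) as tot_prod
    from products
    group by id;
quit;
```

For the sample data this should give a total of 2 for C1 and 1 for each of the other customers.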
I have assumed you only want 1 record output per id.
In the solutions below I have employed the DOW-Loop (DO-Whitlock).
If you wanted prod_a and prod_b just to help with the totals and if they're not required in the output, then you could use something like:
data want;
do until(last.id);
set have;
by id;
tot_prod=sum(tot_prod,product='A',product='B');
end;
run;
If you need prod_a and prod_b in the output, then you could use:
data want;
do until(last.id);
set have;
by id;
prod_a=(product='A');
prod_b=(product='B');
tot_prod=sum(tot_prod,prod_a,prod_b);
end;
run;
In both data steps the last product per id will be output along with the other variables and in the case of the 2nd data step example the last prod_a & prod_b per id will also be output.
To do this in the data step, you need retain. Make sure you've sorted the dataset by id first.
data prod_totals;
set products;
by ID;
retain prod_a prod_b;
if first.id then do; *initialize to zero for each new ID;
prod_a=0; prod_b=0;
end;
if product='A' then prod_a=1; *set to 1 for each one found;
else if product='B' then prod_b=1;
if last.id then do; *for last record in each ID, output and sum total;
total_products=sum(prod_a,prod_b);
output;
end;
keep id prod_a prod_b total_products;
run;