(SAS) Removing duplicates during concatenation

I'm using the following data step to concatenate multiple observations' values into one variable:
data Data_PreFinal;
set work.reasons;
by Number;
length Changes $4000.;
retain Changes;
if first.Number then Changes = EndoReason;
else Changes = catx(', ', Changes, EndoReason);
if last.Number then output;
run;
For example, I would like to ensure that if the dataset Reasons looks like this:
Number EndoReason
1 Bucket1
1 Bucket2
1 Bucket1
1 Bucket3
1 Bucket2
1 Bucket2
2 Bucket2
2 Bucket2
2 Bucket1
2 Bucket2
that the resulting dataset, Data_PreFinal, looks like this:
Number EndoReason
1 Bucket1, Bucket2, Bucket3
2 Bucket2, Bucket1
instead of listing out all the duplicate values in the EndoReason variable.
Any help will be greatly appreciated!
Thanks!

Just do a search in the current Changes string for the specific row's value, and only concatenate if it doesn't already exist. The INDEX function is the one to use, and I've also modified your code slightly to use CALL CATX instead of CATX (I think it's neater in these situations).
data reasons;
input Number EndoReason $;
datalines;
1 Bucket1
1 Bucket2
1 Bucket1
1 Bucket3
1 Bucket2
1 Bucket2
2 Bucket2
2 Bucket2
2 Bucket1
2 Bucket2
;
run;
data Data_PreFinal;
set work.reasons;
by Number;
length Changes $4000.;
retain Changes;
if first.Number then call missing(Changes);
if not index(Changes,trim(EndoReason)) then call catx(', ', Changes, EndoReason);
if last.Number then output;
run;
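One caveat the answer above doesn't mention: INDEX does a plain substring search, so a value like Bucket1 would also match inside a longer value such as Bucket10. That can't happen with the sample data, but for arbitrary data a delimiter-aware check such as FINDW is safer. A sketch of that variant:
data Data_PreFinal;
set work.reasons;
by Number;
length Changes $4000;
retain Changes;
if first.Number then call missing(Changes);
* FINDW matches whole words delimited by comma or blank, so Bucket1 cannot match inside Bucket10;
if not findw(Changes,trim(EndoReason),', ') then call catx(', ', Changes, EndoReason);
if last.Number then output;
run;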

Friend, maybe it could be useful to delete the duplicate observations first. For example:
data reasons;
input Number EndoReason : $30.;
datalines;
1 Bucket1
1 Bucket2
1 Bucket1
1 Bucket3
1 Bucket2
1 Bucket2
2 Bucket2
2 Bucket2
2 Bucket1
2 Bucket2
;
*First, eliminate the duplicate records;
proc sort data=reasons out=reasons_nodup nodup;
by Number EndoReason;
run;
data Data_PreFinal;
set work.reasons_nodup;
by Number;
length Changes $4000.;
retain Changes;
if first.Number then Changes = EndoReason;
else Changes = catx(', ', Changes, EndoReason);
if last.Number then output;
drop EndoReason;
rename Changes = EndoReason;
run;
Good luck!

Related

SAS group by counters per variable - primary key creation

I have some data which needs to be split into 12 or so different groups; there is no key, and the order the data is in is important.
The data has a number of groups, and those groups have singular and/or nested groups within them. Each group will be split out, as the data is in a hierarchical format, so each "GROUP" then has its own format, which then all needs to be joined up on one (or many) rows.
Sample data file:
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
The hierarchy that should exist when the data is input is shown below. I am thinking there could be several tables that can be joined together later (numbers illustrate the parent/child levels):
1. Transaction [TRANS]
1.1. Meter Point [MTPNT]
1.1.1. Asset [ASSET]
1.1.1.1. Meter [METER]
1.1.1.2. Converter [CONVE]
1.1.1.3. Register Details [REGST]
1.1.1.3.1. Reading [READG]
1.1.1.4. Market Participant [MKPRT]
1.1.1.5. Name [NAME]
1.1.1.5.1. Address [ADDRS]
1.1.1.5.2. Contact Mechanism [CONTM]
1.2. Appointment [APPNT]
1.3. Name [NAME]
1.3.1. Address [ADDRS]
1.3.2. Contact Mechanism [CONTM]
1.4. Market Participant [MKPRT]
This is industry gas data, so in this flow you can have many ASSET records per MTPNT, and those ASSET records can in turn have many REGST records, because this is where the meter reading (READG) is kept.
I have tried using BY groups and iterative first. processing, but I have not worked with this type of data before. I need a way to create a key per grouping so that, once the data is split up and the fields are defined, it can be joined back together.
I have tried manipulating the infile so that all the data appears on one line per TRANS, but then I still have the issue of applying the fields, and ordering is paramount.
I have managed to get a few keys for some of the groups, but after splitting they don't quite join back together.
data TRANS;
set mpancreate_a;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKey+1;
end;
run;
data TRANS;
set TRANS;
TRANSKey2 + 1;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKEY2=1;
end;
run;
data MTPNT;
set TRANS;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "MTPNT" then
MTPNTKEY+1;
end;
run;
data MTPNT;
set MTPNT;
by MTPNTKEY NOTSORTED;
if first.MTPNTKEY and DataItmGrp = "MTPNT" then
MTPNTKEY2=0;
MTPNTKEY2+1;
run;
data ASSET;
set MTPNT;
IF MTPNTKEY = 0 THEN
MTPNTKEY2=0;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "ASSET" then
ASSETKEY+1;
end;
run;
data ASSET;
set ASSET;
by ASSETKEY NOTSORTED;
if first.ASSETKEY and DataItmGrp = "ASSET" then
ASSETKEY2=0;
ASSETKEY2+1;
IF ASSETKEY =0 THEN
ASSETKEY2=0;
run;
I want a counter for each group found, and a retained counter for that particular group, but I cannot work out how to get in and out of the groupings based on the hierarchy above.
I'm hoping that once I have these keys, I can split the data by group and then left join it back together:
Group _n_ TRANS TRANS2 MTPNT MTPNT2
TRANS 1 1 0 0 0
MTPNT 2 2 1 1 1
ASSET 3 3 1 2 1
METER 4 4 1 3 1
READG 5 5 1 4 1
MTPNT 6 6 1 1 2
ASSET 7 7 1 2 2
METER 8 8 1 3 2
READG 9 9 1 4 2
APPNT 10 10 1 5 2
TRANS 11 1 2 6 2
MTPNT 12 2 2 1 3
ASSET 13 3 2 2 3
METER 14 4 2 3 3
READG 15 5 2 4 3
MTPNT 16 6 2 1 4
ASSET 17 7 2 2 4
METER 18 8 2 3 4
READG 19 9 2 4 4
APPNT 20 10 2 5 4
The input of hierarchical data from a data file that has no definitive markers is problematic. The best suggestion I have is to understand what the salient values you want to extract are, and in what context you want to know them. For this problem, the simplest first approach would be a single monolithic table with categorical variables that capture the path descending to the salient value (the meter reading).
A more complex approach would be to let the first token in each line drive both the input for that line and the output table it belongs to. Since there are no landmarks for absolute or relative position in the hierarchy (as with NAME and MKPRT), there is no 100%-confident way to place those items, and that can also affect the placement of items read in from subsequent data lines.
Depending on the true complexity of, and adherence to rules in, the real-world data, you may or may not miss the reading of some values.
Suppose there is the simpler goal of just getting the meter readings.
data want;
length tier level1-level6 $8 path $64 meterReadingString $8 dummy $1;
retain level1-level5 path;
attrib readingdate informat=yymmdd10. format=yymmdd10.;
infile cards dsd missover;
input #1 tier @; * held input - don't advance the read line yet;
if tier="TRANS" then do;
level1 = tier;
call missing (of level2-level6);
path = catx("/", of level:);
end;
if tier="MTPNT" and path="TRANS" then do;
level2 = tier;
call missing (of level3-level6);
path = catx("/", of level:);
end;
if tier="ASSET" and path="TRANS/MTPNT" then do;
level3 = tier;
call missing (of level4-level6);
path = catx("/", of level:);
end;
if tier="METER" and path="TRANS/MTPNT/ASSET" then do;
level4 = tier;
call missing (of level5-level6);
path = catx("/", of level:);
end;
if tier="REGST" and path="TRANS/MTPNT/ASSET/METER" then do;
level5 = tier;
call missing (of level6-level6);
path = catx("/", of level:);
end;
if tier="READG" and path="TRANS/MTPNT/ASSET/METER/REGST" then do;
level6 = tier;
path = catx("/", of level:);
input #1 tier readingdate dummy meterReadingString; * reread the line according to the tier;
meterReading = input(meterReadingString, best12.);
if path = "TRANS/MTPNT/ASSET/METER/REGST/READG" then OUTPUT;
end;
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
run;
You can use this as the basis of a more complicated reader that has a different output <tier> data set for each tier or path to tier encountered. You would need a different input statement per tier, similar to how READG is read.
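For example, here is a pared-down sketch of a tier-specific reader for METER, following the same held-input pattern (the field names after tier are illustrative assumptions, not from a real file spec):
data meter;
length tier $8 fillerA unitCode meterType fillerB fillerC statusCode $8;
attrib installDate informat=yymmdd10. format=yymmdd10.;
infile cards dsd missover;
input #1 tier @; * held input - peek at the tier marker;
if tier="METER" then do;
* reread the held line with the METER layout (names assumed for illustration);
input #1 tier fillerA unitCode meterType meterSize fillerB fillerC installDate statusCode;
output;
end;
datalines;
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
;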

Custom spss table with missing value info and number of categories

Complete SPSS beginner here. I am really lost trying to come up with a custom table. I have a few variables and I want the final table to look like:
Var_name N_valid N_missing N_categories Max_%_category
Var1 X Y Z W
Var2 A B C D
By Max_%_category I mean the percentage of the value that is repeated the most. So for this example data:
data list free / Var1 to Var4 (4F1.0).
begin data
1 0 1 0
1 1 0 0
0 0 0 1
2 1 3 1
. . 2 .
end data.
It would be:
Var_name N_valid N_missing N_categories Max_%_category
Var1 4 1 3 50%
Var3 5 0 4 40%
Is CTABLES the route? I couldn't find how to count N_valid and N_missing easily. The FREQUENCIES command sort of works, but I don't know how to create only the first table with the missing info.
Someone could probably help you with custom tables, but not being a big fan myself, here's a way to get the same results right in your data window:
data list free / Var1 to Var4 (4F1.0).
begin data
1 0 1 0
1 1 0 0
0 0 0 1
2 1 3 1
. . 2 .
end data.
dataset name origData.
dataset copy tmp.
dataset activate tmp.
varstocases /make val from var1 to var4/index=var(val)/null=keep.
aggregate out=*/break var val/n=n.
if missing(val) msn=n.
if not missing(val) vld=n.
aggregate out=*/break=var/N_valid N_missing=sum(vld msn)/N_categories=n(vld)/Max_category_N=max(vld).
compute Max_category_P=Max_category_N/N_valid.
dataset name tab1.
*you can add a bit of formatting and corrections:.
compute Max_category_P=Max_category_P*100.
FORMATS Max_category_P (PCT40.1).
recode N_missing (miss=0).
exe.
*now you can return to the original data to start over with a new analysis.
dataset activate origData.

Create new variable based on column value in current row as well as previous row?

Given the data set:
Year var1
0 .56
1 .39
2 .28
3 .15
4 .09
How would you create a new column, var2 = var1[n] / var1[n-1], and have var2[1] equal var1[1]? The desired output would look like this:
Year var1 var2
0 .56 .56
1 .39 .70
2 .28 .72
3 .15 .54
4 .09 .60
I know how to use the RETAIN statement for creating a cumulative variable but I'm not sure how to use it for this particular task. Any insight you might have is greatly appreciated.
It is easy enough to do with the LAG() function. Just remember to always run the LAG() function on every observation. So first calculate VAR2, and then, if it is the first observation, overwrite the value with your required initial value.
data want ;
input year var1 ;
var2 = divide(var1,lag(var1));
if _n_=1 then var2=var1;
cards;
0 .56
1 .39
2 .28
3 .15
4 .09
;
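To see why the unconditional call matters, here is a small sketch (not from the answer above) of the common pitfall: the lag queue is only updated when LAG() actually executes, so a conditional call returns a missing (or stale) "previous" value.
data wrong ;
input year var1 ;
* LAG only executes from the second row on, so its queue is empty at row 2;
* and var2 there becomes missing instead of the intended .70;
if _n_ > 1 then var2 = divide(var1,lag(var1));
else var2 = var1;
cards;
0 .56
1 .39
2 .28
;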

Setting *most* variables to missing, while preserving the contents of a select few

I have a dataset like this (but with several hundred vars):
id q1 g7 q3 b2 zz gl az tre
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
I'd like to keep id, b2, and tre, but set everything else to missing. In a dataset this small, I can easily use call missing (q1, g7, q3, zz, gl, az) - but in a set with many more variables, I would effectively like to say call missing (of _ALL_ *except ID, b2, tre*).
Obviously, SAS can't read my mind. I've considered workarounds that involve another data step or proc sql where I copy the original variables to a new ds and merge them back on post, but I'm trying to find a more elegant solution.
This technique uses an un-executed SET statement (compile-time function only) to define all the variables from the original data set. It keeps the order and all the variable attributes: type, labels, formats, etc. Basically it sets all the variables to missing. The next SET statement, which will execute, brings in only the variables that are NOT to be set to missing. It doesn't explicitly set variables to missing, but achieves the same result.
data nomiss;
input id q1 g7 q3 b2 zz gl az tre;
cards;
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
;;;;
run;
proc print;
run;
data manymiss;
if 0 then set nomiss;
set nomiss(keep=id b2 tre:);
run;
proc print;
run;
Another fairly simple option is to set them missing using a macro, and basic code writing techniques.
For example, let's say we have a macro:
%macro call_missing(var=);
call missing(&var.);
%mend call_missing;
Now we can write a query that uses dictionary.columns to identify the variables we want set to missing:
proc sql;
select name
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
Now, we can combine these two things to get a macro variable containing code we want, and use that:
proc sql;
select cats('%call_missing(var=',name ,')')
into :misslist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
data want;
set have;
&misslist.;
run;
This has the advantage that it doesn't care about the variable types, nor the order. It has the disadvantage that it's somewhat more code, but it shouldn't be particularly long.
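With the sample data above (keeping ID, B2, and TRE), the generated macro variable would come out roughly like this, which the data step then expands into one CALL MISSING statement per variable:
%call_missing(var=q1) %call_missing(var=g7) %call_missing(var=q3) %call_missing(var=zz) %call_missing(var=gl) %call_missing(var=az)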
If the variables are all of the same type (numeric or character) then you could use an array.
data want ;
set have;
array _all_ _numeric_ ;
do over _all_;
if upcase(vname(_all_)) not in ('ID','B2','TRE') then _all_=.;
end;
run;
If you don't care about the order then just drop the variables and add them back on with 0 observations.
data want;
set have (keep=ID B2 TRE:) have (obs=0 drop=ID B2 TRE:);
run;

Why many to many merge doesn't do cartesian product

data jul11.merge11;
input month sales ;
datalines ;
1 3123
1 1234
2 7482
2 8912
3 1284
;
run;
data jul11.merge22;
input month goal ;
datalines;
1 4444
1 5555
1 8989
2 9099
2 8888
3 8989
;
run;
data jul11.merge1;
merge jul11.merge11 jul11.merge22 ;
by month;
difference =goal - sales ;
run;
proc print data=jul11.merge1 noobs;
run;
output:
month sales goal difference
1 3123 4444 1321
1 1234 5555 4321
1 1234 8989 7755
2 7482 9099 1617
2 8912 8888 -24
3 1284 8989 7705
Why didn't it match every observation in table 1 with every observation in table 2 for the common months?
The PDV retains an observation's data while it checks whether any more observations are left for that particular BY group before it reinitialises, so in that case it should have done a cartesian product.
It gives a perfect cartesian product for one-to-many merging, but not for many-to-many.
This is because of how SAS processes the data step. A merge is never a true cartesian product (i.e., all records searched and matched up against all other records, the way a SQL comma join might work); what SAS does (in the case of two datasets) is follow down one dataset (the one on the left) and advance to the next by-group value; then it looks at the right dataset and advances until it gets to that by-group value. If there are other records in between, it processes those singly. If there are not, but there is a match, then it matches up those records.
Then it looks on the left to see if there are any more records in that by group, and if so, advances to the next. It does the same on the right. If only one of the two has another record, it will only bring in those values; hence if it has 1 element on the left and 5 on the right, it will do 1x5, or 5 rows. However, if there are 2 on the left and 3 on the right, it won't do 2x3=6; it does 1:1, 2:2, and 2:3, because it's advancing the record pointers sequentially.
The following example is a good way to see how this works. If you really want to see it in action, throw in the data step debugger and play around with it interactively.
data test1;
input x row1;
datalines;
1 1
1 2
1 3
1 4
2 1
2 2
2 3
3 1
;;;;
run;
data test2;
input x row2;
datalines;
1 1
1 2
1 3
2 1
3 1
3 2
3 3
;;;;
run;
data test_merge;
merge test1 test2;
by x;
put x= row1= row2=;
run;
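For reference, here is the pairing that merge produces with this data, which follows from the sequential pointer-advancing described above (each BY group yields max(left, right) rows rather than left x right):
x row1 row2
1 1 1
1 2 2
1 3 3
1 4 3
2 1 1
2 2 1
2 3 1
3 1 1
3 1 2
3 1 3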
If you do want to do a cartesian join in SAS datastep, you have to do nested SET statements.
data want;
set test1;
do _n_ = 1 to nobs_2;
set test2 point=_n_ nobs=nobs_2;
output;
end;
run;
That's the true cartesian; you can then test for by-group equality, but that's messy, really. You could also use a hash table lookup, which works better with BY groups; a sketch follows. There are a few other options as well.
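A sketch of that hash approach using the test1/test2 data from above (want_hash and rc are just illustrative names): the MULTIDATA option lets the hash hold duplicate keys, and FIND_NEXT walks through them, producing the cartesian product within each BY value.
data want_hash;
if _n_ = 1 then do;
declare hash h(dataset:"test2", multidata:"y");
h.defineKey("x");
h.defineData("row2");
h.defineDone();
end;
set test1;
if 0 then set test2; * defines row2 in the PDV at compile time;
rc = h.find(); * first entry for this x, if any;
do while (rc = 0);
output;
rc = h.find_next(); * remaining entries with the same key;
end;
drop rc;
run;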
SAS doesn't handle many-to-many merges very well within the data step. You need to use PROC SQL if you want a true many-to-many merge.
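A minimal sketch with the question's data (merge_sql is just an illustrative output name); the inner join on month produces the full per-month pairing: 2x3 = 6 rows for month 1, 2x2 = 4 for month 2, and 1x1 = 1 for month 3.
proc sql;
create table jul11.merge_sql as
select a.month, a.sales, b.goal, b.goal - a.sales as difference
from jul11.merge11 as a
inner join jul11.merge22 as b
on a.month = b.month;
quit;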