SAS: How to vertically join multiple datasets where a variable is numeric in one dataset and character in the other

This is complicated a little because we're using a pipeline with a filelist to compile the data, so there are 50+ datasets coming in. I need to combine all of these datasets vertically, but var2 is numeric in some and character in others. Var2 is not important, so we can drop it, but when I try to drop it in the data step, it throws an error because of the differing data types. More details below.
Here's what I want to do at its most basic:
data in1;
input var1 $ var2;
datalines;
a 1
b 2
;
data in2;
input var1 $ var2 $;
datalines;
a 1a
b 2b
;
data newdd;
set in1 in2;
run;
Is it possible to combine these datasets in the "data newdd" step without changing the inputs? Is there a way to drop var2 in this data step in a way that will still let it merge var1 and not throw an error? Or better yet, can I make var2 read in as character in all cases?

To drop var2, use the DROP= data set option so the variable is removed before the datasets are combined:
data newdd;
set in1(drop=var2) in2(drop=var2);
run;
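To combine them and ensure var2 is character in every case:
data newdd;
set in1(in=t1 rename=(var2=_var2)) in2(in=t2);
* rows from in1 carry the numeric value in _var2, so convert it to left-aligned text ;
if t1 then var2 = put(_var2, 8. -l);
drop _var2;
run;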
Long answer: you need to fix how you read in the files, all 50 of them from source, so that they're consistent. You can have SAS generate the correct types/INPUT statement if you have a master list of variables with their type, length, format and informat (type at minimum).
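If re-reading the source files isn't practical, a stopgap is to normalise each dataset before the final SET. This is only a minimal sketch: the macro name force_char, its parameters, the _num_ prefix and the default width of 8 are all made up for illustration, and it assumes every numeric var2 fits in 8 characters.

```sas
/* Hypothetical helper: rewrite data set &ds so that &var is character. */
%macro force_char(ds, var, len=8);
  %local dsid vnum vtype rc;
  %let dsid = %sysfunc(open(&ds));
  %if &dsid = 0 %then %do;
    %put ERROR: could not open &ds..;
    %return;
  %end;
  %let vnum = %sysfunc(varnum(&dsid, &var));
  /* VARTYPE returns N for numeric and C for character variables. */
  %if &vnum > 0 %then %let vtype = %sysfunc(vartype(&dsid, &vnum));
  %let rc = %sysfunc(close(&dsid));

  /* Only numeric versions need converting; character ones are left alone. */
  %if &vtype = N %then %do;
    data &ds;
      set &ds(rename=(&var=_num_&var));
      length &var $&len;
      &var = left(put(_num_&var, best&len..));
      drop _num_&var;
    run;
  %end;
%mend force_char;

%force_char(in1, var2)
%force_char(in2, var2)

data newdd;
  set in1 in2;
run;
```

With a filelist you would call the macro once per dataset before stacking them. Note that the length of var2 in the combined dataset comes from the first dataset in the SET statement, so the len= value should be at least as long as the longest character var2.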

Related

Outliers for all numerical values to mean SAS

I am working in SAS with a dataset with a lot of numeric values which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there any easy way to deal with outliers (beyond +/- 3 standard deviations) for all numeric values? Ideally I would want to change all of those to + or - 3 standard deviations, or in the worst case remove them.
You have to run through the data twice: once to standardise (which PROC STANDARD has already done) and once to adjust the values. There are many ways you can adjust your output. Here's a simple way using a data step.
Assuming your dataset has a standardized variable called 'test':
Data adjusted;
set df;
if test > 3 then test=3;
if test < -3 then test =-3;
run;
Just remember that your new dataset will no longer have a mean of 0 and a standard deviation of 1.
No sample data was provided, so I generated 5 random variables with an N(0,2) distribution to demonstrate handling outliers as if the data were N(0,1).
If you have multiple columns to handle, you could create a macro or just loop through an array.
DATA have;
INPUT var1 var2 var3 var4 var5;
DATALINES;
-0.8458048655231136 -2.1737985573160485 -2.122482432573275 1.8746296707673902 -2.799009287469253
-1.9927731684115295 1.8230096873238637 0.5964656531490122 -1.6465532407305106 3.9430012045284184
0.0294083016125659 1.3877418982525658 -1.3398372120124733 -0.8195179339297752 4.742490300459201
-0.5215716306745832 -3.35412129416837 1.1558155344985737 -1.0073681302151822 2.425914724408619
-2.817574234024364 3.5161858163738424 -2.1822454739704744 0.060674570200235534 0.25898913069677443
-3.941905381717187 4.969013776451821 2.021891632999466 -1.1526212617289868 1.2864391876960568
;
run;
* variable of all columns to remove outliers from ;
%LET column_names=var1 var2 var3 var4 var5;
DATA want;
SET have;
ARRAY columns {*} &column_names.;
DO i=1 to dim(columns);
if columns[i]>3 then columns[i]=3;
if columns[i]<-3 then columns[i]=-3;
END;
DROP i;
RUN;
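If the goal really is every numeric variable, the explicit list can be swapped for the _NUMERIC_ keyword. A minimal sketch, assuming the have dataset above; the names want_all, nums and _i are just placeholders:

```sas
DATA want_all;
  SET have;
  /* _NUMERIC_ picks up every numeric variable defined so far,
     so the loop counter _i (created below) is not part of the array. */
  ARRAY nums {*} _NUMERIC_;
  DO _i = 1 TO DIM(nums);
    /* MIN and MAX ignore missing arguments, so guard against missing
       values; otherwise a missing value would be replaced by a limit. */
    IF NOT MISSING(nums[_i]) THEN nums[_i] = MAX(-3, MIN(3, nums[_i]));
  END;
  DROP _i;
RUN;
```

This caps ID-like numeric columns too, so it only makes sense when every numeric variable in the dataset has been standardised.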

Union tables with different columns

I have two large tables (~1GB each) with many different columns on which I want to perform a union all in sas.
Currently, I use the following method with proc sql and union all.
SELECT A, B, '' as C from Table_1
UNION ALL
SELECT '' as A, B, C from Table_2
However, this is not ideal, as I have dozens of columns in both tables and I am constantly adding to them. Therefore, I am looking for a way to automatically create the blank columns without having to explicitly write them out.
I also tried the following query:
select * from
(select * from Table_1),
(select * from Table_2)
However, this seems very computationally intensive and takes forever to run.
Are there any better ways to do this? I am also open to using data set instead of proc sql;
A simple data step should do the trick:
data result_tab;
set Table_1 Table_2;
run;
This reads both tables in sequence: records from Table_2 are appended after those from Table_1 in result_tab. The SET statement declares the variables from both input tables, so a variable that exists in only one table is missing on the rows that come from the other.
Unfortunately, PROC SQL does require all datasets to have the same variables when using UNION. If you can use a DATA step with SET followed by PROC SORT NODUPKEY, that would be simplest (though maybe not the most efficient). To use PROC SQL, you need to assign NULL values to the missing variables. For example:
data dset1;
input var1 var2;
datalines;
1 2
2 2
3 2
;
run;
data dset2;
input var1 var3;
datalines;
4 1
5 1
6 1
;
run;
PROC SQL;
CREATE TABLE dset3 AS
SELECT var1, var2, . AS var3 FROM dset1
UNION
SELECT var1, . AS var2, var3 FROM dset2;
QUIT;
PROC PRINT DATA=dset3; RUN;
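If you want to stay in PROC SQL without typing out the blank columns, OUTER UNION with the CORR (CORRESPONDING) keyword is worth a try. A minimal sketch using the dset1 and dset2 datasets above; the output name dset_all is just a placeholder:

```sas
PROC SQL;
  /* OUTER UNION stacks the rows of both queries; CORR lines up
     columns that share a name, and columns found in only one
     table are left missing on rows from the other table. */
  CREATE TABLE dset_all AS
  SELECT * FROM dset1
  OUTER UNION CORR
  SELECT * FROM dset2;
QUIT;
```

Unlike UNION, this keeps duplicate rows, so it behaves more like UNION ALL or a plain SET statement.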

How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically adds an extra (empty) byte to every observation of every string variable. For example, in this data set:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
run;
proc export data = test1
file = "test1.dta"
dbms = stata replace;
run;
the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands their data types to str2, str4, and str2 in the exported Stata file.
This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.
Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.
This is how you can do it using existing functions.
filename FT41F001 temp;
data _null_;
file FT41F001;
set test1;
put 256*' ' @;
__s=1;
do while(1);
length __name $32.;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
substr(_FILE_,__s) = vvaluex(__name);
putlog _all_;
__s = sum(__s,vformatwx(__name));
end;
_file_ = trim(_file_);
put;
format month f6.;
run;
To avoid the use of _FILE_:
data _null_;
file FT41F001;
set test1;
__s=1;
do while(1);
length __name $32. __value $128 __w 8;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
__value = vvaluex(__name);
__w = vformatwx(__name);
put __value $varying128. __w @;
end;
put;
format month f6.;
run;
If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD X
B 199912 D C
;
run;
data _null_;
file "/folders/myfolders/test.txt";
set test1;
put @;
_FILE_ = cat(of _all_);
put;
run;
/* Print contents of the file to the log (for debugging only)*/
data _null_;
infile "/folders/myfolders/test.txt";
input;
put _infile_;
run;
This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the cat function in the data step environment; the lower 200-character limit doesn't apply, as that only matters when you use cat to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to cat together only a limited number of variables at a time. That is a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never actually come up.
Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).

In SAS, is there a faster way to create an empty variable if it doesn't exist?

Currently I'm using a method similar to that used in a previous question,
Test if a variable exists
but with a small modification to make it able to handle larger numbers of variables more easily. The following code ensures that n6 has the same variables as the data set referenced by dsid2.
data n6;
set n5;
dsid=open('n5');
dsid2=open(/*empty template dataset*/);
varsn=attrn(dsid2,nvars);
i=1;
do until i = varsn;
if varnum(dsid,varname(dsid2,i))=0 then do;
varname(dsid2,i)="";
format varname(dsid2,i) varfmt(dsid2,i);
end;
i=i+1;
end;
run;
If I understand correctly, SAS will run through the entire do loop for each observation. I'm beginning to experience slow run times as I begin to use larger data sets, and I was wondering if anyone has a better technique?
If possible, the simplest approach is to apply your regular logic to your new dataset. Worry about matching the variables later. When you are done with processing you can create an empty version of the template dataset like this:
data empty;
set template(obs=0);
run;
and then merge empty and your new dataset:
data template;
input var1 var2 var3;
datalines;
7 2 2
5 5 3
7 2 7
;
data empty;
set template(obs=0);
run;
data todo;
input var1 var2;
datalines;
1 2
;
data merged;
merge todo empty;
run;
In this example the merged dataset will contain var3, with a missing value for every observation.
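Another common way to get the same effect in a single step is to put the template in a SET statement that never executes. A minimal sketch, reusing the template and todo datasets above; the output name n6 is borrowed from the question:

```sas
data n6;
  /* The condition is never true, so no rows are read from template.
     The SET statement is still compiled, which adds template's
     variables (with their types, lengths and formats) to the output. */
  if 0 then set template;
  set todo;
run;
```

The work happens once at compile time rather than once per observation, so it should not slow down as the data grows. Variables that exist only in template stay missing on every row, and the template's variables come first in the column order.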

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!
SQL (using the sample data from the Data Step code):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;
An alternative would be to use proc sql with a join on substr(), just as explained above, if you are comfortable with SQL.
Assuming that your "subject_id" variable is a number, the substr function won't work as expected: SAS will convert the number to a string automatically, but by default it pads the result with spaces on the left, so the character positions shift.
You can use the modulus function mod(input, base) which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.
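If you would rather keep the substr logic, or need to keep leading zeros in the 3-digit key, you can convert the number to text yourself with the Z9. format first. A minimal sketch, assuming subject_id is a numeric 9-digit code; the dataset name have1_num and the sample values are made up:

```sas
data have1_num;
  input subject_id;
  length id $3;
  /* Z9. writes the number as exactly 9 characters, zero-padded on the
     left, so positions 4-6 always hold the middle three digits. */
  id = substr(put(subject_id, z9.), 4, 3);
  datalines;
123001789
123002789
;
run;
```

Here id is character ('001', '002'), whereas the mod approach gives a numeric id (1, 2). Whichever you pick, use the same one in both datasets so the BY variable has the same type when you merge.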