Adding blank columns between two variables in a sas dataset

I want to add a blank column between two variables in a dataset. Each of the adjacent columns has 26 observations, so the column I insert between them should have 26 blank observations. Currently, my dataset looks like this:
Variable names: A B C D
observations: 1 2 3 4
5 6 7 8
I want to add a column between B and C. The new dataset should look like this:
Variable names: A B (blank) C D
Observations: 1 2 (blank) 3 4
Is it possible to add blank columns with a specific number of observations in SAS? Any help with this issue would be appreciated.

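One simple way is to read the old dataset in pieces using multiple SET statements with the KEEP= dataset option. So if your input dataset has variables A, B, C, D in that order, you can insert a new variable after B using code like this:
data want;
  /* Read only A through B first so they lead the variable order */
  set have(keep=a -- b);
  /* Define the new blank character variable next */
  length new1 $10;
  /* Reading the same dataset again appends the remaining variables (C, D) */
  set have;
run;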
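
As an alternative sketch (not part of the original answer), the same column order can be requested explicitly in PROC SQL by listing the columns in the desired sequence. The dataset name want_sql and the two sample rows are assumptions for illustration; new1 and its length of 10 follow the answer above.

```sas
/* Sample data matching the question's layout */
data have;
  input a b c d;
  datalines;
1 2 3 4
5 6 7 8
;
run;

/* List the columns in the desired order; new1 is a blank character column */
proc sql;
  create table want_sql as
  select a, b, ' ' as new1 length=10, c, d
  from have;
quit;

/* Confirm the column order; VARNUM lists variables by position */
proc contents data=want_sql varnum;
run;
```

This version requires naming every column, so the multiple-SET approach is usually more convenient when the dataset is wide.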

Related

How to convert a SAS data set to a data step

How can I convert my SAS data set into a data step that I can easily paste into the forum or hand over to someone so they can replicate my data? Ideally, I'd also like to be able to control the number of records that are included.
I.e., I have the class data set in the SASHELP library, but I want to provide it here so others can use it as the starting point for my question.

To do this, you can use a macro written by Mark Jordan at SAS; the code is stored on GitHub as well.
You need to provide the data set name, including the library, and the number of observations you want to output. It takes them in order. The code will then appear in your SAS log.
*data set you want to create demo data for;
%let dataSetName = sashelp.Class;
*number of observations you want to keep;
%let obsKeep = 5;
******************************************************
DO NOT CHANGE ANYTHING BELOW THIS LINE
******************************************************;
%let source_path = https://gist.githubusercontent.com/statgeek/bcc55940dd825a13b9c8ca40a904cba9/raw/865d2cf18f5150b8e887218dde0fc3951d0ff15b/data2datastep.sas;
filename reprex url "&source_path";
%include reprex;
filename reprex;
option linesize=max;
%data2datastep(dsn=&dataSetName, obs=&obsKeep);
This may not work if you do not have access to the GitHub page. In that case, you can manually navigate to the page (same link) and copy/paste the macro into SAS. Then run the program, followed by only the last step, %data2datastep(dsn=, obs=);

This topic came up recently on SAS Communities and I created a somewhat more robust macro than the one Reeza linked. You can see it on GitHub: ds2post.sas
* Pull macro definition from GITHUB ;
filename ds2post url
'https://raw.githubusercontent.com/sasutils/macros/master/ds2post.sas'
;
%include ds2post ;
For example if you wanted to share the first 5 observations of SASHELP.CARS you would run this macro call:
%ds2post(sashelp.cars,obs=5)
Which would generate this code to the SAS log:
data work.cars (label='2004 Car Data');
infile datalines dsd dlm='|' truncover;
input Make :$13. Model :$40. Type :$8. Origin :$6. DriveTrain :$5.
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
Weight Wheelbase Length
;
format MSRP dollar8. Invoice dollar8. ;
label EngineSize='Engine Size (L)' MPG_City='MPG (City)'
MPG_Highway='MPG (Highway)' Weight='Weight (LBS)'
Wheelbase='Wheelbase (IN)' Length='Length (IN)'
;
datalines4;
Acura|MDX|SUV|Asia|All|36945|33337|3.5|6|265|17|23|4451|106|189
Acura|RSX Type S 2dr|Sedan|Asia|Front|23820|21761|2|4|200|24|31|2778|101|172
Acura|TSX 4dr|Sedan|Asia|Front|26990|24647|2.4|4|200|22|29|3230|105|183
Acura|TL 4dr|Sedan|Asia|Front|33195|30299|3.2|6|270|20|28|3575|108|186
Acura|3.5 RL 4dr|Sedan|Asia|Front|43755|39014|3.5|6|225|18|24|3880|115|197
;;;;
Try this little test to compare the two macros.
First make a sample dataset with a couple of issues.
data testit;
set sashelp.class (obs=5);
if _n_=1 then name='Le Bron';
if _n_=2 then age=.;
if _n_=3 then wt=.;
if _n_=4 then name='12;34';
run;
Then run both macros to dump code to the SAS log.
%ds2post(testit);
%data2datastep(dsn=testit,obs=20);
Copy the code from the log, changing the names in the DATA statements so the new datasets do not overwrite the original dataset or each other. Run them and compare the results to the original.
proc compare data=testit compare=testit1; run;
proc compare data=testit compare=testit2; run;
Result using %DS2POST:
The COMPARE Procedure
Comparison of WORK.TESTIT with WORK.TESTIT1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT1 02NOV18:17:10:29 02NOV18:17:10:29 6 5
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
Last Obs 5 5
Number of Observations in Common: 5.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
Summary of results using %Data2DataStep:
Comparison of WORK.TESTIT with WORK.TESTIT2
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT2 02NOV18:17:10:29 02NOV18:17:10:29 6 3
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
First Unequal 1 1
Last Unequal 3 3
Last Match 3 3
Last Obs 5 .
Number of Observations in Common: 3.
Number of Observations in WORK.TESTIT but not in WORK.TESTIT2: 2.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT2: 3.
Number of Observations with Some Compared Variables Unequal: 3.
Number of Observations with All Compared Variables Equal: 0.
Variable Values Summary
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 4.
Total Number of Values which Compare Unequal: 12.
Maximum Difference: 0.
Variables with Unequal Values
Variable Type Len Ndif MaxDif MissDif
Name CHAR 8 1 0
Sex CHAR 1 3 3
Age NUM 8 2 0 2
Height NUM 8 3 0 3
Weight NUM 8 3 0 3
Note that I am sure there are values that will cause trouble for my macro as well, but hopefully they arise in data that is less likely to occur than spaces or semicolons.

Row-wise operation for subset of columns

I have the following data:
data df;
input id $ d1 d2 d3;
datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
set df;
total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
set df_total;
where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes that there's only one "primary-key" column (id, in this case) and everything else is numeric and should be considered as a part of this sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply drop all of them, writing all of that out. I know the columns for which I want to run this "test" all are prefixed with d (and more specifically, match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?

Use a different shortcut reference. Since you know the names all start with D, use the D: prefix list:
total = sum( of D:);
if n(of D:) = 0 then delete;
This sums all numeric variables whose names start with D. If you have variables that start with D but should be excluded, that's problematic.
Since the variables are numeric, you can also use the N() function instead, which counts the non-missing values in the row. In general though, SAS handles missing values automatically in most PROCs such as REG/GLM (not in a data step, obviously).
If that doesn't work for some reason, you can query the list of variables from the SASHELP.VCOLUMN view.
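proc sql noprint;
  /* NAME keeps its original case, so upcase it before matching the prefix */
  select name into :var_list separated by ", " from sashelp.vcolumn
  where libname='WORK' and memname='DF' and upcase(name) like 'D%' and type='num';
quit;
data df_final;
  set df;
  /* delete rows where every listed variable is missing */
  if n(&var_list.)=0 then delete;
run;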
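
For reference, here is a minimal sketch that applies the prefix-list suggestion to the sample data from the question. It assumes the only variables starting with d are the numeric d1-d3; the output name df_final is taken from the question.

```sas
data df;
  input id $ d1 d2 d3;
  datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;

data df_final;
  set df;
  /* N() counts non-missing numeric arguments; 0 means d1-d3 are all missing */
  if n(of d:) = 0 then delete;
run;

proc print data=df_final noobs;
run;
```

Rows a and c should remain, matching the output shown in the question, and no helper total column is needed.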

SAS comparing data in a column

I'm very new to SAS and I'm trying to find my way around it. I'm trying to figure out how to use the COMPARE procedure. Basically, what I want to do is see whether the values in one column match the values in another column multiplied by 2, and count the number of mistakes. So if I have this data set:
a b
2 4
1 2
3 5
It should check whether b = 2 * a and tell me how many errors there are. I've been reading through the documentation for the COMPARE procedure, but as I said, I'm very new and I can't figure out how to check for this.

You could do it with PROC COMPARE, but you still need to compute 2*a and you can't do that with PROC COMPARE. I would create a FLAG and summarize the FLAG. The IFN function returns 1 when the values are NOT equal. PROC MEANS then summarizes the flag: the mean is the proportion and the sum is the count of non-matching rows.
data comp;
input a b;
flag = ifn(b NE 2*a,1,0);
cards;
2 4
1 2
3 5
;;;;
run;
proc means n mean sum;
var flag;
run;

PROC COMPARE is designed primarily for comparing values in two different datasets, whereas your variables are both in one dataset. The following may be simplest:
data matches errors;
set temp;
if b = 2 * a then output matches;
else output errors;
run;
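
If only the count of errors is needed, a single query can produce it. This is a hedged sketch rather than part of either answer: the dataset name temp follows the step above, and the sample rows are the ones from the question.

```sas
data temp;
  input a b;
  datalines;
2 4
1 2
3 5
;
run;

proc sql;
  /* A comparison evaluates to 1 (true) or 0 (false), so SUM counts the mismatches */
  select sum(b ne 2*a) as n_errors
  from temp;
quit;
```

With the three sample rows, only the last one (3, 5) fails the check, so n_errors should be 1.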

Remove Variables that have too many invalid/missing values

Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example, the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of the data points in there are "#N/A"s, say more than 50%, more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop. The reason is that I do not want to use the variable names in the DROP statement. Instead, I want it done in a loop where I can reference the variable names with the loop index "i". I tried to use the array "myVars(i)", but it doesn't seem to work with the DROP statement.

My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf

In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get a frequency table, then look at that dataset to find the rows with a percent above 50 and an invalid value in the F_ column.
This won't work with actual missing (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
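
The question also asked for a configurable threshold and a configurable list of invalid values. The following is a sketch of one way to extend the check, assuming the freq_tables dataset from the PROC FREQ step above already exists. The macro variable names threshold and bad_values and the dataset bad_rows are hypothetical, chosen only for illustration.

```sas
%let threshold  = 50;                           /* percent cutoff */
%let bad_values = '#N/A' 'N.A.' 'Invalid Obs';  /* values treated as invalid */
%let droplist   = ;                             /* stays empty if nothing qualifies */

data bad_rows;
  set freq_tables;
  length varname $32;
  /* keep only frequency rows whose formatted value is in the invalid list */
  if strip(coalescec(of f_:)) in (&bad_values);
  varname = substr(table, 7);
  keep varname percent;
run;

proc sql noprint;
  /* add up the percents of all invalid values per variable */
  select varname into :droplist separated by ' '
  from bad_rows
  group by varname
  having sum(percent) > &threshold;
quit;

%put NOTE: variables to drop: &droplist;
```

With the sample data, gender (7 of 11 values are #N/A) should be the only variable selected. Because droplist can be empty when no variable exceeds the threshold, it is worth checking its value before using it in a DROP statement.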

merging all columns in sas dataset who has column "shiyas" in header

I have a SAS dataset with columns shiyas1, shiyas2, shiyas3 in it. The dataset has some other columns too. I want to combine all the columns whose header contains shiyas.
We can't use cats(shiyas1,shiyas2,shiyas3) because similar datasets have columns up to shiyas10. As I am writing general SAS code, we cannot hard-code cats(shiyas1,shiyas2 .... shiyas10).
So how can we do this?
When I tried to use cats(shiyas1,shiyas2 .... shiyas10), even though my dataset only has columns up to shiyas3, it created columns shiyas4 to shiyas10 filled with missing values (.).
So one solution is to combine only the shiyas columns the dataset actually has, or to delete the unnecessary shiyas columns.
Please help me.

Use a variable list.
data have;
input (shiyas1-shiyas3) (:$1.);
cards;
1 2 3
;
data want;
set have;
length cat_shiyas $ 100 /*large enough to hold the content*/
;
cat_shiyas=cats(of shiyas:);
run;

Use the OF keyword (which lets you pass a list of variables across a row, similar to arrays) with the : prefix wildcard. This will concatenate all columns whose names begin with 'shiyas':
cats(of shiyas:)
