First time writing SAS code. Trying to add a new numeric variable that has a conditional value based on another existing variable. This is what I have so far:
data dataset;
set dataset;
newnum = .;
If oldnum >= 2;
newnum = 1;
run;
I'm getting an error when I try to run this SAS Code node that is attached to the relevant data source item.
Your code has no conditional assignment because the semicolon ends the IF statement right after the condition. Move the assignment newnum = 1 in front of that semicolon and add a THEN.
data dataset;
set dataset;
newnum = .;
If oldnum >= 2 then newnum = 1;
run;
As written in the question, the statement If oldnum >= 2; is a subsetting IF: it keeps only the records that have oldnum >= 2 and then assigns newnum = 1 to each of them. The kept records do get the value you want, but the resulting dataset is only a subset of the original dataset.
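The difference between the two forms can be seen on a tiny example. This is a minimal sketch: the dataset names demo, subset_if and if_then and the five input values are made up for illustration.

```sas
/* Hypothetical five-row input; the name "demo" is made up for this sketch */
data demo;
  input oldnum;
  datalines;
0
1
2
3
.
;
run;

/* Subsetting IF: only rows with oldnum >= 2 continue and are written out */
data subset_if;
  set demo;
  if oldnum >= 2;
  newnum = 1;
run;

/* IF-THEN: every row is written; newnum stays missing unless the condition holds */
data if_then;
  set demo;
  if oldnum >= 2 then newnum = 1;
run;
```

With this input, subset_if ends up with 2 rows while if_then keeps all 5. A missing oldnum compares lower than any number, so it does not satisfy oldnum >= 2.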
I have a questionnaire with items coded 1-5, and missing values are coded as (.). How do I code the data to reflect the following:
If a patient has >=80% of the values not missing, then the missing values will be coded as the mean value of the questions answered. If a patient is missing more than 80% of the values, then set the measure summary to missing for that patient and drop the record.
data condomuse;
set int108;
run;
proc means data=condomuse n nmiss missing;
var cusesability CUSESPurchase CUSESCarry CUSESDiscuss CUSESSuggest CUSESUse CUSESMaintain CUSESEmbarrass CUSESReject CUSESUnsure CUSESConfident CUSESComfort CUSESPersuade CUSESGrace CUSESSucceed;
by Intround sid;
run;
Using the following assumptions:
each line/record is a unique person
all variables are numeric
NMISS(), N(), CMISS() and DIM() are functions that can work with arrays.
This will identify all records with 80% or more missing.
data temp; *temp is output data set name;
set have; *have is input data set name;
*create an array to avoid listing all variables later;
array vars_check(*) cusesability CUSESPurchase CUSESCarry CUSESDiscuss CUSESSuggest CUSESUse CUSESMaintain CUSESEmbarrass CUSESReject CUSESUnsure CUSESConfident CUSESComfort CUSESPersuade CUSESGrace CUSESSucceed;
*calculate percent missing;
Percent_Missing = NMISS(of vars_check(*)) / Dim(vars_check);
if percent_missing >= 0.8 then exclude = 'Y';
else exclude = 'N';
run;
To replace the missing values with the mean (or another location measure), PROC STDIZE with the REPONLY option can do that.
*temp is input data set name from previous step;
proc stdize data=temp out=temp_mean reponly method=mean;
*keep only records with less than 80% missing;
where exclude = 'N';
*list of vars to fill with mean;
VAR cusesability CUSESPurchase CUSESCarry CUSESDiscuss CUSESSuggest CUSESUse CUSESMaintain CUSESEmbarrass CUSESReject CUSESUnsure CUSESConfident CUSESComfort CUSESPersuade CUSESGrace CUSESSucceed;
run;
The METHOD= option of PROC STDIZE accepts several other methods, but note that these are standardization methods, not imputation methods.
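PROC STDIZE with REPONLY fills each gap with that variable's mean across all records. If the goal is instead the mean of the questions that the same patient answered, a row-wise DATA step is one option. This is a minimal sketch, not part of the original answer: the dataset have_demo, the items q1-q5 and the variables pct_missing and person_mean are hypothetical names chosen for illustration.

```sas
/* Hypothetical input with five questionnaire items q1-q5 */
data have_demo;
  input id q1-q5;
  datalines;
1 4 . 2 3 5
2 . . . . 1
3 1 2 3 4 5
;
run;

data row_mean;
  set have_demo;
  array items(*) q1-q5;
  /* share of items that are missing for this respondent */
  pct_missing = nmiss(of items(*)) / dim(items);
  /* drop respondents with 80% or more missing, same cutoff as the exclude flag */
  if pct_missing >= 0.8 then delete;
  /* mean of the answered items; MEAN ignores missing values */
  person_mean = mean(of items(*));
  /* fill each missing item with this respondent's own mean */
  do i = 1 to dim(items);
    if missing(items(i)) then items(i) = person_mean;
  end;
  drop i;
run;
```

In this sample, respondent 2 has 4 of 5 items missing and is dropped, and respondent 1 gets q2 = 3.5, the mean of the four answered items.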
I'm facing a problem putting data into a character variable.
So I have a long transposed dataset with three variables: date (by which I transposed beforehand), var (which has three different values, the names of my previous variables) and col1 (which holds the values of my previous variables).
Now I want to create a fourth variable which also has three different values. My problem is that I can create the variable, but with my code it always contains missing values.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" then output;
length all $ 20;
all=" ";
if var="BNNESR" then all="pdev";
if var="BNNESR_t" then all="trigger";
if var="BNNESR_r" then all="rdev";
drop var;
run;
Afterwards I want to transpose it back by the "all" variable. I know I could just rename the old vars before I transpose and then keep them.
But the complete calculation will go on and will eventually be turned into a macro, where it is not that easy to do it that way.
Your program will just subset the input data and add a new variable that is empty because you are writing the data out before you assign any value to the new variable.
Use a subsetting IF (or WHERE) statement instead of using an explicit OUTPUT statement. Once your data step has an explicit OUTPUT statement then SAS no longer automatically writes the observation at the end of the data step iteration.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" ;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
drop var;
run;
Since the list in the IF statement matches the values in the recode step then perhaps you want to just use a DELETE statement instead?
data pair2;
set data1;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
else delete;
drop var;
run;
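The effect of an explicit OUTPUT placed before the assignment can be reproduced in isolation. This is a minimal sketch with made-up names (letters, code, tag, early_output, late_assign), not the asker's data.

```sas
/* Hypothetical two-row input */
data letters;
  input code $;
  datalines;
A
B
;
run;

/* OUTPUT runs before the assignment, so each written row has a blank tag */
data early_output;
  set letters;
  length tag $8;
  output;
  tag = "set";
run;

/* No explicit OUTPUT: the row is written at the end of the iteration, after the assignment */
data late_assign;
  set letters;
  length tag $8;
  tag = "set";
run;
```

Both datasets have two rows, but tag is blank in early_output and "set" in late_assign, because tag is reset to missing at the start of each iteration and the explicit OUTPUT writes the row before the assignment runs.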
I have written this code to do the following:
read the records in the table "not_identified" one by one,
for each record, pass the "name_firstname" variable to a macro named "mCalcul_lev_D33",
the macro then calculates the Levenshtein distance between the value passed as parameter and all the values of the variable "name_firstname_in_D33" in the "data_all" table,
if the Levenshtein distance is less than or equal to 3, the record of "data_all" is copied to the "lev_D33" table.
rsubmit;
%macro mCalcul_lev_D33(theName);
data result.lev_D33;
set result.data_all;
name_LEV=complev(&theName, name_firstname_in_D33);
if name_LEV<=3 then output;
run;
%mend mCalcul_lev_D33;
endrsubmit;
rsubmit;
data _null_;
set result.not_identified;
call execute ('%mCalcul_lev_D33('||name_firstname||')');
;
run;
endrsubmit;
There are 53700000 records in "data_all". The code has been running since yesterday. Because I cannot see the result, I am asking:
Is the code doing what I want?
How should I code it if I want to write "name_firstname" (the variable passed as parameter) at the beginning of each record of "lev_D33"?
Thank you!
D.O.:
I posit your macros are making the task more difficult than it needs to be. There appears to be a coding problem: each row in not_identified will cause result.lev_D33 to be rebuilt. If your long-running program ever does finish, the lev_D33 output data set will correspond to only the last not_identified row.
You are doing a cross join (Cartesian product), comparing ALL_COUNT * NOT_IDENT_COUNT row pairs in the process.
How many rows are in not_identified? Hopefully far fewer than in data_all.
Is the result libname pointing to a network drive or remote server? Network I/O can make things run a very long time and even win you a phone call from the network team.
A cross join in DATA Step can be done with nested loops and a point= on the inner loop SET. In DATA Step the outer loop is the implicit loop.
Consider this sample code:
data all_data;
do row = 1 to 100;
length name_firstname $20;
name_firstname
= repeat (byte(65 + mod(row,26)), 4*ranuni(123))
|| repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
;
output;
end;
run;
data not_identified;
do row = 1 to 10;
length name_firstname $20;
name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
output;
end;
run;
data lev33;
set all_data;
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
This approach tests each all_data row against every not_identified row before moving to the next all_data row. This is a useful method when all_data is very large and you might want to process chunks of it at a time. Chunk processing is an appropriate place to start macro coding:
%macro do_chunk (FROM_OBS=, TO_OBS=);
data lev33_&FROM_OBS._&TO_OBS;
set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
%mend;
%macro do_chunks;
%local index;
%do index = 1 %to 100 %by 10;
%do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
%end;
%mend;
%do_chunks
You might shepherd the whole process yourself, bypassing do_chunks and manually invoking do_chunk for various ranges of your choosing.
Thanks to @Richard. I have used your second example to write this code:
rsubmit;
data result.lev_D33;
set result.not_identified (firstobs=1 obs=10);
do check_row = 1 to 1000000;
set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
name_lev = complev (name_firstname, name_firstname_D3);
if name_lev <= 3 then output;
end;
run;
endrsubmit ;
And it worked like I wanted.
In this example, I compare name_firstname in the not_identified table to every name_firstname_D3 in data_all. If COMPLEV is less than or equal to 3, the two records are merged into the result table "lev_D33" (one record from not_identified is merged with one record from data_all).
As a test, I took 10 records from not_identified and tried to find matches for the names and first names in only 1000000 records of data_all.
I want to be able to create a flag, here called timeflag, that is set to 1 for every first and last entry of a certain Session, designated by logflag. What I have is the following but this gives me null data points:
data OUT.TENMAY_TIMEFLAG;
set IN.TENMAY_LOGFLAG;
if first.logflag then timeflag = 1;
if last.logflag then timeflag = 1;
run;
What is it about the first. and last. functions that I am not understanding here or is it that I have 2 if statements?
To have SAS create FIRST. and LAST. automatic variables you need to use a BY statement. If you want the new variable to be coded 1/0 then no need for the IF statement, just assign the automatic variable to a new permanent variable. To make one variable that is 1 for the first and the last then just use an OR.
data want;
set have;
by logflag ;
timeflag = first.logflag or last.logflag ;
run;
data OUT.TENMAY_TIMEFLAG;
set IN.TENMAY_LOGFLAG;
by logflag;
if first.logflag then timeflag = 1;
if last.logflag then timeflag = 1;
run;
P.S. in this case the dataset IN.TENMAY_LOGFLAG should be sorted by logflag.
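Putting the sort and the BY statement together, a minimal end-to-end sketch might look like this. The dataset names sessions, sessions_sorted and flagged and the sample rows are made up for illustration.

```sas
/* Hypothetical unsorted input: two sessions identified by logflag */
data sessions;
  input logflag event $;
  datalines;
2 a
1 b
1 c
2 d
1 e
;
run;

/* BY-group processing needs the data sorted (or grouped) by logflag */
proc sort data=sessions out=sessions_sorted;
  by logflag;
run;

/* timeflag is 1 on the first and last row of each logflag group, 0 otherwise */
data flagged;
  set sessions_sorted;
  by logflag;
  timeflag = (first.logflag or last.logflag);
run;
```

In this sample, session 1 has three rows, so its middle row gets timeflag = 0. Session 2 has only two rows, so both are flagged.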
I have a column in my SAS file named age and another column named agefinal. I want to replace the values in the age column with the values in the agefinal column for just one ID (that is, 5).
The code that I used was:
Data temp;
set temp;
if ID = 5;
then age = agefinal;
run;
I could not substitute the values. The values in age column did not change. I tried to run this code to check the character length of values since character type is numeric for both the columns.
Code:
Proc contents data = temp;
tables age agefinal;
run;
The output that I got was:
age : character length 3.
agefinal: character length $3
I would appreciate your suggestions.
Try removing the semicolon at the end of the if statement. Right now what you're doing is deleting all records where the id isn't equal to five.
Try setting the formats to be the same
data temp;
modify temp;
format age agefinal $3.;
run;
and then see if it will let you do the substitution.
The code you provided runs with an ERROR. Remove the extra semicolon and that may fix your issue:
/* ORIGINAL */
Data temp;
set temp;
if ID = 5;
then age = agefinal;
run;
/* CORRECTED */
Data temp;
set temp;
if ID = 5 /* REMOVED SEMICOLON */
then age = agefinal;
run;
Cheers
Rob