Have a variable called var1 that has two kinds of values (both as character strings). One is "ND" the other is a number out of 0-100, as a string. I want to convert "ND" to 0 and the character string to a numeric value, for example 1(character) to 1(numeric).
Here's my code attempt:
data cleaned_up(drop = exam_1);
set dataset.df(rename=(exam1=exam_1));
select (exam1);
when ('ND') do;
exam1 = 0;
end;
when ;
exam1 = input(exam_1,2.);
end;
otherwise;
end;
Clearly not working. What am I doing wrong?
A couple of problems with your code. Putting the rename statement as a dataset option against the input dataset will perform the rename before the data is read in. Therefore exam1 won't exist as it is now called exam_1. This will still be defined as a character column, so the input function won't work.
You need to keep the existing column, create a new numeric column to do the conversion, then drop the old column and rename the new one. This can be done as a dataset option against the output dataset.
The tranwrd function will replace all occurrences of 'ND' to '0', then using input with the best12 informat will read in all the data as numbers. You don't have to specify the length when reading numbers (i.e. 2. for 2 digits, 3. for 3 digits etc).
data cleaned_up (drop=exam1 rename=(exam_1=exam1));
set df;
exam_1 = input(tranwrd(exam1,'ND','0'),best12.);
run;
You are using select(exam1) while it should be select(exam_1). You can use select for this purpose, but I think simple if condition can solve this much easier:
data test;
length source $32;
do source='99', '34.5', '105', 'ND';
output;
end;
run;
data result(drop = convertedValue);
set test;
if (source eq 'ND') then do;
result = 0;
end;
else do;
convertedValue = input(source,??best.);
if not missing(convertedValue) then do;
if (0 <= round(convertedValue, 1E-12) <= 100) then do;
result = convertedValue;
end;
end;
end;
run;
input(source,??best.) tries to convert source to number and if it fails (e.g. values contains some word), it does not print an error and simply continues execution.
round(convertedValue,1E-12) is used to avoid precision error during the comparison. If you want to do it absolutely safely you have to use something like
if (0 < round(convertedValue,1E-12) < 100
or abs(round(convertedValue,1E-12)) < 1E-10
or abs(round(convertedValue-100,1E-12)) < 1E-10
)
Try to use ifc function then convert to numeric variable.
data have;
input x $3.;
_x=input(ifc(x='ND','0',x),best12.);
cards;
3
10
ND
;
Related
Some context:
I have a string of digits (not ordered, but with known range 1 - 78) and I want to extract the digits to create specific variables with it, so I have
"64,2,3" => var_64 = 1; var_02 = 2; var_03 = 1; (the rest, like var_01 are all set to missing)
I basically came up with two solutions, one is using a macro DO loop and the other one a data step DO loop. The non-macro solution was to fist initialize all variables var_01 - var_78 (via a macro), then to put them into an array and then to gradually set the values of this array while looping through the string, word-by-word.
I then realized that it would be way easier to use the loop iterator as a macro variable and I came up with this MWE:
%macro fast(w,l);
do p = 1 to &l.;
%do j = 1 %to 9;
if &j. = scan(&w.,p,",") then var_0&j. = 1 ;
%end;
%do j = 10 %to 78;
if &j. = scan(&w.,p,",") then var_&j. = 1 ;
%end;
end;
%mend;
data want;
string = "2,4,64,54,1,4,7";
l = countw(string,",");
%fast(string,l);
run;
It works (no errors, no warnings, expected result) but I am unsure about mixing macro-DO-loops and non-macro-DO-loops. Could this lead to any inconsistencies or should I just stay with the non-macro solution?
Your current code is comparing numbers like 1 to strings like "1".
&j. = scan(&w.,p,",")
It will work as long as the strings can be converted into numbers, but it is not a good practice. It would be better to explicitly convert the strings into numbers.
input(scan(&w.,p,","),32.)
You can do what you want with an array. Use the number generated from the next item in the list as the index into the array.
data want;
string = "2,4,64,54,1,4,7";
array var_ var_01-var_78 ;
do index=1 to countw(string,",");
var_[input(scan(string,index,","),32.)]=1;
end;
drop index;
run;
My data is more than 70,000. I have more than 50 variables. (Var1 to Var50). In each variable, there are about about 30 groups (I'll use a to z). I am trying to get a selection of data using if statements. I'd like to select every data with the same group. Eg data in var 1 to 30 with a, data with var 1 to 30 in b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="a" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didnt work and i was not sure if arrays work in the IF and then statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="C"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or I10_D10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR6="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="C"
or VAR19="A" or VAR19="B" or VAR19="C" or I10_D19="D"
or VAR20="A" or VAR20="B" or I10_D20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="C"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or I10_D30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update it's value, it will be false if any value in the var[] array is not the &value. I leave the loop early if we find 1 non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.
I have two datasets, both with same variable names. In one of the datasets two variables have character format, however in the other dataset all variables are numeric. I use the following code to convert numeric variables to character, but the numbers are changing by 490.6 -> 491.
How can I do the conversion so that the numbers wouldn't change?
data tst ;
set data (rename=(Day14=Day14_Character Day2=Day2_Character)) ;
Day14 = put(Day14_Character, 8.) ;
Day2 = put(Day2_Character, 8.) ;
drop Day14_Character Day2_Character ;
run;
Your posted code is confused. Half of it looks like code to convert from character to numeric and half looks like it is for the other direction.
To convert to character use the PUT() function. Normally you will want to left align the resulting string. You can use the -L modifier on the end of the format specification to left align the value.
So to convert numeric variables DAY14 and DAY2 to character variables of length $8 you could use code like this:
data want ;
set have (rename=(Day14=Day14_Numeric Day2=Day2_Numeric)) ;
Day14 = put(Day14_Numeric, best8.-L) ;
Day2 = put(Day2_Numeric, best8.-L) ;
drop Day14_Numeric Day2_Numeric ;
run;
Remember you use PUT statement or PUT() function with formats to convert values to text. And you use the INPUT statement or INPUT() function with informats to convert text to values.
Change the format to something like Best8.2:
data tst ;
set data (rename=(Day14=Day14_Character Day2=Day2_Character)) ;
Day14 = put(Day14_Character, best8.2) ;
Day2 = put(Day2_Character, best8.2) ;
drop Day14_Character Day2_Character ;
run;
Here is an example:
data test;
input r ;
datalines;
500.04
490.6
;
run;
data test1;
set test;
num1 = put(r, 8.2);
run;
If you do not want to specify the width and number of decimal points you can just use the BEST. informat and SAS will automatically assign the width and decimals based on the input data. However the length of the outcome variable may be large unless you specify it explicitly. This will still retain your numbers as in the original variable.
I want to be able to convert an entire column of dates this way. For example, 01/01/2017 to January 1, 2017. I realize there is a convoluted way of doing this but I am not entirely sure how i'd approach that logically. Also, does there happen to be a SAS format that does this? Thanks.
There does happen to be a format you can use. Here is a worked example using test data:
data test;
input datestring $;
datalines;
01/01/2017
;
run;
Using input converts the string value into a SAS date, and then the put function is used to create a character variable holding the representation you are looking for:
data test2;
set test;
date_as_date = input(datestring,ddmmyy10.);
date_formatted = put(date_as_date,worddate20.);
run;
The number 20 is used to describe a length that is long enough to hold the full value, using a lower number may result in truncation, e.g.
date_formatted = put(date_as_date,worddate3.);
put date_formatted=;
dateformatted=Jan
In some cases, the desired date format may NOT exist (in this case, it does 'worddate20.'), but as an example...
You could either write a function-style macro to convert a SAS date to "monname + day, year" format, e.g.
%MACRO FULLMDY(DT) ;
catx(', ',catx(' ',put(&DT,monname.),put(&DT,day.)),put(&DT,year4.))
%MEND ;
data example1 ;
dt = '26jul2017'd ;
fulldate = %FULLMDY(dt) ;
run ;
Or, you could build a custom format, covering all the dates which may exist in your data, e.g.
data alldates ;
retain fmtname 'FULLMDY' type 'N' ;
do dt = '01jan1900'd to '01jan2100'd ;
mdy = catx(', ',catx(' ',put(dt,monname.),put(dt,day.)),put(dt,year4.)) ;
output ;
end ;
rename dt = start
mdy = label ;
run ;
proc format cntlin=alldates ; run ;
data example2 ;
dt = '26jul2017'd ;
format dt fullmdy. ;
run ;
I'm new to sas and I have the following problem.
I have a variable that stores time but is a character, format $50. It looks like 30 min, 1.5 h, 5 h, 10 h. I need to convert it to numeric and calculate time in hours. I tried substrn function to extract numbers. but substrn(var, 1,2) gives 30, 1(instead of 1.5), 5, 10 and substrn(var, 1,3) gives 30, 1.5, .(instead of 5), 10. How to solve it?
Any help is appreciated.
Conversion from character to numeric is usually done using the input function. The second argument passes the expected informat (a rule telling SAS how to interpret the input).
You can use compress function (with the "k" option to keep rather than discard characters) to get just the numeric part of the character variable. Compress will remove certain characters from a value; the first argument passes the string for it to work on, the second argument lists the characters to remove, the third argument passes additional options (here "d" to add numerals to the list of characters to remove and "k" to invert the process. i.e. keep rather than remove the selected characters).
And, the index function can be used to identify the times when the string contains "m" for minutes. Index will return the position of the first occurrence of the the search string within the input. In the case if the input does not contain "m" it will return 0 and evaluate as FALSE in the if statement.
/* Create some input data */
data temp;
input time : $20.;
datalines;
1.5h
30min
120min
4.25hour
;
run;
data temp2;
set temp;
/* Extract only the numeric part of the string and convert to numeric */
newTime = input(compress(time, ".","dk"), best9.);
/* Check if the string contains the letter "m" and if so divide by 60 */
if index(time, "m") then newTime = newTime / 60;
run;
proc print;
run;
There's probably a way to create a custom informat that would deal with this, which I expect Joe or one of the other regulars here can advise you on. However, failing that, here's a function-based approach:
data have;
input time_raw $1-50;
cards;
30 min
1.5 h
5 h
10 h
;
run;
data want;
set have;
if index(time_raw, 'min') then do;
minutes = input(substr(time_raw,1,length(time_raw) - 4), 8.);
hours = 0;
end;
else do;
hours = input(substr(time_raw, 1, length(time_raw) - 2), 8.);
minutes = 0;
end;
format time time.;
time = hms(hours, minutes, 0);
run;