I am attempting to clean some data. Under a variable 'Education Level' I have multiple observations referring to holding a master's degree. Eg. "Masters" "Masters Degree" "Master's Degree". I have organized these into one value: "Masters Degree" using IF-THEN statements. However, I have another entry with trailing blanks named "Masters Degree " that isn't being picked up by the IF-THEN statements. How can I trim this down?
I've researched some functions to deal with this such as TRIM() but I don't really understand how I can implement these as I am new to SAS.
This is how I have been attempting to tidy my data and format I have used for the previous variables:
data libref.name;
set libref.name;
if Var1 = "Masters" then Var1 = "Masters Degree";
if Var1 = "Master's" then Var1 = "Masters Degree";
if Var1 = "Master Degree " then Var1 = "Masters Degree";
run;
I simply want to convert "Master Degree " observations into "Masters Degree"
That can't be the problem, because "Master Degree " is the same as "Master Degree" in a comparison:
data _NULL_;
if "Master Degree " = "Master Degree" then put "EQUAL";
else put "DIFFERENT";
run;
Will output:
8 data _NULL_;
9 if "Master Degree " = "Master Degree" then put "EQUAL";
10 else put "DIFFERENT";
11 run;
EQUAL
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
That is because SAS ignores trailing blanks in comparisons: the shorter value is padded with blanks to the length of the longer one before the two are compared. SAS also pads with trailing blanks whenever you assign a string to a character variable that is longer than the string.
However, you said the other value is "Masters Degree", and that's different from "Master Degree".
If you want to convert everything that begins with "Master", add a colon after the comparison operator (=:). The comparison then uses only as many characters as the shorter operand has, so it works like a prefix match.
if Var1 =: "Master" then Var1 = "Masters Degree";
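Since trailing blanks are ignored in comparisons, a value that still fails to match may contain a leading blank instead, and leading blanks are not ignored. A minimal sketch, assuming that is the cause; the dataset names work.demo and work.cleaned and the sample values are made up for illustration:

```sas
/* Hypothetical sample data; $char20. keeps the leading blank on the third row */
data work.demo;
  length Var1 $20;
  input Var1 $char20.;
  datalines;
Masters
Master's Degree
 Masters Degree
Bachelors
;
run;

data work.cleaned;
  set work.demo;
  /* STRIP removes leading and trailing blanks before the prefix comparison */
  if strip(Var1) =: "Master" then Var1 = "Masters Degree";
run;
```

TRIM() only removes trailing blanks, so it would not help in that case; STRIP() handles both ends.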
For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use perl-regular expressions to have like a template of the outline of the date/datetime and have it search through the values for that column and if they don't match to add -XX if missing day or -XX-XX if missing day and month etc. I was thinking of utilising prxchange but how do you incorporate the template so it knows to add -XX in the correct position where applicable.
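Use SUBSTR on the left side of the assignment: start from a template of placeholder characters and overwrite its leading characters with ECDTC.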
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'XXXX-XX-XX';
else adtmc = 'XXXX-XX-XX XX:XX';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;
Honestly, I wouldn't; regex are not faster for the most part than just straight-up checking with normal code, for simple things like this. If you have time pressure, or thousands or millions of rows... not a good idea, just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input @1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;
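The non-regex route mentioned above could be sketched with SCAN and COALESCEC. This is only an illustration against the sample dataset have; the output dataset want_scan and the helper variables _date and _time are hypothetical names, and nothing here validates the input (the 'junk' row comes out as 'junk-XX-XX').

```sas
data want_scan;
  set have;
  length adtmc $16 _date $10 _time $5;
  /* split into date and time parts on the blank */
  _date = scan(ecdtc, 1, ' ');
  _time = scan(ecdtc, 2, ' ');
  /* SCAN returns blank for a missing element, so COALESCEC supplies the X's */
  adtmc = catx('-',
               coalescec(scan(_date, 1, '-'), 'XXXX'),
               coalescec(scan(_date, 2, '-'), 'XX'),
               coalescec(scan(_date, 3, '-'), 'XX'));
  /* append the time part only when one is present */
  if _time ne ' ' then
    adtmc = catx(' ', adtmc,
                 catx(':', coalescec(scan(_time, 1, ':'), 'XX'),
                           coalescec(scan(_time, 2, ':'), 'XX')));
  drop _date _time;
run;
```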
I would like to create one macro called 'currency_rate' which calls the correct value depending on the conditions stipulated
(either 'new_rate' or static value of 1.5):
%macro MONEY;
%Do i=1 %to 5;
data get_currency_&i (keep=month code new_rate currency_rate);
set Table1;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
run;
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', ???);
run;
%End;
%mend MONEY;
%MONEY
I am happy with the do loop and first data step. It is the call symput I am stuck on. Is call symput the correct function to use, to assign two possible values to one macro?
A snippet example of the way I will be using 'currency_rate' in a proc sql:
t1.income/&currency_rate.
I am a beginner level SAS user, any guidance would be great!
Thanks
Let's simulate your case. Suppose we have 3 datasets, as shown below -
data get_currency_1;
input month code $ new_rate currency_rate;
cards;
1 USD 2 2
2 CHF 2 1.5
3 GBP 1 1.5
;
data get_currency_2;
input month code $ new_rate currency_rate;
cards;
1 USD 3 1.5
2 USD 4 4
3 JPY 0.5 1.5
;
data get_currency_3;
input month code $ new_rate currency_rate;
cards;
1 USD 1 1.5
2 USD 3 1.5
3 USD 2.5 2.5
;
Now, let's run your code where we assign a value to currency_rate.
Let i=1, so the dataset get_currency_1 is read. As the step runs, every row is processed in turn and the value of currency_rate is assigned to the macro variable currency_rate on each iteration. When the data step ends, the macro variable holds the value from the last row.
%let i=1; /*Let's assign 1 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 1.5
Let i=3:
%let i=3; /*Let's assign 3 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 2.5
You cannot have multiple values on one macro variable.
You say you are a beginner, so the best course of action is to avoid macro programming at this point. You would be better served learning about where, merge (or join) and by statements.
You state you will need to use a currency_rate in a statement such as
t1.income / &currency_rate.
The t1. to me suggests t1 is an alias in a SQL join and thus the far more likely scenario is that you need to left join table t1 that contains incomes with table1 (call it monthly_datum) that contains the monthly currency rates.
select
t1.income / coalesce(monthly_datum.currency_rates,1.5)
, …
from
income_data as t1
left join
monthly_datum
on t1.month = monthly_datum.month
The rate of 1.5 would be used when the income is associated with a month that is not present in monthly_datum.
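A self-contained sketch of that join, with made-up sample rows. The table names follow the answer above; the column name currency_rate, the alias m, and the output table converted are assumptions for illustration only.

```sas
/* Hypothetical incomes; month 4 has no rate on file */
data income_data;
  input month income;
  datalines;
1 300
2 150
4 90
;
run;

/* Hypothetical monthly rates */
data monthly_datum;
  input month currency_rate;
  datalines;
1 2
2 4
;
run;

proc sql;
  create table converted as
  select t1.month,
         t1.income,
         /* fall back to 1.5 when the left join finds no rate */
         t1.income / coalesce(m.currency_rate, 1.5) as income_converted
  from income_data as t1
  left join monthly_datum as m
    on t1.month = m.month;
quit;
```

Every row of income_data is kept; months without a match get the default rate instead of being dropped.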
A macro variable can only hold a single value.
Since you're only ever assigning a single value, you can easily use CALL SYMPUTX().
call symputx('currency_rate', currency_rate);
But if your data has more than one row, then the value will be the last value set in the data set.
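One way to get "new_rate if there is a match, otherwise 1.5" into a single macro variable is to set the default first and overwrite it only when a matching row is read. A minimal sketch, assuming the sample dataset get_currency_1 from the earlier answer exists:

```sas
%let i = 1;
%let currency_rate = 1.5;   /* default, kept when no row matches */

data _null_;
  set get_currency_&i;
  /* only rows for this month in USD reach CALL SYMPUTX */
  where month = &i and code = 'USD';
  call symputx('currency_rate', new_rate);
run;

%put Currency rate is: &currency_rate;
```

With get_currency_1 the only matching row has new_rate 2, so the macro variable is overwritten with 2. If several rows matched, the last one would win.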
I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB_TRANS,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;
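Applied to the T_BLOB example, that loop might look like the sketch below. The input dataset have, the output name split_blob, and the variable T_BLOB_VALUE are illustrative, and the single sample row is typed in by hand. By default SCAN and COUNTW treat a run of consecutive delimiters as one, so the eleven equals signs act as a single separator, provided no value contains an equals sign of its own.

```sas
/* Hand-typed sample row; '|' only separates the two input fields */
data have;
  length SYSTEM_ID $8 T_BLOB $200;
  infile datalines dlm='|' truncover;
  input SYSTEM_ID T_BLOB;
  datalines;
GID_1|2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
;
run;

data split_blob;
  set have;
  length T_BLOB_VALUE $200;
  /* one output row per '='-delimited piece */
  do i = 1 to countw(T_BLOB, '=');
    T_BLOB_VALUE = scan(T_BLOB, i, '=');
    output;
  end;
  drop i T_BLOB;
run;
```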
I have these two datasets here:
data ONE;
input ID LastName $ FirstInit $ 1.;
datalines;
509182793 Smith C
319861601 Williams J
345121778 Connor F
480863211 King L
907636280 Franklin D
729082859 Monroe T
835688938 Hall K
;
run;
data TWO;
input ID $ 11. State $ 2.;
datalines;
334-99-5246 TX
480-86-3211 MD
449-55-9407 VA
345-12-1778 GA
907-63-6280 NY
790-09-9813 WY
319-86-1601 FL
;
run;
I have two questions:
1) How would you use COMPRESS to create a new character variable, "ncv" and set the value of ncv to be the value of the character variable ID with the hyphens removed? Here's my attempt:
data TWO_NUMERIC;
set TWO;
ncv=COMPRESS(TWO, "+-", "d");
run;
2) How would you use PUT/INPUT to convert ncv to a numerical value to create a numeric variable, "newncv"
data TWO_NUMERIC;
set TWO;
put(TWO,z6.);
run;
To start off with these two questions, I start off with the DATA step and SET statements:
data TWO_NUMERIC;
set TWO;
run;
I looked SAS 9.2's help page but the use of these two statements in their example code seems to confuse me.
Ok, I was going to say RTM, but in this case it's not clear, at least not in my opinion.
Your mistake with compress is that the first parameter should be the variable, in this case ID, not the dataset TWO. In addition, you only need to specify the - in your list, not +, unless you think there might be a + in the variable as well. Adding the modifier D adds digits to the list of characters to remove, which is the opposite of what you want.
Similar concept with PUT/INPUT, reference the variable and make sure you're using the correct function, in this case, input to convert it to numeric.
Data two_numeric;
set two;
ncv=COMPRESS(ID, "-");
ncv_num=input(ncv, 12.);
run;
Compress can be used in multiple ways. One way is described by @Reeza above; another is to use the "k" modifier, which means "keep", as shown below.
data TWO_NUMERIC;
set TWO;
ncv_d=COMPRESS(ID," ", "kd"); * kd means keep digits, your code had TWO which is a dataset name;
ncv_n=COMPRESS(ID," ", "kn"); * kn keeps digits, underscores and English letters (not only numbers);
/* The INPUT function is used to convert CHAR to NUM. *
* The best. informat reads standard numeric values. */
newncv=input(ncv_d,best.);
run;
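A quick way to see both approaches side by side is to run them on a single literal value and write the results to the log. This is just a sketch; the variable names are arbitrary.

```sas
data _null_;
  id = '480-86-3211';
  ncv    = compress(id, '-');      /* remove the hyphens              */
  ncv_kd = compress(id, , 'kd');   /* keep digits only                */
  newncv = input(ncv, 9.);         /* character to numeric with INPUT */
  put ncv= ncv_kd= newncv=;
run;
```

Both character results contain the nine digits, and newncv holds the same value as a number.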
The link I found useful to explain the K modifier is http://www.amadeus.co.uk/sas-training/tips/1/1/11/the-enhanced-compress-function.php
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example, the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of the data points in there are "#N/A"s, say more than 50%, more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop. The reason is that I do not want to use the variable names in the drop statement. Instead, I want it done in a loop where I can reference the variable names with the looper "i". I tried to use the array "myVars(i)" but it doesn't seem to work with the drop statement.
My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get frequency table, then look at that dataset to find the rows with > 50% of the rows and an invalid value in the F_ column.
This won't work with actual missing (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
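To make the threshold and the list of invalid values configurable, the same steps could be parameterised with macro variables. The sketch below assumes the dataset dat from above; the macro variable names threshold, badvalues and droplist, and the dataset bad_rows, are made up for illustration. It also assumes SAS 9.4M5 or later, because it uses %IF in open code to skip the DROP statement when nothing qualifies.

```sas
%let threshold = 50;                            /* percent cutoff             */
%let badvalues = '#N/A' 'N.A.' 'Invalid Obs';   /* values treated as invalid  */
%let droplist  = ;                              /* stays empty if no variable qualifies */

ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
  tables _all_;          /* every variable, no names listed by hand */
run;
ods exclude none;

/* keep only the frequency rows whose value is in the invalid list */
data bad_rows;
  set freq_tables;
  length varname $32 value $200;
  varname = substr(table, 7);
  value = strip(coalescec(of f_:));
  if value in (&badvalues);
run;

/* sum the invalid percentages per variable and compare with the cutoff */
proc sql noprint;
  select varname into :droplist separated by ' '
  from bad_rows
  group by varname
  having sum(percent) > &threshold;
quit;

data dat_fixed;
  set dat;
  %if %length(&droplist) > 0 %then %do;
    drop &droplist;
  %end;
run;
```

As noted above, true missing values are not counted unless the MISSING option is added to the TABLES statement.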