Can anyone help me to resolve this?
I have a very large raw dataset with a character variable that contains text strings along with numbers & dates defined in character format. Now I want to process the dataset and create a new numeric variable and populate values only when the text in the actual variable is either a number or a date value. Otherwise missing
RAWDATA:
ACTUAL_VARIABLE NEW_NUM_VARIABLE(Expected Values)
------------------ ---------------------------------
ODed on pills threw them all up - 2006
Y
1 1
5 5
ODed on pills
6 6
Less than once a week
N
N
2006-11-12 2006-11-12
Many Thanks in Advance
The easy way to do it (if you know the specific date format) is to use the input function.
09:27
If put(input(var,??yymmdd10.),yymmdd10.)=var then its a date!
else if input(var,best.) ne . then its a number.
Otherwiseits a character string.
This isn't as straightforward as it first looks, so I understand why it would be difficult to search for an answer. Just extracting a number is pretty easy, but when dates are included it becomes a bit more complicated (particularly when the format entered could change, e.g. yyyy-mm-dd, dd-mm-yyyy, dd/mm/yy etc).
One thing to note first. If you want to store the new values as a numeric field then you can't show a mix of numbers and dates. Dates are stored as numbers and formatted to show the date, but you can't apply a format at row level. Therefore I would suggest creating 2 new columns, 1 for numbers and 1 for dates.
My preferred approach is to use the anyalpha function to exclude any records with an alphabetic character, followed by the anypunct function to identify if a punctuation character exists (this should identify dates rather than just numbers). The anydtdte informat is then used to extract the date, this is a very useful informat as it reads dates stored in different ways (as per my note above).
There are clearly some caveats with this method.
If any numbers contain decimals then my method would incorrectly treat these as dates, therefore only integers will be assigned correctly.
It won't pick up dates that contain the month as words, e.g. 15-May-2015, as the anyalpha function would exclude them. They will need to contain numbers only, separated by any punctuation character.
Here's my code.
/* create initial dataset */
data have;
input actual_variable $ 50.;
datalines;
ODed on pills threw them all up - 2006
Y
1
5
ODed on pills
6
Less than once a week
N
N
2006-11-12
;
run;
/* extract dates and numbers */
data want;
set have;
if not anyalpha(actual_variable) then do; /* exclude records with an alphabetic character */
if anypunct(actual_variable) then new_date_variable = input(actual_variable,anydtdte10.); /* if a punctuation character exists then read in as a date */
else new_num_variable = input(actual_variable,best12.); /* else read in as a number */
end;
format new_date_variable yymmdd10.; /* show date field in required format */
run;
Related
I can not find the way to reverse text strings.
For example I want to reverse these:
MMMM121231M34 to become 43M132121MMMM
MM1M11M1 to become 1M11M1MM
1111213111 to become 1113121111
Judging from your examples, what you mean by 'rearrange' is actually 'reverse'.
In that case, you've got the very handy reverse() function in SAS.
Used in context:
data test;
length text $32;
infile datalines;
input text $;
result=reverse(strip(text));
datalines;
MMMM121231M34
MM1M11M1
1111213111
;
run;
EDIT on #Joe's request: in the particular example above, I create the test dataset by setting a length of 32 characters for the text variable. Therefore, when reading the values from datalines, these are padded with blanks up to that total of 32 characters. Hence, when reversing that value, the result has that many blanks at the start, followed by the actual value you are looking for. By adding the strip function, you remove the excess blanks from the value of text before reversing, keeping only the "real" value in the result.
I have to merge two dataset in SAS where in one the key variable is a number where the lentgh is 10 (for example). If the number is shorter than 10 I have a variable number of zeros. For example 00000056471.
While in the other dataset the number is simply 56471. I want to create another variable in the second dataset that add the variable number of zero and use that as key variable for the merge.
How can I fix?
Thank in advance
There are three approaches you could use:
data have;
input ID $10. ;
cards;
143134
12
14356677
12f
oh dear
;run;
data want;
set have;
/* if numeric then convert to number and back using z10 format */
newID=put(input(ID,8.),$z10.);
/* if alphanumeric, right align then replace spaces (will replace ALL spaces) */
newID2=translate(right(ID),'0',' ');
/* probably the best method */
newID3=repeat('0',10-length(ID)-1)||ID;
run;
Documentation on Zw.d format available here.
i have dataset a
data q7;
input trt$;
cards;
a150
b250
c300
400
abc180
;
run;
We have to create dataset b like this
trt dose
a150 150mg
b250 250mg
c300 300mg
400 400mg
abc180 180mg
new dose variable is added & mg is written after each
numeric values
here is my solution - Basically use the compress functions to keep (hence the 'k') only numbers from the trt variable. From there then is just the case of concatenating mg to numbers.
data want;
set q7;
dose = cats(compress(trt,'0123456789','k'),'mg');
run;
The compress function default behaviour is to return a character string with specified characters removed from the original string.
so
compress(trt,'0123456789') would have removed all numbers from the trt variable.
However compress comes with a battery of modifiers that let the user alter the default behaviour.
So in your case, we wanted to keep numbers regardless of the number of preceding letters so I used the modifier k to keep instead the list of characters in this case 012345679
For a full list of modifiers please read the following link
cats is one of the many functions SAS have to concatenate strings, so passing the compress argument as 1st string and mg as 2nd string will concatenate both to produce your desired result
hope it helps
I am looking for a way to convert the characters into numbers in SAS so that I can use the max function. Also, it would be helpful if the characters and only the numbers are kept. Below is a list of data for a column in a SAS table.
Column UNK
abc20140714
abc20140714x
abc20140714xyz
123_abc20140714_xyz
abc20150718
After stripping out the number values from the column, I would then group the data and use the max function in SAS, which should only generate the value 20150718.
To avoid any confusion, my question, is there a way to strip out the non-numeric values, and then convert the column into a numeric column so I can use the max function?
Thanks.
Sure!
var_num = input(compress(var_char,,'kd'),yymmdd8.);
Compress removes or keeps characters from a list. 'kd' says to 'keep digits'.
You then input using the appropriate informat; yymmdd8. looks right based on the data you provide. Then apply a format, format var_num yymmdd8n.; or similar, so it looks like a date visually (even if it's really a number underneath).
As pointed out, this won't work if there are other numeric digits in the values; you need to look at your data and identify how those appear and clean them out separately. You could use a regular expression for example to identify things that have 8 consecutive digits, starting with a 20; but ultimately it is a data analysis issue to handle these as your data require.
To get the first sequence of 8 digits in a row starting with a 1 or a 2 as a numeric value, you can use the following:
data want;
set have;
pos = prxmatch("/[12]\d{7}/", character_string);
if pos > 0 then number = input(substr(character_string, pos, 8), 8.);
else number = .;
drop pos;
run;
The prxmatch expression finds the starting position of the sequence, and the substr expression extracts the sequence, then the input function converts it to a numeric.
(Edited to incorporate Joe's feedback)
In the following code
data temp2;
input id 1 #3 date mmddyy11.;
cards;
1 11/12/1980
2 10/20/1996
3 12/21/1999
;
run;
what do 1 #3 symbols mean ? i presume 1 means that id is the first character in the data . I know that #3 means that date variable starts with the third character , but why is it in front of date whereas 1 is after id?
Because that's a badly written input statement. You can specify input in a number of ways, and that mixes a few different ways to do things which happen to be allowed to mix (mostly). Read the SAS documentation on input for more information.
Some common styles that you can use:
input #1 id $5.; *Formatted input. Allows specification of start position and informat, more useful if using date or other informat that is not just normal character/number.;
input id str $ otherstr $ date :date9.; *List input. This is for delimited text (like a CSV), still lets you specify informat.
input #'ID:' id $5.; *A special case of formatted input. allows you to parse files that include the variable name, useful for old style files and some xml/json/etc. type files;
input x 1-10 y 11-20; *Column input. Not used very commonly as it's less flexible than start/informat style.;
There are other options (such as named input) that are not very frequently used in my experience.
In your specific example, the first variable is read in with column input [id 1 says 'read a 1 character numeric from position 1 into id'] and then the second variable is read with formatted input [#3 date mmddyy11. says 'Read an 11 character date variable from position 3[-13] into a numeric using the date informat to translate it to a number.'] It also says someone gave you that code who isn't very familiar with SAS, since mmddyy10. is the correct informat - the 11th character cannot be helpful.